michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
https://michaelfeil.eu/infinity/
MIT License

Dynamic loading - different models at request time / multiple models #151

Open cduk opened 3 months ago

cduk commented 3 months ago

Instead of running one instance per model in the Dockerfile, can a list of models be provided at instantiation, with the model then chosen via the API request? The current API already has model as a parameter.

michaelfeil commented 3 months ago

Interesting idea:

/models -> List all current models {"BAAI/bge":""}
/embedding -> Check if "BAAI/bge" is in the list of models. Do not deploy dynamically.
/rerank
/state/load -> "jinaai/embed-v2" -> add to models, add max dynamic ones to
/state/unload -> Chan

Idea: Do not add inside /embedding -> That would be a huge mess. Perhaps Drawbacks:

Summary: If this comment gets 10 upvotes and no further concerns, I'll build it. It's a heavyweight feature that I would prefer to move into a separate service.
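To make the endpoint list above more concrete, here is a minimal sketch of the state such a service might keep. It is not Infinity's implementation: `ModelRegistry`, `toy_loader`, and the `max_models` cap are hypothetical names chosen for illustration, and the loader returns a toy embedder instead of a real model. The key point it illustrates is that embedding only checks the list of loaded models and never loads one implicitly.

```python
from typing import Callable, Dict, List

# Hypothetical type: an embedder maps a list of texts to a list of vectors.
Embedder = Callable[[List[str]], List[List[float]]]


def toy_loader(model_name: str) -> Embedder:
    """Stand-in for real model loading (assumption, not Infinity code)."""
    def embed(texts: List[str]) -> List[List[float]]:
        # Toy "embedding": text length and model-name length.
        return [[float(len(t)), float(len(model_name))] for t in texts]
    return embed


class ModelRegistry:
    """Tracks loaded models; mirrors /models, /state/load, /state/unload."""

    def __init__(self, max_models: int = 2,
                 loader: Callable[[str], Embedder] = toy_loader) -> None:
        self._max_models = max_models  # assumed cap on loaded models
        self._loader = loader
        self._models: Dict[str, Embedder] = {}

    def list_models(self) -> List[str]:
        # Would back a /models endpoint.
        return sorted(self._models)

    def load(self, name: str) -> None:
        # Would back /state/load; refuses to exceed the cap.
        if name in self._models:
            return
        if len(self._models) >= self._max_models:
            raise RuntimeError("model limit reached; unload a model first")
        self._models[name] = self._loader(name)

    def unload(self, name: str) -> None:
        # Would back /state/unload; raises KeyError if the model is not loaded.
        del self._models[name]

    def embed(self, name: str, texts: List[str]) -> List[List[float]]:
        # Would back /embedding: only a membership check, no dynamic loading.
        if name not in self._models:
            raise KeyError(f"model {name!r} is not loaded")
        return self._models[name](texts)
```

With this shape, a request for a model that was never loaded fails fast with an error, which keeps the loading logic out of the embedding path.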

cduk commented 3 months ago

The simpler way would be to not deal with loading and unloading at all: require that all models fit in VRAM, and then select which one to use in the API call.

michaelfeil commented 3 months ago

So basically add multiple models in the CLI at startup?

cduk commented 3 months ago

Exactly!
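The simpler variant agreed on above could look roughly like the sketch below: every model is named on the command line, all of them are loaded once at startup, and each request picks one through its model field. Everything here is an assumption made for illustration, including the repeatable `--model` flag, the `load_model` stub, and the request and response dictionaries; none of it is taken from Infinity's actual CLI or API.

```python
import argparse
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical type: an embedder maps a list of texts to a list of vectors.
Embedder = Callable[[List[str]], List[List[float]]]


def load_model(name: str) -> Embedder:
    """Stand-in for real model loading (assumption, not Infinity code)."""
    return lambda texts: [[float(len(t))] for t in texts]


def parse_args(argv: Sequence[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="multi-model startup sketch")
    # Hypothetical flag: repeat --model once per model to serve.
    parser.add_argument("--model", action="append", required=True,
                        dest="models", help="model name; may be repeated")
    return parser.parse_args(argv)


def build_models(names: Sequence[str]) -> Dict[str, Embedder]:
    # Load everything up front; all models are assumed to fit in VRAM.
    return {name: load_model(name) for name in names}


def handle_embedding_request(models: Dict[str, Embedder],
                             request: Dict[str, Any]) -> Dict[str, Any]:
    # Select the model named in the request; never load at request time.
    name = request["model"]
    if name not in models:
        return {"status": 404, "error": f"unknown model {name!r}"}
    return {"status": 200, "model": name, "data": models[name](request["input"])}
```

For example, calling `parse_args(["--model", "BAAI/bge", "--model", "jinaai/embed-v2"])` and passing the resulting `models` list to `build_models` yields a dictionary with both models, and a request naming any other model is rejected instead of triggering a load.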