Dynamic loading - different models at request time / multiple models

cduk commented 3 months ago

Instead of running one instance per model in the Dockerfile, could a list of models be provided at instantiation, with the model then chosen via the API request? The current API already has model as a parameter.

michaelfeil commented 3 months ago

Interesting idea:

What parameters would you launch the model with (always the same?)
Would you prefer to launch multiple models at a time?
How long would you keep a model "active" before "unloading it"?
What revision?
What to do if a user requests, e.g., an ONNX repo, but the requested model has no ONNX files?

/models -> List all current models {"BAAI/bge":""} 
/embedding -> Check if "BAAI/bge" is in the list of models. Do not deploy dynamically.
/rerank
/state/load -> "jinaai/embed-v2" -> add to models, add max dynamic ones to
/state/unload -> Chan

Idea: Do not add the loading logic inside /embedding -> That would be a huge mess. Possible drawbacks:

What happens on unload if there are requests in progress?
It's hard to preserve the state -> This would be STATEFUL -> How to do that in k8s? What happens if you have a load balancer? What about multiple replicas?

Summary: If this comment gets 10 upvotes and no further concerns, I'll build it. It's a heavyweight feature that I would prefer to move into a separate service.
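
For illustration only, the unload question above (what to do with requests still in progress) could be handled inside a single process by counting in-flight requests per model. This is a rough sketch, not infinity's API: `ModelRegistry`, `load`, `use`, `unload`, `max_models` and the `loader` callback are all hypothetical names. It does nothing for the k8s / load balancer / multiple replicas concern, since each replica would still hold its own state.

```python
import threading
from contextlib import contextmanager


class ModelRegistry:
    """Hypothetical in-process registry (illustrative, not infinity's API)."""

    def __init__(self, max_models=2):
        self._cond = threading.Condition()
        self._models = {}     # model name -> loaded model object
        self._in_flight = {}  # model name -> number of running requests
        self._max_models = max_models

    def list_models(self):
        with self._cond:
            return sorted(self._models)

    def load(self, name, loader):
        # Loading under the lock keeps the sketch short; a real service
        # would likely load outside the lock because it can be slow.
        with self._cond:
            if name in self._models:
                return
            if self._in_flight.get(name, 0) > 0:
                raise RuntimeError(f"model {name!r} is still draining")
            if len(self._models) >= self._max_models:
                raise RuntimeError("model limit reached; unload one first")
            self._models[name] = loader(name)
            self._in_flight[name] = 0

    @contextmanager
    def use(self, name):
        # Register the request so that unload() can wait for it.
        with self._cond:
            if name not in self._models:
                raise KeyError(f"model {name!r} is not loaded")
            self._in_flight[name] += 1
            model = self._models[name]
        try:
            yield model
        finally:
            with self._cond:
                self._in_flight[name] -= 1
                self._cond.notify_all()

    def unload(self, name, timeout=None):
        """Reject new requests, then wait for running ones to finish.

        Returns True if the model drained within the timeout.
        """
        with self._cond:
            if name not in self._models:
                raise KeyError(f"model {name!r} is not loaded")
            del self._models[name]  # new requests now get a KeyError
            drained = self._cond.wait_for(
                lambda: self._in_flight[name] == 0, timeout
            )
            if drained:
                del self._in_flight[name]
            return drained
```

A request handler would wrap its work in `with registry.use("BAAI/bge") as model:`, so an unload issued meanwhile waits for that block to exit instead of pulling the model out from under it.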

cduk commented 3 months ago

The simpler way would be not to deal with loading and unloading at all: require that all models fit in VRAM, and then select which one to use in the API call.

michaelfeil commented 3 months ago

So basically add multiple models in the CLI at startup?

cduk commented 3 months ago

Exactly!
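
A minimal sketch of that simpler variant, assuming every model is loaded once at startup and the request's model parameter only selects among them. `MultiModelServer`, `fake_loader` and the callable "engine" objects are placeholders invented for this example; they are not infinity's actual classes or CLI.

```python
class MultiModelServer:
    """Hypothetical static multi-model dispatcher (illustrative only)."""

    def __init__(self, model_names, loader):
        # Load everything up front; all models must fit in memory together.
        self._models = {name: loader(name) for name in model_names}

    def list_models(self):
        return list(self._models)

    def embed(self, model, sentences):
        # Select by the request's model parameter; never load on demand.
        try:
            engine = self._models[model]
        except KeyError:
            raise ValueError(
                f"unknown model {model!r}; available: {self.list_models()}"
            ) from None
        return engine(sentences)


def fake_loader(name):
    """Placeholder loader: returns a callable that fakes embeddings."""
    def engine(sentences):
        return [[float(len(s)), float(len(name))] for s in sentences]
    return engine


server = MultiModelServer(["BAAI/bge", "jinaai/embed-v2"], fake_loader)
vectors = server.embed("BAAI/bge", ["hello world"])
```

Because the set of models is fixed at startup, every replica behind a load balancer holds the same models, which sidesteps the statefulness concerns raised earlier in the thread.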

michaelfeil / infinity

Dynamic loading - different models at request time / multiple models #151