Open royallavanya140 opened 1 month ago
The problem is that you can't batch forward passes with different LoRA settings. Applying a LoRA effectively changes the weights of the model. It's a temporary change via a low-rank overlay, but it's still equivalent to swapping out the model for a different one. That makes sense as long as there aren't any requests in the queue, but while requests are processing, I don't know how the framework should interpret such a swap.
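To make that concrete, here is a minimal sketch (plain PyTorch, not ExLlamaV2's actual kernels) of why applying an adapter is equivalent to swapping the weights for every sequence in the batch:

```python
import torch

# Base linear weight of shape (out_features, in_features)
W = torch.randn(4096, 4096)

# LoRA factors: A (r x in), B (out x r), with rank r much smaller than in/out
r, alpha = 16, 32
A = torch.randn(r, 4096) * 0.01
B = torch.zeros(4096, r)

# Applying the adapter is mathematically the same as installing a new weight:
#   W_eff = W + (alpha / r) * B @ A
W_eff = W + (alpha / r) * (B @ A)

x = torch.randn(1, 4096)
y_base = x @ W.T      # output without the adapter
y_lora = x @ W_eff.T  # output with the adapter applied

# Every sequence in a batched forward pass shares the same W_eff, which is why
# two requests that want different LoRA settings cannot share a batch.
```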
@turboderp, to be clear: would it be possible to apply LoRA weights if you wait until there are no requests in the queue? For the dynamic generator, is it as simple as calling generator.set_loras(lora) when the queue is empty (see the sketch below), or are there additional considerations in play?
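A minimal sketch of that pattern, assuming the dynamic generator exposes num_remaining_jobs() and set_loras() the way the other generators do (treat both as assumptions to check against your exllamav2 version):

```python
from exllamav2 import ExLlamaV2Lora

# Defer the LoRA swap until the dynamic generator's queue has drained.
# pending_lora_path is set elsewhere (e.g. by a request handler) when new
# adapter weights arrive.
pending_lora_path = None

def maybe_swap_lora(generator, model):
    global pending_lora_path
    if pending_lora_path is None:
        return
    if generator.num_remaining_jobs() > 0:
        return  # jobs still in flight; try again on the next iteration
    lora = ExLlamaV2Lora.from_directory(model, pending_lora_path)
    generator.set_loras(lora)
    pending_lora_path = None

# In the serving loop, only touch the adapter between jobs:
# while True:
#     results = generator.iterate()
#     ...handle results...
#     maybe_swap_lora(generator, model)
```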
As for multiple LoRAs at runtime, it is theoretically possible and has been done by S-LoRA (https://github.com/S-LoRA/S-LoRA); this is how vLLM lets you run multiple adapters at runtime. However, integrating that approach into exllama seems like a massive undertaking.
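For context, a simplified illustration of the S-LoRA-style idea (nothing like the real gathered/paged kernels): the base weight is shared across the batch, and each request only contributes its own low-rank correction, indexed per row.

```python
import torch

hidden, r, n_adapters, batch = 4096, 16, 3, 4

W = torch.randn(hidden, hidden)                # shared base weight
A = torch.randn(n_adapters, r, hidden) * 0.01  # per-adapter A factors
B = torch.randn(n_adapters, hidden, r) * 0.01  # per-adapter B factors

x = torch.randn(batch, hidden)
adapter_ids = torch.tensor([0, 2, 1, 0])       # which adapter each request uses

base_out = x @ W.T                             # one shared GEMM for the whole batch

# Per-request low-rank correction: gather each request's factors and apply them.
Ax = torch.einsum("bh,brh->br", x, A[adapter_ids])      # (batch, r)
delta = torch.einsum("br,bhr->bh", Ax, B[adapter_ids])  # (batch, hidden)

y = base_out + delta
```

The base GEMM stays batched, and only the small low-rank terms differ per request, which is what makes multi-adapter batching feasible at all.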
OS
Linux
GPU Library
CUDA 12.x
Python version
3.12
Pytorch version
2.3.1
Model
mistral-v0.3-instruct
Describe the bug
I host an LLM using FastAPI and accept LoRA weights from users, but I may receive new weights while the model is busy generating. Is there any way to update the weights without disturbing the current generation?
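A minimal sketch of the serving pattern described (the endpoint names and the shared state dictionary are placeholders, not the actual server code): record the incoming adapter request and leave the swap to the generation loop once the queue is empty.

```python
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
state = {"pending_lora": None, "active_lora": None}
state_lock = asyncio.Lock()

class LoraUpdate(BaseModel):
    path: str  # directory containing the uploaded adapter

@app.post("/lora")
async def queue_lora_update(update: LoraUpdate):
    # Never swap weights here: generation may be in flight.
    async with state_lock:
        state["pending_lora"] = update.path
    return {"status": "pending", "path": update.path}

@app.get("/lora")
async def current_lora():
    async with state_lock:
        return {"active": state["active_lora"], "pending": state["pending_lora"]}
```

The generation loop would then load and apply the pending adapter (e.g. with the set_loras pattern sketched earlier) only once no jobs remain in the generator queue.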
Reproduction steps
Expected behavior
LoRAs cannot be updated while there are jobs in the generator queue
Logs
No response
Additional context
No response
Acknowledgements