turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[BUG] Appending-Runtime-LoRA-weights #656

Open royallavanya140 opened 6 days ago

royallavanya140 commented 6 days ago

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Pytorch version

2.3.1

Model

mistral-v0.3-instruct

Describe the bug

I host an LLM with FastAPI and accept LoRA weights from users, but the new weights can arrive while the model is busy generating. Is there any way to apply the weights without disturbing the current generation?

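For context, a minimal sketch of the setup being described, assuming exllamav2's dynamic generator. The model and adapter paths are placeholders, and `generator.set_loras()` is assumed here from the library's LoRA examples (it is the call that raises the error quoted under "Expected behavior"); exact names may differ by version.

```python
# Sketch: applying a user-supplied LoRA to an exllamav2 dynamic generator.
# Paths are placeholders; set_loras() is assumed from the library's examples.

from exllamav2 import (
    ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer, ExLlamaV2Lora
)
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/mistral-v0.3-instruct-exl2")   # placeholder path
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model = model, cache = cache, tokenizer = tokenizer)

# Load the uploaded adapter and attach it to the generator. If any generation
# jobs are still queued, this is where the library refuses with
# "LoRAs cannot be updated while there are jobs in the generator queue".
lora = ExLlamaV2Lora.from_directory(model, "/path/to/uploaded_lora")  # placeholder path
generator.set_loras(lora)
```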

Reproduction steps

Expected behavior

LoRAs cannot be updated while there are jobs in the generator queue

Logs

No response

Additional context

No response

Acknowledgements

turboderp commented 6 days ago

The problem is that you can't batch forward passes with different LoRA settings. Applying a LoRA effectively changes the weights of the model. It's a temporary change via a low-rank overlay, but it's still effectively the same as swapping out the model for a different one. That makes sense as long as there aren't any requests in the queue, but while requests are processing, I don't know how the framework should interpret it.
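One workaround consistent with this constraint is to gate LoRA swaps on an empty queue: stop admitting new requests, let in-flight jobs finish, then swap the adapter. A minimal sketch follows; the `LoraSwapper` class, the `accepting_jobs` flag, and the polling interval are hypothetical, and `num_remaining_jobs()` / `set_loras()` are assumed from the dynamic generator examples rather than confirmed API.

```python
# Sketch: swap LoRAs only between batches by draining the generator queue first.
# LoraSwapper is a hypothetical wrapper; num_remaining_jobs() and set_loras()
# are assumed from exllamav2's dynamic generator examples.

import asyncio

class LoraSwapper:
    def __init__(self, generator):
        self.generator = generator
        self.accepting_jobs = True   # the serving layer checks this before enqueueing

    async def swap(self, new_lora):
        # Stop admitting new jobs, then wait for in-flight jobs to finish,
        # since queued batches assume the currently loaded LoRA.
        self.accepting_jobs = False
        while self.generator.num_remaining_jobs() > 0:
            await asyncio.sleep(0.05)
        self.generator.set_loras(new_lora)
        self.accepting_jobs = True
```

The trade-off is that new requests stall until the queue drains; the alternative of mixing adapters within one batch is exactly what the comment above says the framework cannot express.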