vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: BitsandBytes quantization with TP>1 #8197

Open jvlinsta opened 1 month ago

jvlinsta commented 1 month ago

🚀 The feature, motivation and pitch

QLoRA adapters trained on large checkpoints (e.g., 70B) are currently unusable because BitsandBytes quantization does not support TP>1, so the model cannot be sharded across multiple GPUs. Resolving this would enable serving models that were quantized during training, rather than relying on GPTQ and AWQ, which are applied post-hoc after training.

Alternatives

No response

Additional context

No response


copasseron commented 3 days ago

This feature was added in v0.6.2: https://github.com/vllm-project/vllm/pull/8434.
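
For reference, a minimal usage sketch of what this enables, assuming vLLM >= 0.6.2; the model name, adapter path, and GPU count below are placeholders, not values taken from this issue:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load a BitsandBytes-quantized base model sharded across 2 GPUs (TP=2).
# Model and adapter paths are placeholders for illustration only.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    tensor_parallel_size=2,
    enable_lora=True,  # required to attach a QLoRA adapter at inference time
)

# Generate with the QLoRA adapter applied on top of the quantized base model.
outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("qlora-adapter", 1, "/path/to/qlora_adapter"),
)
print(outputs[0].outputs[0].text)
```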