vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: BitsandBytes quantization with TP>1 #8197

Open jvlinsta opened 1 month ago

jvlinsta commented 1 month ago

🚀 The feature, motivation and pitch

QLoRA adapters trained on large checkpoints (e.g., 70B) are currently unusable because BitsandBytes quantization does not support TP>1, so the model cannot be sharded across multiple GPUs. Resolving this would enable serving models that were quantized during training, rather than relying on GPTQ and AWQ, which are applied post-hoc after training.

Alternatives

No response

Additional context

No response


copasseron commented 3 days ago

This feature was added in v0.6.2: https://github.com/vllm-project/vllm/pull/8434.
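
For reference, a minimal usage sketch of what this enables, assuming vLLM >= 0.6.2; the model name, adapter path, and GPU count below are placeholders, not values taken from this issue:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load a BitsandBytes-quantized base model sharded across 2 GPUs (TP=2).
# Model and adapter paths are placeholders for illustration only.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    tensor_parallel_size=2,
    enable_lora=True,  # required to attach a QLoRA adapter at inference time
)

# Generate with the QLoRA adapter applied on top of the quantized base model.
outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("qlora-adapter", 1, "/path/to/qlora_adapter"),
)
print(outputs[0].outputs[0].text)
```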