vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Hope that we can use multi-GPU directly in vLLM for BitsAndBytes quantization #7063

Open jiangchengchengark opened 1 month ago

jiangchengchengark commented 1 month ago

🚀 The feature, motivation and pitch

I am working on quantizing large models with BitsAndBytes. Quantization works smoothly with transformers, but the inference speed is still not ideal, so I want to deploy the quantized model with vLLM. I found that vLLM integrates BitsAndBytes, but I run into trouble in practice: I quantize the model on a Kaggle dual-T4 GPU setup, and with the Baichuan-7B model I have to split the weights across the two GPUs. It seems that vLLM does not yet support BNB quantization across multiple GPUs, although I could use other methods to produce the BNB model and then load its checkpoint with vLLM. (screenshot of the error attached)
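For concreteness, a minimal sketch of the configuration this request is about, i.e. BitsAndBytes quantization combined with tensor parallelism over the two T4s. The model id is the one from the report; whether these exact flags are accepted (and whether the combination is rejected) depends on the vLLM version:

```python
# Hypothetical invocation illustrating the feature being requested:
# BNB quantization plus tensor parallelism across two GPUs.
from vllm import LLM

llm = LLM(
    model="baichuan-inc/Baichuan-7B",   # example model from the report
    quantization="bitsandbytes",        # vLLM's BitsAndBytes integration
    load_format="bitsandbytes",
    tensor_parallel_size=2,             # split weights across both T4 GPUs
    trust_remote_code=True,
)
```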

Alternatives

No response

Additional context

No response

junzhang-zj commented 1 month ago

Did you solve it? I'm also stuck on running bnb's model with multiple GPUs.

jvlinsta commented 1 month ago

This is a must-have feature ...

jiangchengchengark commented 1 month ago

> Did you solve it? I'm also stuck on running bnb's model with multiple GPUs.

At present, you can use the BitsAndBytes integration in transformers/PyTorch to quantize and save the model, then load that checkpoint in vLLM. As of the time of this issue, vLLM does not appear to support BNB quantization across multiple GPUs directly within the vLLM framework.
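For anyone looking for that interim workaround, here is a minimal sketch assuming the standard transformers BitsAndBytes config and the vLLM `LLM` entry point. The model id, output path, and sampling settings are placeholders, and the exact vLLM flags may differ by version:

```python
# Step 1: quantize with the transformers BitsAndBytes integration and save the
# checkpoint (4-bit serialization requires a recent transformers/bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "baichuan-inc/Baichuan-7B"  # placeholder model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model.save_pretrained("./baichuan-7b-bnb-4bit")
tokenizer.save_pretrained("./baichuan-7b-bnb-4bit")

# Step 2: load the pre-quantized checkpoint in vLLM. At the time of this issue
# this only worked on a single GPU (tensor_parallel_size=1) with bitsandbytes.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./baichuan-7b-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    trust_remote_code=True,
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```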

jvlinsta commented 1 month ago

Well, what if I have already quantized the model and just want to run it on multiple GPUs? So essentially the basic QLoRA setup...
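For reference, a rough sketch of what serving that QLoRA-style setup (BNB-quantized base model plus a LoRA adapter) could look like in vLLM on a single GPU. The base model and adapter path are placeholders, and whether `enable_lora` works together with bitsandbytes depends on the vLLM version; multi-GPU is exactly the missing piece this issue asks for:

```python
# Sketch: BNB-quantized base model + LoRA adapter served with vLLM (single GPU).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="huggyllama/llama-7b",        # placeholder base model
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    enable_lora=True,
    tensor_parallel_size=1,             # >1 with bitsandbytes is the feature request
)

outputs = llm.generate(
    ["Write a haiku about GPUs."],
    SamplingParams(max_tokens=48),
    lora_request=LoRARequest("my_adapter", 1, "/path/to/qlora_adapter"),
)
print(outputs[0].outputs[0].text)
```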

junzhang-zj commented 4 weeks ago

@jiangchengchengark Yes, I saved the quantized model with BNB but cannot load it with multiple GPUs in vLLM.

jvlinsta commented 2 weeks ago

@robertgshaw2-neuralmagic