Open jiangchengchengark opened 1 month ago
Did you solve it? I'm also stuck on running bnb's model with multiple GPUs.
This is a must-have feature ...
At present, PyTorch's BNB quantization integration can be used to save a quantized model and then import it into vLLM. As of the time of this issue, vLLM does not appear to support performing BNB quantization directly across multiple GPUs within the vLLM framework itself.
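A minimal sketch of the save-then-import path described above, assuming transformers and bitsandbytes are installed; the model name, output directory, and 4-bit NF4 settings are illustrative, not prescribed by this thread.

```python
# Sketch: quantize a model with transformers' BitsAndBytes integration and
# save the checkpoint so vLLM can load it later. GPU-dependent work is kept
# behind the __main__ guard.

def make_bnb_4bit_kwargs() -> dict:
    """Keyword arguments for transformers' BitsAndBytesConfig (4-bit NF4)."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
    }

if __name__ == "__main__":
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.float16,  # T4 GPUs lack bfloat16 support
        **make_bnb_4bit_kwargs(),
    )
    model = AutoModelForCausalLM.from_pretrained(
        "baichuan-inc/Baichuan-7B",       # illustrative model
        quantization_config=config,
        device_map="auto",                # shards weights across available GPUs
        trust_remote_code=True,
    )
    model.save_pretrained("./baichuan-7b-bnb")  # checkpoint for vLLM to load
```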
Well, what if I have already quantized the model and just want to run it on multiple GPUs? That is essentially the basic QLoRA setup...
@jiangchengchengark Yes, I saved the quantized model with BNB but cannot load it with multiple GPUs in vLLM.
@robertgshaw2-neuralmagic
🚀 The feature, motivation and pitch
I am working on quantizing large models with BitsAndBytes. Quantization runs smoothly with transformers, but inference speed is still not ideal, so I want to deploy the quantized model with vLLM. I found that vLLM integrates BitsAndBytes, but I ran into trouble in practice: I quantize on Kaggle's dual T4 GPUs, and with the Baichuan-7B model I have to split its weights across the two GPUs. It seems vLLM does not yet support BNB quantization across multiple GPUs, although I could produce the BNB model by other means and then load its checkpoint with vLLM.
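For concreteness, the vLLM-side load that this issue reports as failing looks roughly like the sketch below. The checkpoint path is illustrative; `quantization="bitsandbytes"` and `load_format="bitsandbytes"` are vLLM's documented BNB options, and `tensor_parallel_size > 1` is the combination the thread says is unsupported at the time of writing.

```python
# Sketch: attempt to load a pre-quantized BNB checkpoint in vLLM across
# two GPUs. The actual LLM construction is behind the __main__ guard since
# it requires GPUs and an installed vllm.

def make_engine_kwargs(model_path: str, num_gpus: int) -> dict:
    """Engine arguments for loading a BNB-quantized checkpoint in vLLM."""
    return {
        "model": model_path,
        "quantization": "bitsandbytes",
        "load_format": "bitsandbytes",
        "tensor_parallel_size": num_gpus,  # >1 fails per this issue
        "dtype": "float16",
    }

if __name__ == "__main__":
    from vllm import LLM  # imported lazily; requires GPUs + vllm installed

    llm = LLM(**make_engine_kwargs("./baichuan-7b-bnb", 2))
    outputs = llm.generate("Hello")
    print(outputs[0].outputs[0].text)
```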
Alternatives
No response
Additional context
No response