Closed godsakurapeng closed 6 months ago
Hi, I want to run THUDM/chatglm-6b-int4 with vLLM, but it raises a CUDA OOM error. Based on the log, it requires at least 10 GB of GPU memory. When I run this model directly with Hugging Face Transformers, it only uses 5 GB. Do you know why?
https://github.com/vllm-project/vllm/issues/2176 maybe you can look at this issue
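For context (my own note, not from the linked issue): vLLM preallocates a large KV-cache region up front, sized by the `gpu_memory_utilization` engine argument (default 0.9), which is likely why it reserves far more memory than eager Transformers inference; lowering that argument reduces the reservation. A rough pure-Python estimate of the KV-cache footprint, using assumed ChatGLM-6B-like shapes (28 layers, 32 heads, head dim 128, fp16), sketches why the preallocation can dominate:

```python
# Rough KV-cache size estimate. The model shapes below are assumptions
# for a ChatGLM-6B-like architecture, not values read from the checkpoint.
def kv_cache_bytes(num_layers, num_heads, head_dim, max_tokens, dtype_bytes=2):
    # 2x for keys and values; one entry per token, per layer, per head.
    return 2 * num_layers * num_heads * head_dim * max_tokens * dtype_bytes

gib = kv_cache_bytes(num_layers=28, num_heads=32, head_dim=128,
                     max_tokens=8192) / 1024**3
print(f"~{gib:.1f} GiB for an 8192-token KV cache")  # ~3.5 GiB
```

On top of the ~12 GB of fp16 weights a non-quantized 6B model would need, this is why vLLM sizes its reservation from total GPU memory rather than from what a single forward pass touches.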
Thanks, but it doesn't work. I opened a new issue, could you please have a look? https://github.com/vllm-project/vllm/issues/2338
Closing this as all the mentioned optimisations are either in progress or merged.
I found an article, Accelerating Generative AI with PyTorch II: GPT, Fast. The optimizations used in that article are shown below. I briefly tried gpt-fast, and the improvement is huge.
PS: the int4 results are terrible.
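For intuition on why int4 quality can degrade, here is a minimal pure-Python sketch (my own illustration, not gpt-fast's kernel) of symmetric round-to-nearest int4 quantization; with only 16 levels per scale group, the reconstruction error is large:

```python
# Toy symmetric int4 round-trip (illustration only, not gpt-fast's code).
def quantize_int4(weights):
    # Per-tensor scale so the largest magnitude maps to 7 (int4 range is -8..7);
    # fall back to 1.0 if all weights are zero.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.31, -0.07, 0.92, -0.55]
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, f"max reconstruction error {max_err:.3f}")  # [2, -1, 7, -4] ...
```

Real int4 schemes (GPTQ, groupwise scales) work hard to reduce exactly this error, which is presumably why naive int4 fares worse than int8.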
I'm curious: can these optimizations be used in vLLM? I can see some discussion about them, but it doesn't look like they will land in the short term (because of some vLLM-specific problems?).
- torch.compile: "+34% higher throughput?" / "Compiled model with torch.compile, unfortunately without performance improvements"
- quantization: "Add GPTQ support" (I tried a version before, but it didn't work well)
- Speculative Decoding
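To make the last item concrete, here is a toy pure-Python sketch of the greedy speculative-decoding accept/verify loop (my own illustration with made-up stand-in models, not vLLM or gpt-fast code): a cheap draft model proposes `k` tokens, the target model keeps the longest verified prefix, and the output is guaranteed to match plain target decoding.

```python
# Toy greedy speculative decoding; the "models" are deterministic stand-ins
# (functions from a token prefix to the next token), purely for illustration.
def speculative_decode(prefix, steps, k, draft_next, target_next):
    out = list(prefix)
    while len(out) < len(prefix) + steps:
        # Draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target model verifies: keep proposals while they match its greedy choice.
        accepted, ctx = 0, list(out)
        for t in proposal:
            if target_next(ctx) != t:
                break
            ctx.append(t)
            accepted += 1
        out += proposal[:accepted]
        # On mismatch (or after full acceptance) emit one target token for free.
        out.append(target_next(out))
    return out[:len(prefix) + steps]

target = lambda ctx: (ctx[-1] + 3) % 10                        # toy "big" model
draft = lambda ctx: target(ctx) if ctx[-1] % 2 == 0 else 9     # toy draft, often wrong
print(speculative_decode([2], 8, 3, draft, target))  # [2, 5, 8, 1, 4, 7, 0, 3, 6]
```

The speedup comes from verifying the `k` drafted tokens in one batched target forward pass instead of `k` sequential ones; the toy loop above only shows the accept/reject logic, not the batching.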
vLLM is a great project!! I really hope to see these optimizations in vLLM. I'd also like to understand the difficulties that remain :)