vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

Can vllm become faster? #2327

Closed godsakurapeng closed 6 months ago

godsakurapeng commented 8 months ago

I found an article, "Accelerating Generative AI with PyTorch II: GPT, Fast". The optimizations used in that article are shown in the image below. I briefly tried gpt-fast, and the improvement is huge:

codellama-python-7b, 2x A10 (24 GB):

| Configuration | Inference speed (tokens/s) |
| --- | --- |
| vLLM fp16 | 45.2 |
| gpt-fast fp16 | 66.5 |
| gpt-fast int8 | 105.1 |
| gpt-fast int4 | 145.9 |

PS: the generation quality with int4 is terrible.
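For context, here is a minimal sketch of how a tokens-per-second number like the ones above can be measured with vLLM's offline Python API. The model name, prompt, and generation length are placeholders, not the exact benchmark setup used for the table:

```python
import time
from vllm import LLM, SamplingParams

# Placeholder setup; the table above used codellama-python-7b on 2x A10.
llm = LLM(model="codellama/CodeLlama-7b-Python-hf", tensor_parallel_size=2)
params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["def quicksort(arr):"]
start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

# Count only generated tokens to get a rough decode throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tokens/s")
```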

I'm curious: can these optimizations be used in vLLM? I can see some discussion about each of them, but it doesn't look like they will land in the short term (because of some issues specific to vLLM?):

torch.compile

- +34% higher throughput?
- Compiled model with torch.compile, unfortunately without performance improvements
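For reference, much of the gpt-fast speedup comes from compiling the decode step. Below is a minimal illustration of what that looks like in plain PyTorch (not vLLM code; the module is just a stand-in and a GPU is assumed):

```python
import torch

# Placeholder stand-in for a decoder; gpt-fast compiles the real transformer decode step.
model = torch.nn.Linear(4096, 4096).cuda().half()

# "reduce-overhead" enables CUDA graphs, which is where much of the decode speedup comes from.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 4096, device="cuda", dtype=torch.half)
with torch.no_grad():
    y = compiled(x)  # first call triggers compilation; later calls reuse the compiled graph
```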

quantization

Add GPTQ support (I tried a version of it before, but it didn't work well)
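The gpt-fast int8 number comes from weight-only quantization: weights are stored in int8 with per-output-channel scales and dequantized on the fly. A rough, self-contained sketch of the idea (illustrative only, not the gpt-fast or vLLM implementation):

```python
import torch

def quantize_int8(weight: torch.Tensor):
    # Per-output-channel symmetric quantization: one scale per row of the weight matrix.
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((weight / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def int8_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    # Dequantize on the fly; real kernels fuse this with the matmul for speed.
    return x @ (q.to(x.dtype) * scale).t()

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
x = torch.randn(1, 4096)
print((int8_linear(x, q, s) - x @ w.t()).abs().max())  # small quantization error
```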

Speculative Decoding

Speculative Decoding
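For readers unfamiliar with the technique: a small draft model proposes several tokens, the large target model verifies them in a single forward pass, and the longest agreeing prefix is kept. A greedy-verification sketch with hypothetical `draft`/`target` callables (the real algorithm uses probabilistic acceptance; this is not vLLM's implementation):

```python
import torch

def speculative_step(target, draft, input_ids, k=4):
    """One speculative decoding step with greedy verification.

    `target` and `draft` are hypothetical callables returning logits of
    shape [1, seq_len, vocab_size]; `input_ids` has shape [1, seq_len].
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed = input_ids
    for _ in range(k):
        next_tok = draft(proposed)[:, -1].argmax(dim=-1, keepdim=True)
        proposed = torch.cat([proposed, next_tok], dim=-1)

    # 2. Target model scores all proposed positions in a single forward pass.
    target_logits = target(proposed)
    target_preds = target_logits[:, input_ids.shape[1] - 1 : -1].argmax(dim=-1)

    # 3. Accept the longest prefix where draft and target agree.
    drafted = proposed[:, input_ids.shape[1]:]
    matches = (drafted == target_preds)[0]
    n_accept = int(matches.long().cumprod(dim=0).sum())

    # Always gain at least one token: the target's own prediction after the accepted prefix.
    accepted = proposed[:, : input_ids.shape[1] + n_accept]
    bonus = target_logits[:, input_ids.shape[1] + n_accept - 1].argmax(dim=-1, keepdim=True)
    return torch.cat([accepted, bonus], dim=-1)
```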

vLLM is a great project!! I really hope to see these optimizations land in vLLM, and I'd also like to understand what difficulties remain. :)

yuxx0218 commented 8 months ago

Hi, I want to run THUDM/chatglm-6b-int4 with vLLM, but it raises a CUDA OOM error. Based on the log, it requires at least 10 GB of GPU memory. When I run this model directly with huggingface transformers, it only uses 5 GB. Do you know why?

godsakurapeng commented 8 months ago

> Hi, I want to run THUDM/chatglm-6b-int4 with vLLM, but it raises a CUDA OOM error. Based on the log, it requires at least 10 GB of GPU memory. When I run this model directly with huggingface transformers, it only uses 5 GB. Do you know why?

Maybe you can take a look at this issue: https://github.com/vllm-project/vllm/issues/2176
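Not an answer to the OOM itself, but note that vLLM pre-allocates most of the free GPU memory for its paged KV cache by default, which is why its reported usage is higher than plain transformers. A hedged sketch of the knobs that are commonly lowered (the values below are examples, not a tested config, and int4 chatglm checkpoints may not load in vLLM at all):

```python
from vllm import LLM

llm = LLM(
    model="THUDM/chatglm-6b-int4",  # the model from the question; int4 support is a separate issue
    gpu_memory_utilization=0.6,     # fraction of GPU memory vLLM reserves (default 0.9)
    max_model_len=2048,             # cap the context length the KV cache is sized for
)
```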

yuxx0218 commented 8 months ago

> Hi, I want to run THUDM/chatglm-6b-int4 with vLLM, but it raises a CUDA OOM error. Based on the log, it requires at least 10 GB of GPU memory. When I run this model directly with huggingface transformers, it only uses 5 GB. Do you know why?

> #2176 maybe you can take a look at this issue

Thanks, but it doesn't work. I've opened a new issue; could you please take a look? https://github.com/vllm-project/vllm/issues/2338

hmellor commented 6 months ago

Closing this as all the mentioned optimisations are either in progress or merged.