vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

Why does the speed not increase after compressing the model? #852

Open liho00 opened 4 days ago

liho00 commented 4 days ago

https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8

https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_int8

https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w4a16

I tried these examples to generate new compressed checkpoints and loaded them with vLLM 0.6.3:

python -m vllm.entrypoints.openai.api_server --served-model-name /home/llm-compressor/examples/quantization_w8a8_fp8/Llama-3.1-8B-Instruct-FP8 --model meta-llama/Llama-3.1-8B-Instruct --port 8000 --host 0.0.0.0 --tensor-parallel-size 8 --gpu-memory-utilization 0.98

base model: 215 tok/s
compressed model: 205 tok/s
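For context, the first linked example (quantization_w8a8_fp8) boils down to roughly the sketch below. Exact import paths and arguments may differ between llm-compressor versions, so treat this as an illustration rather than the canonical script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic quantization of every Linear layer except the LM head;
# the FP8_DYNAMIC scheme does not require calibration data.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format so vLLM can load it directly.
SAVE_DIR = "Llama-3.1-8B-Instruct-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```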

robertgshaw2-neuralmagic commented 4 days ago

It looks like you are running the FP16 model in your launch command

That being said, you are running a 3b model with tp=8. I do not think you will see much performance benefit from fp8 in this regime since the linear layers are very small in this setup
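For reference, a corrected version of the original launch command would pass the quantized checkpoint directory to --model and keep --served-model-name as the public alias. All flags below are taken from the poster's command; the path is assumed to contain the FP8 weights saved by the example script:

python -m vllm.entrypoints.openai.api_server --model /home/llm-compressor/examples/quantization_w8a8_fp8/Llama-3.1-8B-Instruct-FP8 --served-model-name Llama-3.1-8B-Instruct-FP8 --port 8000 --host 0.0.0.0 --tensor-parallel-size 8 --gpu-memory-utilization 0.98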

liho00 commented 4 days ago

> It looks like you are running the FP16 model in your launch command
>
> That being said, you are running a 3b model with tp=8. I do not think you will see much performance benefit from fp8 in this regime since the linear layers are very small in this setup

Sorry for the typo, it should be the 8B model Llama-3.1-8B-Instruct-FP8.

python -m vllm.entrypoints.openai.api_server --served-model-name Llama-3.1-8B-Instruct-FP8 --model /root/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 --port 8000 --host 0.0.0.0 --tensor-parallel-size 8 --gpu-memory-utilization 0.95 --dtype bfloat16 --quantization compressed-tensors

Any idea how to speed up the compressed model with vLLM? Ideally I want low latency for the first token.

[screenshot of benchmark results]

Or does quantization only give a speedup for larger models, like 70B?
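One direct way to check time-to-first-token is to stream from the OpenAI-compatible endpoint and time the first returned chunk. A minimal sketch, assuming the server from the commands above is listening on localhost:8000 and that the served model name matches --served-model-name:

```python
import time

from openai import OpenAI  # pip install openai

# Point the client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="Llama-3.1-8B-Instruct-FP8",  # must match --served-model-name
    messages=[{"role": "user", "content": "Write a haiku about compression."}],
    max_tokens=128,
    stream=True,
)

first_chunk_at = None
n_chunks = 0
for chunk in stream:
    if first_chunk_at is None:
        first_chunk_at = time.perf_counter()  # time to first token
    n_chunks += 1
total = time.perf_counter() - start

print(f"TTFT: {first_chunk_at - start:.3f}s, total: {total:.3f}s, chunks: {n_chunks}")
```

Running this against both the base and the quantized checkpoint would show whether the FP8 model actually changes first-token latency in this setup.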

robertgshaw2-neuralmagic commented 4 days ago

One last question - is this running on an H100?

liho00 commented 4 days ago

> One last question - is this running on an H100?

Yep, 8x H100 SXM5.

Can I add you on Discord to share further details?

robertgshaw2-neuralmagic commented 1 day ago

> One last question - is this running on an H100?
>
> Yep, 8x H100 SXM5.
>
> Can I add you on Discord to share further details?

With 8x H100s, your system is very overpowered for running an 8B-parameter model, so the end-to-end speedup from quantization is small (and we have not really tuned the FP8 kernels for matrices that are this skinny).
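To make the "skinny matrices" point concrete, here is a rough back-of-the-envelope sketch. The Llama-3.1-8B shapes below are assumptions, and vLLM fuses some of these projections, so this is only an illustration:

```python
# Per-GPU share of each linear layer's weights under tensor parallelism.
# Assumed Llama-3.1-8B shapes: hidden_size=4096, intermediate_size=14336.
hidden, intermediate, tp = 4096, 14336, 8

layers = {
    "attention output proj (hidden x hidden)": hidden * hidden,
    "MLP gate proj (hidden x intermediate)": hidden * intermediate,
}

for name, params in layers.items():
    # Each weight matrix is sharded across the tp ranks, so every GPU's GEMM
    # touches roughly 1/tp of the full matrix -- small enough that FP8 compute
    # savings are largely hidden behind launch and communication overhead.
    print(f"{name}: {params / 1e6:.1f}M params total, {params / tp / 1e6:.2f}M per GPU")
```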

I would expect to see speedups on a single H100 at the 8B-parameter scale, though.
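A quick way to sanity-check that expectation would be to rerun the same base-vs-FP8 comparison on one GPU, reusing the flags from the earlier commands (the path is again assumed to hold the FP8 checkpoint):

python -m vllm.entrypoints.openai.api_server --model /home/llm-compressor/examples/quantization_w8a8_fp8/Llama-3.1-8B-Instruct-FP8 --served-model-name Llama-3.1-8B-Instruct-FP8 --port 8000 --host 0.0.0.0 --tensor-parallel-size 1 --gpu-memory-utilization 0.95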