vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Regression for AWQ marlin kernels from v0.6.2 to v0.6.3 when using CUDA Graphs #9417

Open joennlae opened 8 hours ago

joennlae commented 8 hours ago

Your current environment

First of all: fantastic project :-) Thank you for everything.

I would like to fix this bug myself, but I just do not have the capacity right now, so I thought I would at least write up a good bug report.

Model Input Dumps

No response

🐛 Describe the bug

If I run this model in v0.6.2:

vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 -tp 4 --gpu-memory-utilization 0.90 --max-model-len 32768

everything works fine :-)

If I run it in v0.6.3 with --enforce-eager:

vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 -tp 4 --gpu-memory-utilization 0.90 --max-model-len 32768 --enforce-eager

everything also works fine :-)

If I drop --enforce-eager:

vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 -tp 4 --gpu-memory-utilization 0.90 --max-model-len 32768

I get random repetition on long prompts (6000+ tokens), and if I send multiple requests in parallel I get a CUDA illegal memory access error.
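
For reference, here is a minimal client-side sketch (untested) of how I trigger the parallel-request path against the server started above. It assumes the default endpoint http://localhost:8000/v1, that the served model name is the HF repo id, and uses arbitrary filler text as the long prompt:

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# Assumptions: server from the command above on the default port, served model
# name equal to the HF repo id, and a filler prompt of roughly 6000+ tokens
# (the length where the repetition shows up for me).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"
LONG_PROMPT = "Summarize the following text.\n" + ("vLLM is fast. " * 1500)


def one_request(i: int) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": LONG_PROMPT}],
        max_tokens=256,
        temperature=0.0,
    )
    return resp.choices[0].message.content


# A handful of parallel requests is enough to hit the illegal memory access
# for me on v0.6.3 without --enforce-eager.
with ThreadPoolExecutor(max_workers=4) as pool:
    for out in pool.map(one_request, range(4)):
        print(out[:200])
```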

My guess is that something in the updated awq_marlin kernels is shape-dependent (dynamic) and does not play well with CUDA graph capture, since --enforce-eager (which disables CUDA graphs) avoids the problem.
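
To narrow that down, an equivalent comparison can be sketched with the offline LLM API (untested; the prompt is the same arbitrary filler, and each configuration should be run in its own process):

```python
import sys

from vllm import LLM, SamplingParams

# Run once with "eager" on the command line and once without, in separate
# processes, so the only difference is whether CUDA graphs are captured.
eager = "eager" in sys.argv[1:]

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    max_model_len=32768,
    enforce_eager=eager,
)

# Arbitrary filler prompt of roughly 6000+ tokens.
prompt = "Summarize the following text.\n" + ("vLLM is fast. " * 1500)
params = SamplingParams(temperature=0.0, max_tokens=256)

text = llm.generate([prompt], params)[0].outputs[0].text
print(f"enforce_eager={eager}: {text[:300]}")
```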

My hunch (untested): #8973, although I do not fully understand how my non-MoE model would be affected by it.


leangermany commented 3 hours ago

We are seeing the same issue, and for us it also works with --enforce-eager.