Open · zhihui96 opened this issue 1 month ago

Your current environment

🐛 Describe the bug

I ran vLLM in Kubernetes with the following YAML. Then I ran throughput tests against it and got the following error. The error occurs sporadically, and I cannot reproduce it consistently.
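(The manifest itself is not reproduced in this thread. As a rough illustration only, a minimal vLLM Deployment of this shape might look like the sketch below; the name, image tag, model, port, and GPU count are hypothetical placeholders, not the reporter's values.)

```yaml
# Illustrative sketch only -- all values below are assumptions,
# not the manifest from the original report.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          # Published OpenAI-compatible server image referenced later in the thread
          image: vllm/vllm-openai:latest
          args:
            - --model=Qwen/Qwen2-7B-Instruct   # placeholder model
            - --gpu-memory-utilization=0.9
          ports:
            - containerPort: 8000              # vLLM's default serving port
          resources:
            limits:
              nvidia.com/gpu: 1
```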
Also hitting the same bug on 4 A10 cards with Qwen2-72B-Instruct-GPTQ-Int4, running with --gpu-memory-utilization=0.9 --enable-prefix-caching.
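(In a pod spec like the sketch above, that setup would correspond to container args roughly as follows; --tensor-parallel-size=4 is an assumption inferred from the four A10 cards, not stated in the comment.)

```yaml
          args:
            - --model=Qwen/Qwen2-72B-Instruct-GPTQ-Int4
            - --tensor-parallel-size=4   # assumed: one shard per A10
            - --gpu-memory-utilization=0.9
            - --enable-prefix-caching
```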
We pushed a hotfix for the published image, so please re-pull and redeploy. For other installation methods, you can work around it by setting VLLM_ATTENTION_BACKEND=XFORMERS. This is also addressed on the main branch, and we will push out a fix release soon.
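(For a Kubernetes deployment like the sketch above, one way to apply that workaround is through the container's environment:)

```yaml
          env:
            - name: VLLM_ATTENTION_BACKEND
              value: "XFORMERS"
```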
That works for me, thanks.
@simon-mo can you link the PR for the hotfix? Is it https://github.com/vllm-project/vllm/pull/5476?