kalocide opened this issue 3 months ago
I have the same issue running 0.5.3.post1 with the KV cache in fp8, but it went away when I switched back to the full-precision KV cache. I think this version of vLLM introduced some bug, but if you have the memory, removing the KV cache dtype arg might work for you as well.
I'll also be watching this issue to see whether a fix comes out and whether I was right that the KV cache is the culprit.
Additional info: I tried three versions of Llama 3.1 (AWQ, GPTQ, and unquantized), and all three suffered from this bug until I turned off the fp8 KV cache. The other models I use with vLLM were unaffected on 0.5.3.post1. In case it helps anyone reproduce it, here's a rough sketch of the kind of A/B check I mean, using the offline Python API. I'm assuming `LLM(**kwargs)` forwards `kv_cache_dtype` / `max_model_len` to the engine args; the model name, prompt, and context length are just placeholders:
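```python
# Rough repro sketch: run once with "fp8" and once with "auto" (full-precision
# KV cache) and compare outputs. Assumes vLLM 0.5.3.post1 and that LLM(**kwargs)
# forwards kv_cache_dtype / max_model_len to the engine args; the model name
# and prompt are just examples.
import sys

from vllm import LLM, SamplingParams

# KV cache dtype comes from the command line so each run gets a fresh process.
kv_dtype = sys.argv[1] if len(sys.argv) > 1 else "fp8"

llm = LLM(
    model="gradientai/Llama-3-8B-Instruct-Gradient-1048k",
    kv_cache_dtype=kv_dtype,
    max_model_len=8192,
)
params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["Write one sentence about llamas."], params)
print(f"kv_cache_dtype={kv_dtype!r} ->", out[0].outputs[0].text)
```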
@joe-schwartz-certara Can second this; unsetting the KV cache dtype did fix it when I tried it earlier. Unfortunately, it means I have to run the model at less than half the batch size (from 24 down to 8 before it would fit), but I'm only doing very small-scale inference (me and a few friends), so it doesn't matter much for me. I could imagine this bug being frustrating in production, though.
@satin-spirit After my testing, my assumption is that the vLLM team must've fudged something small when integrating the newest models. They're speedy and smart, so we'll probably get a fix soon.
Same bug with microsoft/Phi-3-medium-4k-instruct when using fp8_e5m2 or fp8_e4m3; unsetting kv-cache-dtype works. If it's useful for triage, here is a rough way to sweep the dtype variants, each in a fresh process so one engine's GPU allocation doesn't affect the next run. It assumes the sketch from the earlier comment is saved as `repro.py` and takes the dtype as its first argument; both names are hypothetical:
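```python
# Hypothetical sweep over KV cache dtypes, running the earlier repro sketch
# (saved as repro.py, dtype as argv[1]) once per setting in a fresh process.
import subprocess
import sys

for kv_dtype in ("auto", "fp8_e5m2", "fp8_e4m3"):
    print(f"--- kv_cache_dtype={kv_dtype} ---")
    result = subprocess.run(
        [sys.executable, "repro.py", kv_dtype],
        capture_output=True,
        text=True,
    )
    print(result.stdout.strip())
    if result.returncode != 0:
        print(f"failed with exit code {result.returncode}", file=sys.stderr)
        print(result.stderr[-2000:], file=sys.stderr)  # tail of the traceback
```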
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
🐛 Describe the bug
When running `gradientai/Llama-3-8B-Instruct-Gradient-1048k`, I get the following error. I haven't tried it with other models, but it happens at any `max-model-len`.

My CLI args:
Traceback: