source-ram opened 1 month ago
You can try using `enforce-eager` to verify whether the issue is caused by cudagraph.
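For reference, a minimal sketch of that check, assuming the Docker deployment shown further down in this issue: the only change is appending `--enforce-eager`, which disables CUDA graph capture in the server.

```bash
# Same deployment as in the issue, with CUDA graphs disabled via --enforce-eager.
docker run --runtime nvidia --gpus "device=${CUDA_VISIBLE_DEVICES}" \
    --shm-size 8g -v $volume:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=***" -p 5005:5005 --ipc=host \
    vllm/vllm-openai:v0.6.3.post1 \
    --model nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
    --trust-remote-code --tensor-parallel-size 4 --port 5005 \
    --enforce-eager
```

If the garbled output still appears with this flag, CUDA graph capture can likely be ruled out as the cause.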
For vllm==0.6.3, not only does inference with text exceeding 32K tokens produce garbled output, but garbled output can also occur with inputs of around 1K tokens in high-concurrency scenarios. This issue persists even when using `enforce-eager`. However, when the problematic prompt is tested separately with the same parameters, no garbled output is produced.
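A rough repro sketch for the high-concurrency case, assuming an OpenAI-compatible vLLM server on port 5005; the prompt below is a placeholder, and the actual ~1K-token prompt that triggers the problem would need to be pasted in.

```bash
# Fire 32 concurrent completion requests at the server; the same prompt sent
# alone reportedly does not reproduce the garbled output.
for i in $(seq 1 32); do
  curl -s http://localhost:5005/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
          "prompt": "<paste the ~1K-token prompt that produced garbled output>",
          "max_tokens": 256,
          "temperature": 0
        }' > "out_$i.json" &
done
wait
# Then inspect out_*.json manually for garbled completions.
```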
If you run into this issue in eager mode as well, then it might not be due to that reason (cudagraph). BTW, perhaps you can refer to https://github.com/vllm-project/vllm/issues/9581#issuecomment-2428535045 to track down the root cause.
Same behaviour with Llama 3.1 70B (128K context) and Llama 3.2 3B (128K context).
Was this tried in 0.6.4?
Your current environment
Running via Docker
```text
docker run --runtime nvidia --gpus "device=${CUDA_VISIBLE_DEVICES}" \
    --shm-size 8g -v $volume:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=***" -p 5005:5005 --ipc=host \
    vllm/vllm-openai:v0.6.3.post1 \
    --model nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
    --trust-remote-code --tensor-parallel-size 4 --port 5005
```

Model Input Dumps
No response
🐛 Describe the bug
Note: the issue is not seen in release v0.6.2.
Since release 0.6.3, any input larger than 32K tokens makes the model output garbage. The model deployed is nvidia/Llama-3.1-Nemotron-70B-Instruct-HF with tensor-parallel-size 4 on 4 A100 GPUs. When I rolled back to the 0.6.2 release the issue disappeared, and the model remains stable up to 130K input tokens without any issue.
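For comparing releases (including the 0.6.4 question above), one option is to keep the command identical and only change the image tag; `v0.6.4.post1` is assumed here as the newer tag to try, while only v0.6.2 and v0.6.3.post1 were actually tested in this report.

```bash
# Re-run the identical deployment with a different vLLM image tag to bisect
# the regression between releases; only TAG changes.
TAG=v0.6.4.post1   # tested in this report: v0.6.2 (good), v0.6.3.post1 (bad)
docker run --runtime nvidia --gpus "device=${CUDA_VISIBLE_DEVICES}" \
    --shm-size 8g -v $volume:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=***" -p 5005:5005 --ipc=host \
    vllm/vllm-openai:$TAG \
    --model nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
    --trust-remote-code --tensor-parallel-size 4 --port 5005
```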