vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Input length greater than 32K in nvidia/Llama-3.1-Nemotron-70B-Instruct-HF generates garbage on v0.6.3 (issue is not seen in v0.6.2) #9670

Open · source-ram opened 1 month ago

source-ram commented 1 month ago

Your current environment

Running via Docker:

```text
docker run --runtime nvidia --gpus "device=${CUDA_VISIBLE_DEVICES}" \
  --shm-size 8g \
  -v $volume:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=***" \
  -p 5005:5005 --ipc=host \
  vllm/vllm-openai:v0.6.3.post1 \
  --model nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --port 5005
```

Model Input Dumps

No response

🐛 Describe the bug

Note : The issue is not seen in release v0.6.2

Starting with release 0.6.3, any input larger than 32K tokens produces garbage output. The model deployed is nvidia/Llama-3.1-Nemotron-70B-Instruct-HF with tensor-parallel-size 4 on 4 A100 GPUs. When I rolled back to the 0.6.2 release the issue disappeared, and the model is stable up to 130K input tokens without any issue.
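For reference, a minimal sketch of the kind of long-context request that triggers the problem, assuming the OpenAI-compatible server started by the docker command above; `long_prompt.txt` is a placeholder for the actual >32K-token input:

```bash
# Sketch: send one long prompt (>32K tokens) to the OpenAI-compatible endpoint
# exposed on port 5005 by the docker command above.
# long_prompt.txt is a placeholder for the real long input.
curl http://localhost:5005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(jq -n --rawfile p long_prompt.txt \
        '{model: "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
          messages: [{role: "user", content: $p}],
          max_tokens: 256}')"
```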


jeejeelee commented 1 month ago

You can try using `--enforce-eager` to verify whether the issue is caused by cudagraph.
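For example, a sketch based on the docker command from the issue, with `--enforce-eager` appended to run the model in eager mode:

```bash
# Same serving command as in the issue, with --enforce-eager added to disable
# CUDA graph capture and isolate whether cudagraph is the cause.
docker run --runtime nvidia --gpus "device=${CUDA_VISIBLE_DEVICES}" \
  --shm-size 8g -v $volume:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=***" \
  -p 5005:5005 --ipc=host \
  vllm/vllm-openai:v0.6.3.post1 \
  --model nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
  --trust-remote-code --tensor-parallel-size 4 --port 5005 \
  --enforce-eager
```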

HuggingAha commented 1 month ago

For vllm==0.6.3, not only does inference with text exceeding 32K tokens produce garbled output, but garbled output can also occur with inputs of around 1K tokens under high concurrency. This issue persists even when using enforce-eager. However, when the problematic prompt is tested separately with the same parameters, no garbled output is produced.
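A rough sketch of the kind of high-concurrency reproduction described here, assuming the same OpenAI-compatible endpoint as above; the prompt file and request count are placeholders:

```bash
# Fire a batch of concurrent completion requests at the server started above.
# prompt.txt and the request count (32) are placeholders for illustration only.
for i in $(seq 1 32); do
  curl -s http://localhost:5005/v1/completions \
    -H "Content-Type: application/json" \
    -d "$(jq -n --rawfile p prompt.txt \
          '{model: "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
            prompt: $p, max_tokens: 128}')" &
done
wait  # wait for all background requests to finish
```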

jeejeelee commented 1 month ago

If you run into this issue in eager mode as well, then it is probably not caused by cudagraph. BTW, perhaps you can refer to https://github.com/vllm-project/vllm/issues/9581#issuecomment-2428535045 to track down the root cause.

sir3mat commented 1 month ago

Same behaviour with Llama 3.1 70B (128K context) and Llama 3.2 3B (128K context).

noamgai21 commented 3 days ago

Was this tried in 0.6.4?