vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Nonsense output for Qwen2.5 72B after upgrading to latest vllm 0.6.3.post1 [REPROs] #9769

Open pseudotensor opened 1 week ago

pseudotensor commented 1 week ago

Your current environment

Docker image 0.6.3.post1 (vllm/vllm-openai), 8×A100

docker pull vllm/vllm-openai:latest
docker stop qwen25_72b ; docker remove qwen25_72b
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=4,5,6,7"' \
    --shm-size=10.24gb \
    -p 5001:5001 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -v "${HOME}"/.cache:$HOME/.cache/ -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name qwen25_72b \
     vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=Qwen/Qwen2.5-72B-Instruct \
        --tensor-parallel-size=4 \
        --seed 1234 \
        --trust-remote-code \
        --max-model-len=32768 \
        --max-num-batched-tokens 131072 \
        --max-log-len=100 \
        --api-key=EMPTY \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.qwen25_72b.txt

Model Input Dumps

No response

🐛 Describe the bug

No such issues with prior vLLM 0.6.2.

Trivial queries work:

from openai import OpenAI

client = OpenAI(base_url='FILL ME', api_key='FILL ME')

messages = [
    {
        "role": "user",
        "content": "Who are you?",
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=messages,
    temperature=0.0,
    max_tokens=4096,
)

print(response.choices[0])

But longer inputs lead to nonsense only in the new vLLM:

qwentest1.py.zip

Gives:

Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='A\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text\n</text>\n\n</text>\n\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text>\n\n</text\n</text>\n\n</text\n</text\n\n</text\n</text>\n\n</text\n</text\n</text\n</text>\n\n</text\n</text\n</text>\n\n</text>\n\n</text\n\n</text\n</text\n</text\n</text>\n\n</text>\n\n</text\n</text>\n\n</text\n\n\n</text>\n\n</text\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text\n</text\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text\n</text</text</text\n</text</text\n</text</text</text\n</text\n</text\n</text>\n\n</text</text</text</text>\n\n</text>\n\n</text</text>\n\n</text>\n\n</text\n</text</text\n</text\n</text>\n\n</text\n</text>\n\n</text\n</text>\n\n</text\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text</text>\n\n</text</text</text</text</text</text>\n\n</text</text</text</text</text</text</text</text</text</text</text>\n\n</text>\n\n</text</text>\n\n</text</text</text</text</text>\n\n</text</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text\n</text>\n\n</text\n</text>\n\n</text>\n\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text>\n', refusal=None, role='assistant', function_call=None, tool_calls=[]), stop_reason=None)
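
The attached qwentest1.py is not inlined here; a rough sketch of an equivalent long-input request against the same server (the repeated filler prompt below is illustrative only, not the original benchmark input) looks like this:

from openai import OpenAI

client = OpenAI(base_url='FILL ME', api_key='FILL ME')

# Roughly 20k tokens of repeated filler text (well under the 32768 max-model-len)
# to force a long-context request.
long_context = "The quick brown fox jumps over the lazy dog. " * 2000

messages = [
    {
        "role": "user",
        "content": long_context + "\n\nSummarize the text above in one sentence.",
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=messages,
    temperature=0.0,
    max_tokens=512,
)

print(response.choices[0].message.content)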

Full logs from that running state; it had just been running overnight and was running some benchmarks.

qwen25_72b.bad.log.zip

Related or not? https://github.com/vllm-project/vllm/issues/9732


pseudotensor commented 1 week ago

nvidia-smi:

ubuntu@h2ogpt-a100-node-1:~$ nvidia-smi
Mon Oct 28 19:41:45 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:0F:00.0 Off |                    0 |
| N/A   43C    P0             69W /  400W |   69883MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:15:00.0 Off |                    0 |
| N/A   41C    P0             71W /  400W |   69787MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  |   00000000:50:00.0 Off |                    0 |
| N/A   41C    P0             72W /  400W |   69787MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  |   00000000:53:00.0 Off |                    0 |
| N/A   41C    P0             67W /  400W |   69499MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  |   00000000:8C:00.0 Off |                    0 |
| N/A   68C    P0            332W /  400W |   77735MiB /  81920MiB |     96%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  |   00000000:91:00.0 Off |                    0 |
| N/A   60C    P0            318W /  400W |   77639MiB /  81920MiB |     92%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  |   00000000:D6:00.0 Off |                    0 |
| N/A   63C    P0            331W /  400W |   77639MiB /  81920MiB |     93%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  |   00000000:DA:00.0 Off |                    0 |
| N/A   72C    P0            331W /  400W |   77351MiB /  81920MiB |     94%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   1815338      C   /usr/bin/python3                            69864MiB |
|    1   N/A  N/A   1815472      C   /usr/bin/python3                            69768MiB |
|    2   N/A  N/A   1815473      C   /usr/bin/python3                            69768MiB |
|    3   N/A  N/A   1815474      C   /usr/bin/python3                            69480MiB |
|    4   N/A  N/A   1980777      C   /usr/bin/python3                            77716MiB |
|    5   N/A  N/A   1981060      C   /usr/bin/python3                            77620MiB |
|    6   N/A  N/A   1981061      C   /usr/bin/python3                            77620MiB |
|    7   N/A  N/A   1981062      C   /usr/bin/python3                            77332MiB |
+-----------------------------------------------------------------------------------------+

The other 4 GPUs are running Qwen VL 2 76B.

ubuntu@h2ogpt-a100-node-1:~$ docker ps
CONTAINER ID   IMAGE                     COMMAND                  CREATED        STATUS        PORTS     NAMES
78dce1c637ec   vllm/vllm-openai:latest   "python3 -m vllm.ent…"   27 hours ago   Up 27 hours             qwen25_72b
d2918b1209aa   vllm/vllm-openai:latest   "python3 -m vllm.ent…"   4 weeks ago    Up 5 days               qwen72bvll

pseudotensor commented 1 week ago

Even after restarting the Docker container, I get the same result.

So the above script is a fine repro. It isn't the only way, of course; all of our longer inputs fail with 0.6.3.post1.

pseudotensor commented 1 week ago

Note that this model is an extremely good, competitive model for coding and agents, so it really needs to be a first-class citizen for the vLLM team in terms of testing, etc.

osilverstein commented 1 week ago

I just posted a similar issue but with totally different params. I wonder if it's related at all: issue

HoboRiceone commented 1 week ago

Facing similar problems.

cedonley commented 1 week ago

I had issues with long context. They are related to the issue fixed in this PR: https://github.com/vllm-project/vllm/pull/9549. If you get better results with --enforce-eager, then this is likely the culprit. I've seen several similar issues over the past few days.
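
As a quick check, --enforce-eager only needs to be appended to the server arguments. A trimmed-down sketch of the docker command from the issue description (most of the original flags are omitted here for brevity; add them back as needed):

docker run --runtime=nvidia --gpus '"device=4,5,6,7"' --network host \
     vllm/vllm-openai:latest \
        --port=5001 \
        --model=Qwen/Qwen2.5-72B-Instruct \
        --tensor-parallel-size=4 \
        --max-model-len=32768 \
        --enforce-eager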

pseudotensor commented 1 week ago

Got it. I can try that if I want to upgrade again, but I'll stick with 0.6.2 for this model for now.
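
For reference, pinning the image tag instead of tracking :latest keeps the container on the known-good release (sketch, assuming vLLM's usual per-release tag on Docker Hub):

docker pull vllm/vllm-openai:v0.6.2
# then use vllm/vllm-openai:v0.6.2 in place of vllm/vllm-openai:latest in the run command above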

SinanAkkoyun commented 1 week ago

I fixed my nonsense issue by installing the latest dev version of vLLM: https://github.com/vllm-project/vllm/issues/9732#issuecomment-2444769412

Maybe that fixes your issue too @pseudotensor

why11699 commented 6 days ago

Same situation when processing 32K-context input on Qwen2.5-7B. It works fine after rolling vLLM back to 0.6.2.

frei-x commented 6 days ago

I have this problem when using AWQ and GPTQ. Adding --enforce-eager solves it, but it is slower.

cedonley commented 5 days ago

The issue is resolved in main with this fix: https://github.com/vllm-project/vllm/pull/9549

You can install the nightly or use --enforce-eager until v0.6.4. You may be able to revert to 0.6.2, but I had issues with 0.6.2 due to a transformers change that breaks Qwen2.5 when you enable long context (>32k).

xinfanmeng commented 2 days ago

Same problem.