pseudotensor opened this issue 1 week ago
nvidia-smi:
ubuntu@h2ogpt-a100-node-1:~$ nvidia-smi
Mon Oct 28 19:41:45 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:0F:00.0 Off | 0 |
| N/A 43C P0 69W / 400W | 69883MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:15:00.0 Off | 0 |
| N/A 41C P0 71W / 400W | 69787MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:50:00.0 Off | 0 |
| N/A 41C P0 72W / 400W | 69787MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:53:00.0 Off | 0 |
| N/A 41C P0 67W / 400W | 69499MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:8C:00.0 Off | 0 |
| N/A 68C P0 332W / 400W | 77735MiB / 81920MiB | 96% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:91:00.0 Off | 0 |
| N/A 60C P0 318W / 400W | 77639MiB / 81920MiB | 92% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:D6:00.0 Off | 0 |
| N/A 63C P0 331W / 400W | 77639MiB / 81920MiB | 93% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:DA:00.0 Off | 0 |
| N/A 72C P0 331W / 400W | 77351MiB / 81920MiB | 94% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1815338 C /usr/bin/python3 69864MiB |
| 1 N/A N/A 1815472 C /usr/bin/python3 69768MiB |
| 2 N/A N/A 1815473 C /usr/bin/python3 69768MiB |
| 3 N/A N/A 1815474 C /usr/bin/python3 69480MiB |
| 4 N/A N/A 1980777 C /usr/bin/python3 77716MiB |
| 5 N/A N/A 1981060 C /usr/bin/python3 77620MiB |
| 6 N/A N/A 1981061 C /usr/bin/python3 77620MiB |
| 7 N/A N/A 1981062 C /usr/bin/python3 77332MiB |
+-----------------------------------------------------------------------------------------+
The other 4 GPUs are running Qwen VL 2 76B.
ubuntu@h2ogpt-a100-node-1:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
78dce1c637ec vllm/vllm-openai:latest "python3 -m vllm.ent…" 27 hours ago Up 27 hours qwen25_72b
d2918b1209aa vllm/vllm-openai:latest "python3 -m vllm.ent…" 4 weeks ago Up 5 days qwen72bvll
Even after restarting the Docker container, I get the same result.
So the above script is a fine repro. It isn't the only way, of course; all of our longer inputs fail with 0.6.3.post1.
Note that this is an extremely strong, competitive model for coding and agents, so it really needs to be a first-class citizen for the vLLM team in terms of testing, etc.
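For reference, a minimal sketch of the kind of long-input request that triggers this. It is not the attached qwentest1.py; the endpoint, API key, and served model name are assumptions for illustration only.

from openai import OpenAI

# Assumed: the qwen25_72b container exposes the OpenAI-compatible API on localhost:8000
# and serves the model under the name "Qwen/Qwen2.5-72B-Instruct".
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Build a long prompt (tens of thousands of tokens) to exercise the long-context path.
long_prompt = "Summarize the following text.\n\n" + (
    "The quick brown fox jumps over the lazy dog. " * 4000
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": long_prompt}],
    max_tokens=512,
    temperature=0.0,
)

# Per the report: coherent output on 0.6.2, nonsense on 0.6.3.post1 for long inputs.
print(response.choices[0].message.content)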
I just posted a similar issue but with totally different params. I wonder if it's related at all: issue
Facing similar problems.
I had issues with long context; they are related to the issue fixed in this PR: https://github.com/vllm-project/vllm/pull/9549. If you get better results with --enforce-eager, this is likely the culprit. I'm seeing several similar issues over the past few days.
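For anyone checking this from the offline Python API rather than the Docker server, enforce_eager is the corresponding engine argument. A rough sketch, with the model name and lengths as placeholders not taken from this thread:

from vllm import LLM, SamplingParams

# Disable CUDA graph capture to test whether the corruption is tied to the
# graph-capture path addressed by PR #9549. Equivalent to --enforce-eager on the server.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder model for illustration
    max_model_len=32768,
    enforce_eager=True,
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["<paste a long prompt here>"], params)
print(outputs[0].outputs[0].text)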
Got it. I can try that if I want to upgrade again, but I'll stick with 0.6.2 for this model for now.
I fixed my nonsense-output issue by installing the latest dev version of vLLM: https://github.com/vllm-project/vllm/issues/9732#issuecomment-2444769412
Maybe that fixes your issue too, @pseudotensor.
Same situation when processing 32K-context input on Qwen2.5-7B. It works fine after rolling vLLM back to 0.6.2.
I have this problem when using AWQ and GPTQ. Adding --enforce-eager works around it, but it is slower.
The issue is resolved in main with this fix: https://github.com/vllm-project/vllm/pull/9549
You can install the nightly or use --enforce-eager until v0.6.4. You may be able to revert to 0.6.2, but I had issues with 0.6.2 due to a transformers change that breaks Qwen2.5 when you enable long context (>32k).
same problem
Your current environment
Docker, 0.6.3.post1, 8×A100
Model Input Dumps
No response
🐛 Describe the bug
No such issues with prior vLLM 0.6.2.
Trivial queries work:
But longer inputs lead to nonsense output only with the new vLLM:
qwentest1.py.zip
Gives:
Full logs from that running state; it had just been running some benchmarks overnight.
qwen25_72b.bad.log.zip
Related or not? https://github.com/vllm-project/vllm/issues/9732