[Bug]: NCCL timed out during inference

vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

https://docs.vllm.ai

Apache License 2.0

[Bug]: NCCL timed out during inference #4653

Open enkiid opened 6 months ago

enkiid commented 6 months ago

Your current environment

Using:

vllm 0.4.1
nccl 2.18.1
pytorch 2.2.1

🐛 Describe the bug

During inference I sometimes get this error:

(RayWorkerWrapper pid=2376582) [rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50404, OpType=GATHER, NumelIn=8000, NumelOut=0, Timeout(ms)=600000) ran for 600327 milliseconds before timing out.

I haven't seen it in earlier versions of vLLM. Any thoughts?
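For anyone collecting these errors across many runs, here is a minimal sketch of pulling the fields out of the watchdog line so timeouts can be grouped by rank and collective type. The regex is written against the single log line quoted above and may not match other PyTorch versions; the function name `parse_watchdog_timeout` is hypothetical, not part of vLLM or PyTorch.

```python
import re

# Regex modeled on the one watchdog line quoted in this issue (assumption:
# other PyTorch versions may format the message differently).
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq_num>\d+), OpType=(?P<op_type>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) "
    r"ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return the timeout fields as a dict, or None if the line does not match."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Every captured field is numeric except the collective's name.
    return {
        key: (value if key == "op_type" else int(value))
        for key, value in fields.items()
    }


if __name__ == "__main__":
    sample = (
        "(RayWorkerWrapper pid=2376582) [rank1]:[E ProcessGroupNCCL.cpp:523] "
        "[Rank 1] Watchdog caught collective operation timeout: "
        "WorkNCCL(SeqNum=50404, OpType=GATHER, NumelIn=8000, NumelOut=0, "
        "Timeout(ms)=600000) ran for 600327 milliseconds before timing out."
    )
    print(parse_watchdog_timeout(sample))
```

In the line above, the collective is a GATHER on rank 1 that ran for the full 600000 ms timeout (10 minutes), which suggests a hang rather than a merely slow operation.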

Ch3ngY1 commented 6 months ago

I have the same issue; it occurs randomly on my dataset. Environment: vllm 0.4.1, torch 2.2.0+cu118.

DefTruth commented 6 months ago

I have encountered the same issue. Running with --disable-custom-all-reduce and --enforce-eager worked for me.

changyuanzhangchina commented 6 months ago

Please refer to https://github.com/vllm-project/vllm/issues/4430

Set --disable-custom-all-reduce
Set --enforce-eager (may be unnecessary)
Update to a build that includes https://github.com/vllm-project/vllm/pull/4557

These three changes together solved the watchdog problem for me. Before this, the NCCL watchdog error happened several times per day; now it works well.
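As a rough sketch of how the two flags could be passed when launching the OpenAI-compatible server, the helper below only assembles the command line and does not import vLLM. The helper name `build_vllm_server_command`, the model name `my-org/my-model`, and the tensor-parallel size of 2 are placeholders, not values from this thread. Both flags are plain switches, so they are passed without a value.

```python
import shlex


def build_vllm_server_command(
    model,
    tensor_parallel_size,
    disable_custom_all_reduce=True,
    enforce_eager=True,
):
    """Assemble a vLLM API server command with the workaround flags from this thread."""
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    # Fall back to NCCL all-reduce instead of vLLM's custom all-reduce kernel.
    if disable_custom_all_reduce:
        cmd.append("--disable-custom-all-reduce")
    # Run in eager mode (no CUDA graph capture); reported above as possibly unnecessary.
    if enforce_eager:
        cmd.append("--enforce-eager")
    return cmd


if __name__ == "__main__":
    # Placeholder model name and parallelism; substitute your own.
    print(shlex.join(build_vllm_server_command("my-org/my-model", 2)))
```

Since --enforce-eager was reported as possibly unnecessary, it may be worth trying with `enforce_eager=False` first, because eager mode usually costs some throughput.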

yunfeng-scale commented 5 months ago

We're seeing this on 0.4.2 as well, with Mixtral 8x22B. --disable-custom-all-reduce resolves the problem.

yunfeng-scale commented 5 months ago

Can we disable custom all-reduce by default again?

syr-cn commented 5 months ago

> I have encountered the same issue. Running with --disable-custom-all-reduce and --enforce-eager worked for me.

Works for me! Thanks a lot!

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!