[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 69027 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
Just retry; NCCL is a bit picky at times, but this isn't an issue in TGI itself. Maybe try not running Docker in daemon mode? (There may be an old NCCL process still up somewhere interfering with the current one.)
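For reference, here is a minimal sketch of what a clean NCCL lifecycle looks like in PyTorch; a process that dies without calling destroy_process_group can leave rendezvous or GPU state behind for the next launch (the structure below is illustrative, not TGI's actual internals):

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for us.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Rendezvous over TCP; this is the step that fails if the
    # master port is still held by a stale process.
    dist.init_process_group(backend="nccl")
    try:
        pass  # collective work would go here
    finally:
        # Tear the group down even on error so no NCCL or
        # rendezvous state outlives this process.
        dist.destroy_process_group()

if __name__ == "__main__":
    main()
```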
After running it multiple times I ended up receiving a slightly different error message with an extra line thrown in, though I'm not sure how much help this will be.
2023-10-05T23:40:07.398435Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
-> [W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).
You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 69108 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 69108 milliseconds before timing out. rank=1
-> [W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).
There is some process already using that port; something in your setup is not cleaning up properly when it finishes.
You can try playing around with these settings to find an unused port: https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#shardudspath
(Probably better to clean up whatever is using those ports if you can, though.)
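If you just need an unused port to hand to the launcher, the OS can pick one for you; a quick stdlib-only sketch (nothing TGI-specific here, and the flag you pass it to would be e.g. --master-port):

```python
import socket

def port_in_use(port: int, host: str = "localhost") -> bool:
    """True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

def find_free_port() -> int:
    """Ask the kernel for an ephemeral port that is currently unused."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 = let the OS pick
        return s.getsockname()[1]

if __name__ == "__main__":
    print("29500 in use:", port_in_use(29500))
    print("free port:", find_free_port())
```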
@penguin-jeff Hi, I am getting the same error but with WizardLM/WizardLM-70B-V1.0. Did you find a solution?
First error:
INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 68089 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 68089 milliseconds before timing out. rank=2
ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=2
ERROR text_generation_launcher: Shard 2 failed to start
INFO text_generation_launcher: Shutting down shards
ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Second error:
You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 68122 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 68122 milliseconds before timing out. rank=3
ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=3
I also tried setting a different --master-port, but still got the second error.
Bumping the issue. I am also unable to load Llama-2-7b-chat with 4 shards. Strangely enough, 1-2 shards work fine. This is on a node with 4x A6000s.
That really seems like an issue with your provider/host. I would contact them.
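One way to tell whether the problem is TGI or NCCL/the host itself is to run a bare all-reduce across all four GPUs outside of TGI. A minimal sketch, assuming torchrun is available (run as torchrun --nproc_per_node=4 nccl_check.py; the filename is just illustrative):

```python
import datetime
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Short timeout so a hang surfaces in a minute, not ten.
    dist.init_process_group(backend="nccl",
                            timeout=datetime.timedelta(seconds=60))

    # The same collective the watchdog reports (OpType=ALLREDUCE).
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    torch.cuda.synchronize()

    print(f"rank {dist.get_rank()}: all_reduce ok, value={x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this hangs with 4 processes but passes with 2, the fault is below TGI; setting NCCL_DEBUG=INFO in the environment will show which transport NCCL is choosing.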
System Info
TGI Version - 1.0.3
Model - meta-llama/Llama-2-7b-chat-hf
AWS instance - g5.12xlarge
OS version - Ubuntu 22.04.3 LTS
NVIDIA-SMI - 535.113.01, Driver Version 535.113.01, CUDA Version 12.2
GPU - NVIDIA A10G (x4)
Docker - version 24.0.6, build ed223bc
Reproduction
Run the following docker command on a g5.12xlarge in AWS
Error logs after running docker logs container_id
Expected behavior
The model is sharded across 2 GPUs and the server starts.
Note: this was working for me until around September 28th for some reason. I have also tried this with TGI 1.1.0 and on a new g5.12xlarge instance, with the same issue.
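For anyone else debugging this on a multi-GPU instance: all-reduce hangs like the ones above are often caused by a broken GPU peer-to-peer path, so a small diagnostic that prints the NCCL version and the P2P access matrix can help rule that out (a sketch using only public torch APIs):

```python
import torch

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("nccl:", torch.cuda.nccl.version())

n = torch.cuda.device_count()
print("gpus:", n)

# An "N" off the diagonal means NCCL may have to fall back to a
# slower, or on a bad host a broken, transport between those GPUs.
for i in range(n):
    row = ["." if i == j
           else ("Y" if torch.cuda.can_device_access_peer(i, j) else "N")
           for j in range(n)]
    print(f"gpu{i}: {' '.join(row)}")
```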