huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Configurable NCCL timeout #654

Closed tienthanhdhcn closed 2 months ago

tienthanhdhcn commented 11 months ago

Feature request

Is there a way to make the NCCL timeout configurable, as we often get timeout problems with the starcoder model?

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1288224, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 66470 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

https://github.com/huggingface/text-generation-inference/blob/5a1512c0253e759fb07142029127292d639ab117/server/text_generation_server/utils/dist.py#L53

Motivation

The goal is to fix the NCCL timeout problem described above.

Your contribution

It is quite easy to make it configurable with an environment variable, e.g. NCCL_TIMEOUT. If that is OK, I can create a PR.
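For illustration, a minimal sketch of what such a change might look like, assuming the timeout is currently hard-coded to 60 seconds where the NCCL process group is initialized; the NCCL_TIMEOUT variable name is the one proposed above, not an existing option:

```python
import os
from datetime import timedelta

import torch

# Sketch only: read the timeout (in seconds) from the proposed NCCL_TIMEOUT
# env variable, falling back to the current 60-second default.
timeout_s = int(os.getenv("NCCL_TIMEOUT", "60"))

# Assumes the usual env:// rendezvous, i.e. MASTER_ADDR / MASTER_PORT are set
# alongside WORLD_SIZE and RANK, as in a standard torch.distributed launch.
torch.distributed.init_process_group(
    backend="nccl",
    world_size=int(os.getenv("WORLD_SIZE", "1")),
    rank=int(os.getenv("RANK", "0")),
    timeout=timedelta(seconds=timeout_s),
)
```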

OlivierDehaene commented 11 months ago

Increasing the timeout will only make you crash later; it will not fix the issue. When do you have this issue?

tienthanhdhcn commented 11 months ago

Thanks @OlivierDehaene, it is quite random, with no specific input leading to it. And the Docker container crashes after that.

OlivierDehaene commented 11 months ago

Usually NCCL times out because one of the shards OOMs and the other shards end up waiting indefinitely for the OOMed shard. Can you check whether this is what is happening in your case? The solution then is to decrease --max-batch-total-tokens.
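If the logs do not make the OOM obvious, one rough hypothetical check (not part of TGI) is to watch per-GPU free memory from a separate process while the server is under load:

```python
import torch

# Hypothetical diagnostic: print free/total memory for every visible GPU,
# to see whether one shard is running much closer to OOM than the others.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```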

rishu931997 commented 11 months ago

I faced this issue today as well. After digging around for some time, I found that setting the environment variable NCCL_P2P_DISABLE=1 fixes the issue. Try it and see if it works for you.
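Before disabling P2P outright, one hypothetical sanity check (not part of TGI) is to ask PyTorch whether each GPU pair reports peer access at all:

```python
import torch

# Hypothetical check: report whether each GPU pair advertises peer-to-peer
# access, which NCCL relies on unless NCCL_P2P_DISABLE=1 is set.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: P2P {'available' if ok else 'not available'}")
```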

tienthanhdhcn commented 11 months ago

Thank you so much, let me try it

stefanobranco commented 10 months ago

I'm having the same issue, and I can't quite figure it out.

docker run --gpus '"device=0,1"' --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN -p 8000:8000 -v /mnt/machinelearning/:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id meta-llama/Llama-2-7b-chat-hf --sharded true

I'm a bit out of my depth here, but here's what I found so far:

Both (or all) GPUs go to 100% utilization and memory usage increases a bit, but never beyond that, so it doesn't seem like an OOM issue.

[screenshot: GPU utilization and memory usage]

It happens on our H100 PCIe cards; unfortunately I have nothing else to compare to. From what I can tell, P2P should be working fine and throughput is high, so disabling it neither solves the issue nor seems sensible for us. I've tried setting various NCCL_P2P_LEVEL values, with no success.

Model size seems to have no impact; as you can see, this happens even with models that should easily fit on a single GPU.

OlivierDehaene commented 10 months ago

What type of server is it? How are the H100s interconnected?

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.