huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

TGI crashes on multiple GPUs #2207

Closed RohanSohani30 closed 4 days ago

RohanSohani30 commented 1 month ago

System Info

I am trying to run TGI on Docker using 8 GPUs with 16 GB each (in-house server). Docker works fine when using a single GPU, but my server crashes when using all GPUs. Is there any other way to do this? P.S. I need to use all GPUs so I can load big models; with a single GPU I can only use small models with a lower max-input-length.

Information

Tasks

Reproduction

  1. docker run --gpus all --name tgi --shm-size 1g --cpus="5.0" --rm --runtime=nvidia -e HUGGING_FACE_HUB_TOKEN=*** -p 8060:80 -v '$PATH':/data ghcr.io/huggingface/text-generation-inference --model-id meta-llama/Meta-Llama-3-8B --num-shard 8 --max-input-length 14000 --max-batch-prefill-tokens 14000 --max-total-tokens 16000

Expected behavior

INFO text_generation_router: router/src/main.rs:242: Using the Hugging Face API to retrieve tokenizer config
INFO text_generation_router: router/src/main.rs:291: Warming up model
WARN text_generation_router: router/src/main.rs:306: Model does not support automatic max batch total tokens
INFO text_generation_router: router/src/main.rs:328: Setting max batch total tokens to 16000
INFO text_generation_router: router/src/main.rs:329: Connected

bwhartlove commented 1 month ago

Seeing a similar issue on my end.

Hugoch commented 1 month ago

@RohanSohani30 Can you share the output of TGI when it errors?

HoKim98 commented 1 month ago

I had a similar problem to #2192 and was able to solve it with the --cuda-graphs 0 approach from #2099, as sketched below. This obviously caused major performance problems, but it was at least a better option than being broken.
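
A minimal sketch of that workaround, applied to the reproduction command above; --cuda-graphs 0 disables CUDA graph capture, and all other values are simply carried over from the original command and may need adjusting for your setup:

  docker run --gpus all --name tgi --shm-size 1g --cpus="5.0" --rm --runtime=nvidia \
    -e HUGGING_FACE_HUB_TOKEN=*** -p 8060:80 -v '$PATH':/data \
    ghcr.io/huggingface/text-generation-inference \
    --model-id meta-llama/Meta-Llama-3-8B --num-shard 8 \
    --max-input-length 14000 --max-batch-prefill-tokens 14000 --max-total-tokens 16000 \
    --cuda-graphs 0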

RohanSohani30 commented 1 month ago

@RohanSohani30 Can you share the output of TGI when it errors?

There are no errors, but the system crashes while the model is warming up.

Hugoch commented 1 month ago

Yeah, it seems related to CUDA graphs and a bug introduced in NCCL 2.20.5. Can you retry with the latest Docker image now that #2099 has been merged?
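
For reference, the latest image can be pulled explicitly before re-running; the generic latest tag is assumed here, and a pinned release tag may be preferable:

  docker pull ghcr.io/huggingface/text-generation-inference:latest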

RohanSohani30 commented 1 month ago

Yeah, it seems related to CUDA graphs and a bug introduced in NCCL 2.20.5. Can you retry with the latest Docker image now that #2099 has been merged?

I am using the latest Docker image and still facing the same issue. I found one quick fix: when I use 2 or 4 GPUs, the TGI image runs with --runtime=nvidia --env BUILD_EXTENSIONS=False --env NCCL_SHM_DISABLE=1, but there is a catch. Say a model takes 11 GB of GPU memory to load on a single GPU; when I use 2 GPUs it takes more than 11 GB per GPU, so the total goes above 25 GB. I used --cuda-memory-fraction to limit memory usage per GPU. I want to load a model across multiple GPUs so I can load big models. Am I missing something?
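
A rough sketch of the two-GPU workaround described above, assuming the same image and model as the reproduction command; --num-shard 2 matches the two-GPU case, and the 0.45 memory fraction is only an illustrative value that would need tuning:

  docker run --gpus all --name tgi --shm-size 1g --rm --runtime=nvidia \
    --env BUILD_EXTENSIONS=False --env NCCL_SHM_DISABLE=1 \
    -e HUGGING_FACE_HUB_TOKEN=*** -p 8060:80 -v '$PATH':/data \
    ghcr.io/huggingface/text-generation-inference \
    --model-id meta-llama/Meta-Llama-3-8B --num-shard 2 \
    --cuda-memory-fraction 0.45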

Hugoch commented 1 month ago

If disabling SHM solves the issue, it means there is a problem with the way your system handles SHM. How much RAM do you have on the machine? If I understand correctly, loading the model on 1 GPU takes 11 GB of VRAM, while you OOM when using 2 GPUs?

@HoKim98 did the latest Docker image make it work?

HoKim98 commented 1 month ago

@Hugoch It seems to be working! Ran a 10-minute stress test and no errors were found.

RohanSohani30 commented 1 month ago

If disabling SHM solves the issue, it means there is a problem with the way your system handles SHM. How much RAM do you have on the machine? If I understand correctly, loading the model on 1 GPU takes 11 GB of VRAM, while you OOM when using 2 GPUs?

@HoKim98 did the latest Docker image make it work?

1 TB of RAM with 8 x 16 GB of VRAM. While using 2 GPUs I get OOM if the model is big (above 22B). There is another scenario: when I load a model using the TGI CLI, I can load big models on all 8 GPUs without OOM, but tokens per second are very low, less than 1. Using the CLI, the model is distributed across all GPUs.

Hugoch commented 1 month ago

Llama-3-8B has a context of 8k, so you probably want to reduce max-total-tokens and max-input-length. Try setting a low max-batch-total-tokens to check whether you can load the model. If that works, you can increase it incrementally until you hit OOM.
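
As an illustration, the launch parameters could be brought within Llama-3-8B's 8192-token context along these lines; the specific numbers are assumptions to start from, not tested recommendations:

  docker run --gpus all --name tgi --shm-size 1g --rm --runtime=nvidia \
    -e HUGGING_FACE_HUB_TOKEN=*** -p 8060:80 -v '$PATH':/data \
    ghcr.io/huggingface/text-generation-inference \
    --model-id meta-llama/Meta-Llama-3-8B --num-shard 8 \
    --max-input-length 4096 --max-total-tokens 8192 \
    --max-batch-prefill-tokens 4096 --max-batch-total-tokens 8192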

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.