huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

TGI 2.0.3 fails to serve CodeLlama models that 2.0.1 supports #1969

Open KCFindstr opened 1 month ago

KCFindstr commented 1 month ago

System Info

Running a TGI 2.0.3 Docker container on an 8x NVIDIA L4 VM. Command:

MODEL=codellama/CodeLlama-70b-Python-hf

docker run \
  -m 320G \
  --shm-size=40G \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e MODEL_ID=$MODEL \
  -e NUM_SHARD=8 \
  -e MAX_INPUT_TOKENS=1024 \
  -e MAX_TOTAL_TOKENS=2048 \
  -e MAX_BATCH_PREFILL_TOKENS=2048 \
  -e TRUST_REMOTE_CODE=true \
  -e JSON_OUTPUT=true \
  -e PORT=8080 \
  -p 7080:8080 \
  --runtime=nvidia \
  $IMAGE

$IMAGE is a TGI 2.0.3 docker image.

Reproduction

  1. Run the provided command.

    meta-llama/Meta-Llama-3-70B also failed with a similar error. Not sure about other models.

  2. Get the following error:

torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1712608935911/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2395, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA malloc 512 bytes

Is this a CUDA OOM error?

Expected behavior

The model is loaded successfully on 8 L4s with the same command and a TGI 2.0.1 container. Are there any changes to default settings that might have caused increased GPU RAM usage?

philschmid commented 1 month ago

Can you try with the latest available public container, ghcr.io/huggingface/text-generation-inference:latest, and let us know if the issue still exists?

KCFindstr commented 1 month ago

Can you try with the latest available public container, ghcr.io/huggingface/text-generation-inference:latest, and let us know if the issue still exists?

I still get a CUDA error with ghcr.io/huggingface/text-generation-inference:latest:

RuntimeError: CUDA error: out of memory

stefanobranco commented 1 month ago

Just on a hunch, but I've had some stability issues that I believe have to do with inter-GPU communication in these newest versions (I haven't had time to dig into it enough to rule out something on our end or to reproduce it consistently). The issues went away completely when I disabled CUDA graphs (--cuda-graphs 0), so maybe that's worth a try? It's really just a shot in the dark at this point, though.

KCFindstr commented 1 month ago

Hi @stefanobranco, thanks for the suggestion. The model can be loaded with -e CUDA_GRAPHS=0 on the latest TGI Docker image (adjusted command below). But I do see that CUDA graphs are enabled in 2.0.1 and the model can still be served correctly. I'm wondering whether this is a temporary mitigation and whether there is going to be a fix to reduce GPU memory usage?
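
For reference, the run that worked differs from the original command only by the extra CUDA_GRAPHS variable. A minimal sketch of the adjusted command, reusing the same $MODEL and with $IMAGE pointing at the latest TGI image:

# Same setup as the original command; CUDA_GRAPHS=0 disables CUDA graph capture during warmup.
docker run \
  -m 320G \
  --shm-size=40G \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e MODEL_ID=$MODEL \
  -e NUM_SHARD=8 \
  -e MAX_INPUT_TOKENS=1024 \
  -e MAX_TOTAL_TOKENS=2048 \
  -e MAX_BATCH_PREFILL_TOKENS=2048 \
  -e TRUST_REMOTE_CODE=true \
  -e JSON_OUTPUT=true \
  -e CUDA_GRAPHS=0 \
  -e PORT=8080 \
  -p 7080:8080 \
  --runtime=nvidia \
  $IMAGE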

drbh commented 3 weeks ago

Hi @KCFindstr, I've just tried to reproduce the issue on a machine with 8 NVIDIA A10G GPUs, which have the same total VRAM (8 x 24.1 GB = 192.8 GB). I can confirm that the OOM is caused by CUDA_GRAPHS attempting to allocate more space than TGI did prior to version 2.0.0. Is it possible that the version you're using is older than 2.0.0 or that CUDA_GRAPHS is disabled?

I tested codellama/CodeLlama-70b-Python-hf with the following versions: 1.4.5, 2.0.0, 2.0.1 and 2.0.4 (command used below). In all versions >= 2.0.0 the container OOMs if the default CUDA_GRAPHS (1, 2, 4, 8, 16, 32) are used with that combination of model and hardware.

CUDA_GRAPHS became the default in version 2.0.0 and attempts to allocate additional space in the initial warmup step. As noted above, decreasing the number of graphs lets the container load as expected.

In version 1.4.5, CUDA graphs were experimental and could be enabled by setting ENABLE_CUDA_GRAPHS=true. It is possible to reproduce the OOM by enabling them in 1.4.5 (see the sketch below), which suggests that the issue is related to the amount of space CUDA_GRAPHS attempts to allocate after the model has been loaded.
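
A minimal sketch of that 1.4.5 reproduction, assuming the ghcr.io 1.4.5 image and the same model/sharding (other flags from the original command omitted for brevity):

# Opt in to the then-experimental CUDA graphs on TGI 1.4.5 to trigger the same warmup OOM.
docker run \
  --shm-size=40G \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e MODEL_ID=codellama/CodeLlama-70b-Python-hf \
  -e NUM_SHARD=8 \
  -e ENABLE_CUDA_GRAPHS=true \
  -e PORT=8080 \
  -p 7080:8080 \
  --runtime=nvidia \
  ghcr.io/huggingface/text-generation-inference:1.4.5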

I don't think this is a regression; rather, it is a limitation of the hardware + model + configuration (using CUDA graphs). However, I think we need better messaging/smarter defaults to avoid this case. Regarding reducing GPU usage: we're always trying to get the most out of the limited GPU space, and I'm going to look into ways we can reduce overhead in general to allow for more optimizations.

Command used:

MODEL=codellama/CodeLlama-70b-Python-hf
IMAGE=ghcr.io/huggingface/text-generation-inference:2.0.0

docker run \
  -m 320G \
  --shm-size=40G \
  -v /nvme0n1/Models/:/data \
  -e HUGGINGFACE_HUB_CACHE=/data \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e MODEL_ID=$MODEL \
  -e NUM_SHARD=8 \
  -e MAX_INPUT_TOKENS=1024 \
  -e MAX_TOTAL_TOKENS=2048 \
  -e MAX_BATCH_PREFILL_TOKENS=2048 \
  -e TRUST_REMOTE_CODE=true \
  -e JSON_OUTPUT=true \
  -e PORT=8080 \
  -p 7080:8080 \
  --runtime=nvidia \
  $IMAGE

KCFindstr commented 3 weeks ago

@drbh Thanks for the investigation! However, I retried with TGI 2.0.1 and the model can still be loaded successfully, which is different from your observation.

{"timestamp":"2024-06-12T23:25:59.509912Z","level":"INFO","fields":{"message":"Args { model_id: \"codellama/CodeLlama-70b-Python-hf\", revision: None, validation_workers: 2, sharded: None, num_shard: Some(8), quantize: None, speculate: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: Some(1024), max_input_length: None, max_total_tokens: Some(2048), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(2048), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: \"ecdb3588f8ab\", port: 8080, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }"},"target":"text_generation_launcher"} {"timestamp":"2024-06-12T23:25:59.513288Z","level":"INFO","fields":{"message":"Token file not found \"/root/.cache/huggingface/token\"","log.target":"hf_hub","log.module_path":"hf_hub","log.file":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs","log.line":55},"target":"hf_hub"} {"timestamp":"2024-06-12T23:25:59.990069Z","level":"INFO","fields":{"message":"Using default cuda graphs [1, 2, 4, 8, 16, 32]"},"target":"text_generation_launcher"} {"timestamp":"2024-06-12T23:25:59.990095Z","level":"WARN","fields":{"message":"trust_remote_code is set. Trusting that model codellama/CodeLlama-70b-Python-hf do not contain malicious code."},"target":"text_generation_launcher"} {"timestamp":"2024-06-12T23:25:59.990101Z","level":"INFO","fields":{"message":"Sharding model on 8 processes"},"target":"text_generation_launcher"}

From the logs, it looks like CUDA graph is enabled. (Using default cuda graphs [1, 2, 4, 8, 16, 32])

This is the VRAM usage when the model is loaded and being served:

Wed Jun 12 23:28:55 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   76C    P0             36W /   72W |   21948MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L4                      Off |   00000000:00:04.0 Off |                    0 |
| N/A   77C    P0             37W /   72W |   21948MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L4                      Off |   00000000:00:05.0 Off |                    0 |
| N/A   73C    P0             34W /   72W |   21948MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L4                      Off |   00000000:00:06.0 Off |                    0 |
| N/A   77C    P0             35W /   72W |   21948MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA L4                      Off |   00000000:80:00.0 Off |                    0 |
| N/A   66C    P0             31W /   72W |   21948MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA L4                      Off |   00000000:80:01.0 Off |                    0 |
| N/A   76C    P0             37W /   72W |   21948MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA L4                      Off |   00000000:80:02.0 Off |                    0 |
| N/A   72C    P0             32W /   72W |   21948MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA L4                      Off |   00000000:80:03.0 Off |                    0 |
| N/A   76C    P0             36W /   72W |   21948MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

This is the TGI 2.0.1 image I used:

us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-0.ubuntu2204.py310:m121

Do you know if there are any other changes from 2.0.1 -> 2.0.3 that might lead to increased VRAM usage?

drbh commented 3 weeks ago

@KCFindstr that's very interesting! Would you be able to share all of the logs up until the server is ready to receive requests? Also, does the model run on 2.0.2?

I'm not aware of anything major at the moment, but I will take another look soon for anything that could impact VRAM usage.

KCFindstr commented 3 weeks ago

@drbh I tested the containers hosted on ghcr.io:

The logs from us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-0.ubuntu2204.py310:m121 are attached here: tgi_2_0_1.log

drbh commented 3 weeks ago

Thank you for sharing. I've started to look through the changes between the two versions, and the issue is possibly related to a change in how we mask the frequency penalty. However, I cannot confirm this yet and am going to investigate further tomorrow. I will post an update soon.

drbh commented 2 weeks ago

Hi @KCFindstr, I've continued to debug the issue and cannot find a specific change within TGI that would use more memory in 2.0.2, although I have some findings/recommendations/ideas below.

Where the OOM happens and how to avoid it:

During warmup we attempt to allocate as many blocks of kv_cache memory as possible. This step checks the available memory and estimates a number of blocks. The fraction of total hardware memory used can be configured with --cuda-memory-fraction, which defaults to 1.0. It should be possible to load the model by decreasing this value to ~0.9 (depending on your system).
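
As a rough back-of-the-envelope illustration (my numbers, assuming the fraction simply withholds that share of each GPU's memory from the block allocation):

  headroom per GPU ≈ (1 - cuda_memory_fraction) × VRAM ≈ (1 - 0.9) × 24 GB ≈ 2.4 GB per L4

which is roughly the space left for the subsequent CUDA graph initialization to fit into.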

Continuing the warmup process, after the blocks are allocated, CUDA graphs are initialized. At this point, if there is not enough memory left for the graphs (because of the block allocation), TGI will OOM.

Depending on the model, the CUDA graphs step will allocate more memory when initializing. In the case of codellama/CodeLlama-70b-Python-hf, the amount of memory needed is greater than the space left over after allocating the blocks.

The solution is to either decrease the optimistic kv_cache allocation with --cuda-memory-fraction or decrease the number of --cuda-graphs.

Personally, I'd recommend decreasing --cuda-memory-fraction by small amounts until your GPUs are fully saturated and the model loads. On 8 A10Gs I can run the following command and get ~99% (23.9 GB) utilization of each GPU. This uses the default settings max_input_tokens=4095, max_total_tokens=4096, max_batch_prefill_tokens=4145 and cuda_graphs=[1, 2, 4, 8, 16, 32].

text-generation-launcher \
--model-id codellama/CodeLlama-70b-Python-hf \
--num-shard 8 \
--cuda-memory-fraction .93
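
For a Docker setup like the one in this issue, a sketch of the equivalent is to pass the same setting as an environment variable (the launcher appears to read CUDA_MEMORY_FRACTION from the environment as well, but please double-check on your image):

# Lower the kv_cache block allocation slightly so the CUDA graph warmup has headroom.
docker run \
  -m 320G \
  --shm-size=40G \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e MODEL_ID=codellama/CodeLlama-70b-Python-hf \
  -e NUM_SHARD=8 \
  -e MAX_INPUT_TOKENS=1024 \
  -e MAX_TOTAL_TOKENS=2048 \
  -e MAX_BATCH_PREFILL_TOKENS=2048 \
  -e CUDA_MEMORY_FRACTION=0.93 \
  -e PORT=8080 \
  -p 7080:8080 \
  --runtime=nvidia \
  ghcr.io/huggingface/text-generation-inference:latest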

In terms of the origin of the change: in 2.0.2, torch was updated from 2.1.1 to 2.3.0 in TGI, which includes many upstream changes that likely affected how, and how much, memory is allocated.

Would you kindly try reducing the memory fraction to fit the model on your hardware with cuda graphs?

I'm going to continue exploring the memory allocation and will follow up with better error messages in the future; however, I believe reducing this value should resolve this loading issue.

KCFindstr commented 2 weeks ago

Thanks @drbh! Setting --cuda-memory-fraction to 0.93 works. However, IIUC this value should depend on the model, its length settings, and the total GPU memory available, so ideally it shouldn't be a fixed value for every model and machine configuration?

stefanobranco commented 2 weeks ago

It may also be worth mentioning that torch 2.3.1 contains a few fixes for various memory leaks, e.g. https://github.com/pytorch/pytorch/pull/124238. From my understanding this mainly affects torch.compile, so it might not be relevant, but I'm also not knowledgeable enough to really estimate the impact of these fixes. It might be worth checking whether it has any impact, though.