huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Excessive use of VRAM for Llama 3.1 8B #2615

Open ukito-pl opened 1 week ago

ukito-pl commented 1 week ago

System Info

Information

Tasks

Reproduction

Steps to reproduce:

  1. Run the docker compose file:

    services:
      tgi:
        container_name: tgi
        image: ghcr.io/huggingface/text-generation-inference:2.3.0
        restart: always
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities:
                    - gpu
        shm_size: '192gb'
        ports:
          - 6500:80
        environment:
          - HF_TOKEN=<your-hf-token>
          - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
          - SHARDED=true
          - NUM_SHARD=4
          - MAX_BATCH_SIZE=1
          - CUDA_MEMORY_FRACTION=1
          - MAX_INPUT_TOKENS=5000
          - MAX_TOTAL_TOKENS=6024

    Output logs:

    2024-10-07T06:30:47.292774Z INFO text_generation_launcher: Args { model_id: "meta-llama/Meta-Llama-3.1-8B-Instruct", revision: None, validation_workers: 2, sharded: Some( true, ), num_shard: Some( 4, ), quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: Some( 5000, ), max_input_length: None, max_total_tokens: Some( 6024, ), waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some( 1, ), cuda_graphs: None, hostname: "eeb1ec72b169", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], api_key: Some( "xxx", ), watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4, lora_adapters: None, usage_stats: On, }
    2024-10-07T06:30:47.293479Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
    2024-10-07T06:30:47.293484Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 5000
    2024-10-07T06:30:47.293487Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
    2024-10-07T06:30:47.293489Z INFO text_generation_launcher: Sharding model on 4 processes
    2024-10-07T06:30:47.293556Z INFO download: text_generation_launcher: Starting check and download process for meta-llama/Meta-Llama-3.1-8B-Instruct
    2024-10-07T06:30:49.757621Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
    2024-10-07T06:30:50.197536Z INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Meta-Llama-3.1-8B-Instruct
    2024-10-07T06:30:50.197749Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
    2024-10-07T06:30:50.197801Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
    2024-10-07T06:30:50.198341Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
    2024-10-07T06:30:50.198363Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
    2024-10-07T06:30:52.534518Z INFO text_generation_launcher: Using prefix caching = True
    2024-10-07T06:30:52.534550Z INFO text_generation_launcher: Using Attention = flashinfer
    2024-10-07T06:31:00.208692Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
    2024-10-07T06:31:00.209210Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
    2024-10-07T06:31:00.209701Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
    2024-10-07T06:31:00.209782Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
    2024-10-07T06:31:03.994463Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
    2024-10-07T06:31:04.012487Z INFO shard-manager: text_generation_launcher: Shard ready in 13.812430798s rank=1
    2024-10-07T06:31:04.291933Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
    2024-10-07T06:31:04.292206Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
    2024-10-07T06:31:04.292206Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
    2024-10-07T06:31:04.313117Z INFO shard-manager: text_generation_launcher: Shard ready in 14.113306239s rank=0
    2024-10-07T06:31:04.313524Z INFO shard-manager: text_generation_launcher: Shard ready in 14.113391818s rank=3
    2024-10-07T06:31:04.313770Z INFO shard-manager: text_generation_launcher: Shard ready in 14.113394813s rank=2
    2024-10-07T06:31:04.411975Z INFO text_generation_launcher: Starting Webserver
    2024-10-07T06:31:04.490925Z INFO text_generation_router_v3: backends/v3/src/lib.rs:90: Warming up model
    2024-10-07T06:31:05.160689Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
    2024-10-07T06:31:06.175895Z INFO text_generation_router_v3: backends/v3/src/lib.rs:102: Setting max batch total tokens to 1084130
    2024-10-07T06:31:06.175942Z INFO text_generation_router_v3: backends/v3/src/lib.rs:126: Using backend V3
    2024-10-07T06:31:06.175988Z INFO text_generation_router::server: router/src/server.rs:1797: Using the Hugging Face API
    2024-10-07T06:31:06.908180Z INFO text_generation_router::server: router/src/server.rs:2515: Serving revision 0e9e39f249a16976918f6564b8830bc894c89659 of model meta-llama/Llama-3.1-8B-Instruct
    2024-10-07T06:31:08.905070Z INFO text_generation_router::server: router/src/server.rs:1943: Using config Some(Llama)
    2024-10-07T06:31:08.905115Z WARN text_generation_router::server: router/src/server.rs:2090: Invalid hostname, defaulting to 0.0.0.0
    2024-10-07T06:31:08.954949Z INFO text_generation_router::server: router/src/server.rs:2477: Connected

Expected behavior

Since I have specified MAX_TOTAL_TOKENS=6024 and MAX_BATCH_SIZE=1 in the environment variables, I would expect the max batch total tokens to be 6024. Instead, as can be seen in the logs, the inferred max batch total tokens is set to 1,084,130 and the VRAM usage goes up to 160 GB! According to my calculations (based on this article), the model should use 16 GB of memory plus an extra 3 GB for 6024 tokens (about 0.5 MiB per token for this particular model); correct me if I'm wrong.
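For reference, here is that back-of-the-envelope calculation as a small script. The 16 GB weight footprint (8B parameters in bf16) and the 0.5 MiB-per-token KV-cache figure are assumptions taken from the article, not values reported by TGI:

    # Rough estimate behind the "Expected VRAM usage: 20 GB" figure below.
    # Both inputs are assumptions from the article, not TGI-reported numbers.
    params = 8e9                 # Llama 3.1 8B parameter count
    bytes_per_param = 2          # bf16 weights
    weights_gib = params * bytes_per_param / 1024**3      # ~14.9 GiB ("16 GB")

    kv_mib_per_token = 0.5       # assumed KV-cache cost per token
    max_total_tokens = 6024      # MAX_TOTAL_TOKENS from the compose file
    kv_gib = max_total_tokens * kv_mib_per_token / 1024   # ~2.9 GiB ("3 GB")

    # ~17.8 GiB in total, i.e. roughly the 20 GB estimate
    print(f"expected total: ~{weights_gib + kv_gib:.1f} GiB")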

To sum up:

Expected VRAM usage: 20 GB
Actual VRAM usage: 160 GB

What could cause this behavior? Am I doing something wrong, or is it a bug?

Narsil commented 1 day ago

TGI will always use all the allowed memory for KV-cache, to allow MANY users on the same machine.

MAX_BATCH_SIZE is not used on Nvidia targets, as mentioned in the docs: https://huggingface.co/docs/text-generation-inference/main/en/reference/launcher#maxbatchsize

If you want to control how much VRAM you use, set --cuda-memory-fraction 0.3, for instance, to use 30% of the available VRAM. TGI will adjust the number of tokens in flight automatically (Nvidia targets do not consider the number of users/requests, only the number of tokens in a batch, and that number is derived automatically).
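For example, a minimal adjustment to the compose file above (an untested sketch: CUDA_MEMORY_FRACTION is the environment-variable form of --cuda-memory-fraction already present in the original compose file, and MAX_BATCH_SIZE is dropped since it is ignored on Nvidia targets as noted above):

    environment:
      - HF_TOKEN=<your-hf-token>
      - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
      - SHARDED=true
      - NUM_SHARD=4
      - MAX_INPUT_TOKENS=5000
      - MAX_TOTAL_TOKENS=6024
      # Cap TGI at ~30% of each GPU's VRAM; the number of KV-cache tokens
      # kept in flight is reduced automatically to fit this budget.
      - CUDA_MEMORY_FRACTION=0.3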