huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Excessive use of VRAM for Llama 3.1 8B #2615

Open ukito-pl opened 1 week ago

ukito-pl commented 1 week ago

System Info

Information

Tasks

Reproduction

Steps to reproduce:

  1. Run the docker compose file:

    services:
      tgi:
        container_name: tgi
        image: ghcr.io/huggingface/text-generation-inference:2.3.0
        restart: always
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities:
                    - gpu
        shm_size: '192gb'
        ports:
          - 6500:80
        environment:
          - HF_TOKEN=<your-hf-token>
          - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
          - SHARDED=true
          - NUM_SHARD=4
          - MAX_BATCH_SIZE=1
          - CUDA_MEMORY_FRACTION=1
          - MAX_INPUT_TOKENS=5000
          - MAX_TOTAL_TOKENS=6024

    Output logs:

    2024-10-07T06:30:47.292774Z INFO text_generation_launcher: Args { model_id: "meta-llama/Meta-Llama-3.1-8B-Instruct", revision: None, validation_workers: 2, sharded: Some( true, ), num_shard: Some( 4, ), quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: Some( 5000, ), max_input_length: None, max_total_tokens: Some( 6024, ), waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some( 1, ), cuda_graphs: None, hostname: "eeb1ec72b169", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], api_key: Some( "xxx", ), watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4, lora_adapters: None, usage_stats: On, }
    2024-10-07T06:30:47.293479Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
    2024-10-07T06:30:47.293484Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 5000
    2024-10-07T06:30:47.293487Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
    2024-10-07T06:30:47.293489Z INFO text_generation_launcher: Sharding model on 4 processes
    2024-10-07T06:30:47.293556Z INFO download: text_generation_launcher: Starting check and download process for meta-llama/Meta-Llama-3.1-8B-Instruct
    2024-10-07T06:30:49.757621Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
    2024-10-07T06:30:50.197536Z INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Meta-Llama-3.1-8B-Instruct
    2024-10-07T06:30:50.197749Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
    2024-10-07T06:30:50.197801Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
    2024-10-07T06:30:50.198341Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
    2024-10-07T06:30:50.198363Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
    2024-10-07T06:30:52.534518Z INFO text_generation_launcher: Using prefix caching = True
    2024-10-07T06:30:52.534550Z INFO text_generation_launcher: Using Attention = flashinfer
    2024-10-07T06:31:00.208692Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
    2024-10-07T06:31:00.209210Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
    2024-10-07T06:31:00.209701Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
    2024-10-07T06:31:00.209782Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
    2024-10-07T06:31:03.994463Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
    2024-10-07T06:31:04.012487Z INFO shard-manager: text_generation_launcher: Shard ready in 13.812430798s rank=1
    2024-10-07T06:31:04.291933Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
    2024-10-07T06:31:04.292206Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
    2024-10-07T06:31:04.292206Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
    2024-10-07T06:31:04.313117Z INFO shard-manager: text_generation_launcher: Shard ready in 14.113306239s rank=0
    2024-10-07T06:31:04.313524Z INFO shard-manager: text_generation_launcher: Shard ready in 14.113391818s rank=3
    2024-10-07T06:31:04.313770Z INFO shard-manager: text_generation_launcher: Shard ready in 14.113394813s rank=2
    2024-10-07T06:31:04.411975Z INFO text_generation_launcher: Starting Webserver
    2024-10-07T06:31:04.490925Z INFO text_generation_router_v3: backends/v3/src/lib.rs:90: Warming up model
    2024-10-07T06:31:05.160689Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
    2024-10-07T06:31:06.175895Z INFO text_generation_router_v3: backends/v3/src/lib.rs:102: Setting max batch total tokens to 1084130
    2024-10-07T06:31:06.175942Z INFO text_generation_router_v3: backends/v3/src/lib.rs:126: Using backend V3
    2024-10-07T06:31:06.175988Z INFO text_generation_router::server: router/src/server.rs:1797: Using the Hugging Face API
    2024-10-07T06:31:06.908180Z INFO text_generation_router::server: router/src/server.rs:2515: Serving revision 0e9e39f249a16976918f6564b8830bc894c89659 of model meta-llama/Llama-3.1-8B-Instruct
    2024-10-07T06:31:08.905070Z INFO text_generation_router::server: router/src/server.rs:1943: Using config Some(Llama)
    2024-10-07T06:31:08.905115Z WARN text_generation_router::server: router/src/server.rs:2090: Invalid hostname, defaulting to 0.0.0.0
    2024-10-07T06:31:08.954949Z INFO text_generation_router::server: router/src/server.rs:2477: Connected

Expected behavior

Since I have specified MAX_TOTAL_TOKENS=6024 and MAX_BATCH_SIZE=1 in the environment variables, I would expect the max batch total tokens to be 6024. Instead, as can be seen in the logs, the inferred max batch total tokens is set to 1,084,130 and the VRAM usage goes up to 160 GB! According to my calculations (based on this article), the model should use 16 GB of memory plus an extra 3 GB for 6024 tokens (about 0.5 MiB per token for this particular model); correct me if I'm wrong.
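For reference, here is that back-of-the-envelope calculation as a small script. The 16 GB weight footprint (8B parameters in bf16) and the 0.5 MiB-per-token KV-cache figure are assumptions taken from the article, not values reported by TGI:

    # Rough estimate behind the "Expected VRAM usage: 20 GB" figure below.
    # Both inputs are assumptions from the article, not TGI-reported numbers.
    params = 8e9                 # Llama 3.1 8B parameter count
    bytes_per_param = 2          # bf16 weights
    weights_gib = params * bytes_per_param / 1024**3      # ~14.9 GiB ("16 GB")

    kv_mib_per_token = 0.5       # assumed KV-cache cost per token
    max_total_tokens = 6024      # MAX_TOTAL_TOKENS from the compose file
    kv_gib = max_total_tokens * kv_mib_per_token / 1024   # ~2.9 GiB ("3 GB")

    # ~17.8 GiB in total, i.e. roughly the 20 GB estimate
    print(f"expected total: ~{weights_gib + kv_gib:.1f} GiB")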

To sum up:

Expected VRAM usage: 20 GB
Actual VRAM usage: 160 GB

What could cause this behavior? Am I doing something wrong, or is it a bug?

Narsil commented 1 day ago

TGI will always use all the allowed memory for KV-cache, to allow MANY users on the same machine.

MAX_BATCH_SIZE is not used on Nvidia targets, as mentioned in the docs: https://huggingface.co/docs/text-generation-inference/main/en/reference/launcher#maxbatchsize

If you want to control how much VRAM you use, set --cuda-memory-fraction 0.3, for instance, to use 30% of the available VRAM. TGI will adjust the number of tokens in flight automatically (Nvidia targets do not consider the number of users/requests, only the number of tokens in a batch, and that number is derived automatically).
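For example, a minimal adjustment to the compose file above (an untested sketch: CUDA_MEMORY_FRACTION is the environment-variable form of --cuda-memory-fraction already present in the original compose file, and MAX_BATCH_SIZE is dropped since it is ignored on Nvidia targets as noted above):

    environment:
      - HF_TOKEN=<your-hf-token>
      - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
      - SHARDED=true
      - NUM_SHARD=4
      - MAX_INPUT_TOKENS=5000
      - MAX_TOTAL_TOKENS=6024
      # Cap TGI at ~30% of each GPU's VRAM; the number of KV-cache tokens
      # kept in flight is reduced automatically to fit this budget.
      - CUDA_MEMORY_FRACTION=0.3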