Open ukito-pl opened 1 month ago
TGI will always use all the allowed memory for the KV-cache, to allow MANY users on the same machine.
MAX_BATCH_SIZE is not used on NVIDIA targets, as mentioned in the docs: https://huggingface.co/docs/text-generation-inference/main/en/reference/launcher#maxbatchsize
If you want to control how much VRAM you use, you need to set --cuda-memory-fraction 0.3, for instance, to use 30% of the available VRAM. TGI will then adjust the number of tokens in flight automatically (NVIDIA targets do not consider the number of users/requests, only the number of tokens in a batch, and that number is derived automatically).
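A minimal sketch of such a launch (the image tag and port mapping are placeholders and the model id is the one from this report; the point is only the --cuda-memory-fraction flag):

```bash
# Cap TGI at ~30% of each GPU's VRAM; the token budget (max_batch_total_tokens)
# is then derived automatically from that smaller memory budget.
docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
    --cuda-memory-fraction 0.3
```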
System Info
Information
Tasks
Reproduction
Steps to reproduce:
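Roughly (the exact command is not preserved here; the image tag, port/volume mapping, and HF_TOKEN are assumptions, while the model id, shard count, and token limits match the Args dump in the logs below):

```bash
# Assumed invocation; HF_TOKEN is presumed needed for the gated Llama weights.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -e HF_TOKEN=$HF_TOKEN \
    -e MAX_TOTAL_TOKENS=6024 \
    -e MAX_BATCH_SIZE=1 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
    --sharded true \
    --num-shard 4 \
    --max-input-tokens 5000
```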
Output logs:

2024-10-07T06:30:47.292774Z INFO text_generation_launcher: Args { model_id: "meta-llama/Meta-Llama-3.1-8B-Instruct", revision: None, validation_workers: 2, sharded: Some( true, ), num_shard: Some( 4, ), quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: Some( 5000, ), max_input_length: None, max_total_tokens: Some( 6024, ), waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some( 1, ), cuda_graphs: None, hostname: "eeb1ec72b169", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], api_key: Some( "xxx", ), watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4, lora_adapters: None, usage_stats: On, }
2024-10-07T06:30:47.293479Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2024-10-07T06:30:47.293484Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 5000
2024-10-07T06:30:47.293487Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-10-07T06:30:47.293489Z INFO text_generation_launcher: Sharding model on 4 processes
2024-10-07T06:30:47.293556Z INFO download: text_generation_launcher: Starting check and download process for meta-llama/Meta-Llama-3.1-8B-Instruct
2024-10-07T06:30:49.757621Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-10-07T06:30:50.197536Z INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Meta-Llama-3.1-8B-Instruct
2024-10-07T06:30:50.197749Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-10-07T06:30:50.197801Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-10-07T06:30:50.198341Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
2024-10-07T06:30:50.198363Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
2024-10-07T06:30:52.534518Z INFO text_generation_launcher: Using prefix caching = True
2024-10-07T06:30:52.534550Z INFO text_generation_launcher: Using Attention = flashinfer
2024-10-07T06:31:00.208692Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-10-07T06:31:00.209210Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-10-07T06:31:00.209701Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-10-07T06:31:00.209782Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-10-07T06:31:03.994463Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-10-07T06:31:04.012487Z INFO shard-manager: text_generation_launcher: Shard ready in 13.812430798s rank=1
2024-10-07T06:31:04.291933Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-10-07T06:31:04.292206Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
2024-10-07T06:31:04.292206Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
2024-10-07T06:31:04.313117Z INFO shard-manager: text_generation_launcher: Shard ready in 14.113306239s rank=0
2024-10-07T06:31:04.313524Z INFO shard-manager: text_generation_launcher: Shard ready in 14.113391818s rank=3
2024-10-07T06:31:04.313770Z INFO shard-manager: text_generation_launcher: Shard ready in 14.113394813s rank=2
2024-10-07T06:31:04.411975Z INFO text_generation_launcher: Starting Webserver
2024-10-07T06:31:04.490925Z INFO text_generation_router_v3: backends/v3/src/lib.rs:90: Warming up model
2024-10-07T06:31:05.160689Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-10-07T06:31:06.175895Z INFO text_generation_router_v3: backends/v3/src/lib.rs:102: Setting max batch total tokens to 1084130
2024-10-07T06:31:06.175942Z INFO text_generation_router_v3: backends/v3/src/lib.rs:126: Using backend V3
2024-10-07T06:31:06.175988Z INFO text_generation_router::server: router/src/server.rs:1797: Using the Hugging Face API
2024-10-07T06:31:06.908180Z INFO text_generation_router::server: router/src/server.rs:2515: Serving revision 0e9e39f249a16976918f6564b8830bc894c89659 of model meta-llama/Llama-3.1-8B-Instruct
2024-10-07T06:31:08.905070Z INFO text_generation_router::server: router/src/server.rs:1943: Using config Some(Llama)
2024-10-07T06:31:08.905115Z WARN text_generation_router::server: router/src/server.rs:2090: Invalid hostname, defaulting to 0.0.0.0
2024-10-07T06:31:08.954949Z INFO text_generation_router::server: router/src/server.rs:2477: Connected

Expected behavior
Since I have specified the environment variables MAX_TOTAL_TOKENS=6024 and MAX_BATCH_SIZE=1, I would expect the total tokens (max batch total tokens) to be 6024. Instead, as can be seen in the logs, the inferred max batch total tokens is set to 1,084,130 and the VRAM usage goes up to 160 GB! According to my calculations (based on this article), the model should use 16 GB of memory plus an extra 3 GB for 6024 tokens, at 0.5 MiB per token for this particular model; correct me if I'm wrong.

To sum up:
Expected VRAM usage: 20 GB
Actual VRAM usage: 160 GB
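A rough sanity check of the 20 GB figure, under the same assumptions as above (~16 GB of bf16 weights for an 8B model and the article's ~0.5 MiB of KV-cache per token; these are estimates, not measurements):

```bash
# KV-cache for MAX_TOTAL_TOKENS=6024 at 0.5 MiB (512 KiB) per token:
echo $(( 6024 * 512 / 1024 ))   # ~3012 MiB, i.e. ~3 GB
# ~16 GB (weights) + ~3 GB (cache) gives the ~20 GB expected above.
# The logs instead report "Setting max batch total tokens to 1084130"; a KV-cache
# sized for that many tokens, spread over the 4 shards, is what fills the GPUs
# up to the observed 160 GB.
```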
What could be the cause of this behavior? Am I doing something wrong, or is this a bug?