triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

The GPU memory usage is too high. #625

Open imilli opened 1 month ago

imilli commented 1 month ago

System Info

- CPU: Intel 14700K
- GPU: NVIDIA GeForce RTX 4090
- TensorRT-LLM: 0.13
- Docker image: tritonserver:24.09-trtllm-python-py3

Who can help?

@Tracin

Information

Tasks

Reproduction

Reference: the OpenAI-compatible frontend under python/openai (openai_frontend/main.py).

Expected behavior

When I launch the server with openai/openai_frontend/main.py and load an 8-bit quantized ChatGLM4 engine that is 9.95 GB on disk, I expect GPU memory usage of roughly 12 GB. Instead, during inference the entire 24 GB of GPU memory is filled.

Actual behavior

root@docker-desktop:/llm/openai# python3 /llm/openai/openai_frontend/main.py --backend tensorrtllm --model-repository /llm/tensorrt_llm/model_repo --tokenizer /llm/tensorrt_llm/tokenizer_dir
I1019 19:28:29.074797 1272 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x204c00000' with size 268435456"
I1019 19:28:29.074857 1272 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1019 19:28:29.173311 1272 model_lifecycle.cc:472] "loading: preprocessing:1"
I1019 19:28:29.177154 1272 model_lifecycle.cc:472] "loading: postprocessing:1"
I1019 19:28:29.180251 1272 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
I1019 19:28:29.183750 1272 model_lifecycle.cc:472] "loading: tensorrt_llm_bls:1"
I1019 19:28:29.413431 1272 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I1019 19:28:29.413459 1272 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I1019 19:28:29.413473 1272 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I1019 19:28:29.413485 1272 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
I1019 19:28:29.416281 1272 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to true
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.13.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 40
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 2048
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I1019 19:28:31.814801 1272 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
I1019 19:28:31.890846 1272 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I1019 19:28:31.891695 1272 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I1019 19:28:32.878905 1272 model_lifecycle.cc:839] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I1019 19:28:33.534115 1272 model_lifecycle.cc:839] "successfully loaded 'postprocessing'"
I1019 19:28:33.534477 1272 model_lifecycle.cc:839] "successfully loaded 'preprocessing'"
[TensorRT-LLM][INFO] Loaded engine size: 10194 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 196.76 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 10184 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 648.06 KB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.98 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 23.99 GiB, available: 12.22 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 4506
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 32
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 11.00 GiB for max tokens in paged KV cache (288384).
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
I1019 19:28:53.616560 1272 libtensorrtllm.cc:184] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_0_0"
I1019 19:28:53.616756 1272 model_lifecycle.cc:839] "successfully loaded 'tensorrt_llm'"
I1019 19:28:53.619599 1272 model_lifecycle.cc:472] "loading: ensemble:1"
I1019 19:28:53.619778 1272 model_lifecycle.cc:839] "successfully loaded 'ensemble'"
I1019 19:28:53.619824 1272 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1019 19:28:53.619851 1272 server.cc:631]
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                           |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+

I1019 19:28:53.619893 1272 server.cc:674]
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| ensemble         | 1       | READY  |
| postprocessing   | 1       | READY  |
| preprocessing    | 1       | READY  |
| tensorrt_llm     | 1       | READY  |
| tensorrt_llm_bls | 1       | READY  |
+------------------+---------+--------+

I1019 19:28:53.644979 1272 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090"
I1019 19:28:53.648626 1272 metrics.cc:770] "Collecting CPU metrics"
I1019 19:28:53.648919 1272 tritonserver.cc:2598]
+----------------------------------+------------------------------------------------------------------+
| Option                           | Value                                                            |
+----------------------------------+------------------------------------------------------------------+
| server_id                        | triton                                                           |
| server_version                   | 2.50.0                                                           |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /llm/tensorrt_llm/model_repo                                     |
| model_control_mode               | MODE_NONE                                                        |
| strict_model_config              | 0                                                                |
| model_config_name                |                                                                  |
| rate_limit                       | OFF                                                              |
| pinned_memory_pool_byte_size     | 268435456                                                        |
| cuda_memory_pool_byte_size{0}    | 67108864                                                         |
| min_supported_compute_capability | 6.0                                                              |
| strict_readiness                 | 1                                                                |
| exit_timeout                     | 30                                                               |
| cache_enabled                    | 0                                                                |
+----------------------------------+------------------------------------------------------------------+

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Found model: name='ensemble', backend='ensemble'
Found model: name='postprocessing', backend='python'
Found model: name='preprocessing', backend='python'
Found model: name='tensorrt_llm', backend='tensorrtllm'
Found model: name='tensorrt_llm_bls', backend='python'
[WARNING] Adding CORS for the following origins: ['http://localhost']
INFO: Started server process [1272]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)

Additional notes

I'm not sure whether this is controlled by kv_cache_free_gpu_mem_fraction in the tensorrt_llm model config. How can this be solved?
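For what it's worth, the numbers in the log look consistent with the 0.9 default mentioned in the kv_cache_free_gpu_mem_fraction warning: roughly 10 GiB for the engine, then 90% of the remaining free GPU memory reserved for the paged KV cache. A quick back-of-envelope check (a sketch only; the values are copied from the log above, and the 0.9 fraction is taken from the warning message):

```python
# Rough check of the memory numbers reported in the startup log (sketch, not authoritative).
engine_gib = 10194 / 1024                 # "Loaded engine size: 10194 MiB"
available_gib = 12.22                     # "available: 12.22 GiB" after the engine is loaded
kv_cache_fraction = 0.9                   # default reported for kv_cache_free_gpu_mem_fraction

kv_cache_gib = available_gib * kv_cache_fraction
print(f"engine ~{engine_gib:.2f} GiB, KV cache ~{kv_cache_gib:.2f} GiB")
# -> KV cache ~11.00 GiB, matching "Allocated 11.00 GiB for max tokens in paged KV cache"

tokens_in_cache = 4506 * 64               # blocks in primary pool * tokens per block
print(tokens_in_cache)                    # 288384, matching the "(288384)" in the same log line
```

So the engine (~10 GiB), the KV cache (11 GiB), and the runtime buffers/CUDA context account for most of the 24 GB, even though the weights themselves are only ~10 GB.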

mathijshenquet commented 3 weeks ago

Look at kv_cache_host_memory_bytes and kv_cache_free_gpu_mem_fraction in https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/model_config.md

Both values can be configured with the config script or modified directly in config.pbtxt.
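For example, to cap the KV cache at half of the free GPU memory left after the engine is loaded, you could lower the fraction to 0.5. This is only a sketch: the parameter name comes from model_config.md, but the config.pbtxt path below is an assumption based on the model repository used in this issue, and 0.5 is an arbitrary example value.

```
# tensorrt_llm model config, e.g. /llm/tensorrt_llm/model_repo/tensorrt_llm/config.pbtxt
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.5"
  }
}
```

The same value can be filled in with the template helper shipped in this repo (tools/fill_template.py), assuming the same path:

```
python3 tools/fill_template.py -i /llm/tensorrt_llm/model_repo/tensorrt_llm/config.pbtxt \
    kv_cache_free_gpu_mem_fraction:0.5
```

After restarting the server, the "Allocated ... GiB for max tokens in paged KV cache" line should shrink accordingly. Keep in mind that a smaller KV cache also means fewer tokens available for in-flight requests, so don't set it lower than your batch size and sequence lengths require.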