triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Unable to launch triton server with TP #577

Open dhruvmullick opened 3 weeks ago

dhruvmullick commented 3 weeks ago

System Info

Built tensorrtllm_backend from source using dockerfile/Dockerfile.trt_llm_backend

tensorrt_llm: 0.13.0.dev2024081300
tritonserver: 2.48.0
Triton image: 24.07
CUDA: 12.5

Who can help?

@Tracin @kaiyux @schetlur-nv

Reproduction

I've built a TensorRT-LLM engine for Meta Llama 3 8B, and the Triton server gets stuck while spawning when tensor parallelism > 1 is used.

Things work if I don't use TP when building the engine and launching the server.

Build the Engine:

python3 quantize.py --model_dir meta_llama_3_8B_instruct_fp16 \
        --dtype float16 \
        --qformat int4_awq \
        --awq_block_size 128 \
        --output_dir /tmp/trt_checkpoint \
        --batch_size 8 \
        --calib_size 32 \
        --tp_size 2

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir /tmp/trt_checkpoint \
        --gemm_plugin float16 \
        --gpt_attention_plugin float16 \
        --kv_cache_type=paged \
        --remove_input_padding enable \
        --context_fmha enable \
        --use_paged_context_fmha enable \
        --max_seq_len 8000 \
        --max_num_tokens 4096 \
        --max_batch_size 8 \
        --output_dir trt_model \
        --log_level verbose \
        --multiple_profiles enable \
        --workers 2
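
As a sanity check before serving, the TP-2 build should leave one engine per rank next to a shared config.json. The file names and JSON key below are what I'd expect from this trtllm-build version and are assumptions; adjust if your layout differs.

ls trt_model
# expected for tp_size 2 (assumption):
#   config.json  rank0.engine  rank1.engine

# The tensor-parallel degree recorded in the engine config should match --tp_size.
# (The key usually sits under the "mapping" section; exact placement may vary by version.)
grep -o '"tp_size": *[0-9]*' trt_model/config.json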

Command used to launch the server:

python3 launch_triton_server.py --model_repo=triton_model_repo_copy \
    --world_size 2 \
    --tensorrt_llm_model_name=meta_llama_3_8B_instruct_trt \
    --log \
    --log-file /tmp/logs.txt \
    --force
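
To tell whether the server ever finishes loading, Triton's standard HTTP readiness endpoints can be polled (assuming the default HTTP port 8000); in the failing case described below these never report ready:

# Server-level readiness
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready

# Model-level readiness for the TRT-LLM model
curl -s -o /dev/null -w "%{http_code}\n" \
    http://localhost:8000/v2/models/meta_llama_3_8B_instruct_trt/ready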

Expected behavior

The server should spawn and start serving requests on localhost.

Actual behavior

I see the following logs on the console:

I0819 17:56:19.460307 16867 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0819 17:56:19.460335 16867 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0819 17:56:19.752444 16867 model_lifecycle.cc:472] "loading: meta_llama_3_8B_instruct_trt:1"
I0819 17:56:19.918688 16867 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I0819 17:56:19.918732 16867 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I0819 17:56:19.918737 16867 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I0819 17:56:19.918742 16867 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
I0819 17:56:19.933735 16867 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: meta_llama_3_8B_instruct_trt (version 1)"
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024081300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024081300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 1
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8000
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8000) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 7999  = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8000
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 8000 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8000) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 7999  = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 8000 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2771 MiB
[TensorRT-LLM][INFO] Loaded engine size: 2771 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 240.02 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 240.02 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 2. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 2. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 3. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 3. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 4. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 4. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 5. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 5. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 3.44 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 3.44 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 15.05 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 74.69 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 15.05 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 74.69 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 17210
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 17210
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 125
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 125
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 67.23 GiB for max tokens in paged KV cache (1101440).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 67.23 GiB for max tokens in paged KV cache (1101440).
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024081300 found in the config file, assuming engine(s) built by new builder API.

In /tmp/logs.txt, the last lines of output are:

I0819 17:57:02.475316 16866 backend_model_instance.cc:783] "Starting backend thread for meta_llama_3_8B_instruct_trt_0_0 at nice 0 on device 0..."
I0819 17:57:02.475672 16866 backend_model.cc:675] "Created model instance named 'meta_llama_3_8B_instruct_trt_0_0' with device id '0'"

And nothing after this.
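
For anyone trying to reproduce, a rough sketch of how the hang can be inspected (this assumes gdb and nvidia-smi are available inside the container and that NCCL is used for the TP allreduce; process names and paths follow my setup above):

# Both ranks should show up and hold memory on their GPUs.
nvidia-smi

# Re-launch with NCCL debug output to see whether communicator setup completes.
export NCCL_DEBUG=INFO
python3 launch_triton_server.py --model_repo=triton_model_repo_copy --world_size 2 \
    --tensorrt_llm_model_name=meta_llama_3_8B_instruct_trt

# Grab a native backtrace from each stuck rank (PIDs from nvidia-smi or ps).
gdb -batch -ex "thread apply all bt" -p <pid>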

Additional notes

NA

dhruvmullick commented 2 weeks ago

I also tried without quantization, following the steps in the official examples:

python convert_checkpoint.py --model_dir meta_llama_3_8B_instruct \
        --output_dir /tmp/tllm_checkpoint_2gpu_tp2 \
        --dtype bfloat16 \
        --tp_size 2

trtllm-build --checkpoint_dir /tmp/tllm_checkpoint_2gpu_tp2 \
        --output_dir meta_llama_3_1_8B_instruct/bf16/2-gpu/ \
        --max_batch_size 8 \
        --gemm_plugin auto

The server is still stuck.

I also tried making the batch size consistent between the Triton model config and the built engine, but that didn't help.
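
For reference, a rough way to compare the two settings (the paths below follow the directories used in the commands in this thread and are otherwise assumptions; in the engine's config.json the value usually sits under build_config, though key placement can vary by version):

# max_batch_size in the Triton model config
grep -m1 "max_batch_size" triton_model_repo_copy/meta_llama_3_8B_instruct_trt/config.pbtxt

# max_batch_size baked into the engine at build time
grep -o '"max_batch_size": *[0-9]*' meta_llama_3_1_8B_instruct/bf16/2-gpu/config.json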

dhruvmullick commented 2 weeks ago

I tried the official image nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3, which was released two days ago, and built the TRT engines from it.

The problem remains, even with reduce_fusion enabled. Logs below:

Logs

```
root@763cf08503e3:/workspace/dhruv_artificial_agency/inference_service# python3 launch_triton_server.py --world_size=2 --model_repo=models_dhruv --log --tensorrt_llm_model_name=meta_llama_3_1_8B_instruct_vanilla_trt
root@763cf08503e3:/workspace/dhruv_artificial_agency/inference_service# I0830 02:11:11.780256 2914 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7fe13e000000' with size 268435456"
I0830 02:11:11.797627 2914 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0830 02:11:11.797656 2914 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0830 02:11:12.083192 2914 model_lifecycle.cc:472] "loading: meta_llama_3_1_8B_instruct_vanilla_trt:1"
I0830 02:11:12.265538 2914 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I0830 02:11:12.265587 2914 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I0830 02:11:12.265592 2914 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I0830 02:11:12.265597 2914 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0830 02:11:12.281716 2914 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: meta_llama_3_1_8B_instruct_vanilla_trt (version 1)"
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 0
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 1
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1048576
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1048576
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1048576
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1048576
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 8231 MiB
[TensorRT-LLM][INFO] Loaded engine size: 8231 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 480.04 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 480.04 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8223 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8223 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 195.95 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 195.95 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 148.07 MB GPU memory for decoder.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 148.07 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 68.94 GiB
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 68.94 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 15885
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 15885
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 1016640
[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 1016640
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 15885
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 15885
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 62.05 GiB for max tokens in paged KV cache (1016640).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 62.05 GiB for max tokens in paged KV cache (1016640).
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
```

Steps to reproduce:

  1. Enter the image nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
  2. Use the same commands as https://github.com/triton-inference-server/tensorrtllm_backend/issues/577#issuecomment-2318325493
  3. Use the config file below:


name: "meta_llama_3_1_8B_instruct_vanilla_trt"
backend: "tensorrtllm"
max_batch_size: 8

model_transaction_policy {
  decoupled: False
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "draft_input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "decoder_input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "decoder_input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
    reshape: { shape: [ ] }
  },
  {
    name: "draft_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "draft_acceptance_threshold"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ] # TRTLLM only supports a single end id
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "embedding_bias"
    data_type: TYPE_FP32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_min"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_decay"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_reset_ids"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "early_stopping"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "streaming"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  # the unique task ID for the given LoRA.
  # To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
  # The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
  # If the cache is full the oldest LoRA will be evicted to make space for new ones.  An error is returned if `lora_task_id` is not cached.
  {
    name: "lora_task_id"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
  # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
  # each of the in / out tensors are first flattened and then concatenated together in the format above.
  # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
  {
    name: "lora_weights"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  # module identifier (same size a first dimension of lora_weights)
  # See LoraModule::ModuleType for model id mapping
  #
  # "attn_qkv": 0     # compbined qkv adapter
  # "attn_q": 1       # q adapter
  # "attn_k": 2       # k adapter
  # "attn_v": 3       # v adapter
  # "attn_dense": 4   # adapter for the dense layer in attention
  # "mlp_h_to_4h": 5  # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
  # "mlp_4h_to_h": 6  # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
  # "mlp_gate": 7     # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
  #
  # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
  {
    name: "lora_config"
    data_type: TYPE_INT32
    dims: [ -1, 3 ]
    optional: true
    allow_ragged_batch: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "batch_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
parameters: {
  key: "max_beam_width"
  value: {
    string_value: "1"
  }
}
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "data/trt_models/meta_llama_3_1_8B_instruct/vanilla_06_08_24_4bit"
  }
}
parameters: {
  key: "encoder_model_path"
  value: {
  }
}
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
  }
}
parameters: {
  key: "max_attention_window_size"
  value: {
    string_value: "max_sequence_length"
  }
}
parameters: {
  key: "sink_token_length"
  value: {
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "max_utilization"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}
parameters: {
  key: "kv_cache_host_memory_bytes"
  value: {
    string_value: "45000000000"
  }
}
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "cancellation_check_period_ms"
  value: {
  }
}
parameters: {
  key: "stats_check_period_ms"
  value: {
  }
}
parameters: {
  key: "iter_stats_max_iterations"
  value: {
  }
}
parameters: {
  key: "request_stats_max_iterations"
  value: {
  }
}
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "false"
  }
}
parameters: {
  key: "normalize_log_probs"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "enable_chunked_context"
  value: {
    string_value: "false"
  }
}
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "0, 1"
  }
}
parameters: {
  key: "lora_cache_optimal_adapter_size"
  value: {
  }
}
parameters: {
  key: "lora_cache_max_adapter_size"
  value: {
  }
}
parameters: {
  key: "lora_cache_gpu_memory_fraction"
  value: {
  }
}
parameters: {
  key: "lora_cache_host_memory_bytes"
  value: {
  }
}
parameters: {
  key: "decoding_mode"
  value: {
    string_value: "top_k"
  }
}
parameters: {
  key: "executor_worker_path"
  value: {
    string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
  }
}
parameters: {
  key: "medusa_choices"
  value: {
  }
}
parameters: {
  key: "gpu_weights_percent"
  value: {
  }
}
parameters: {
  key: "enable_context_fmha_fp32_acc"
  value: {
    string_value: "false"
  }
}
parameters: {
  key: "multi_block_mode"
  value: {
    string_value: "false"
  }
}
parameters: {
  key: "max_num_tokens"
  value: {
    string_value: "16384"
  }
}

  4. Spawn the Triton server using: python3 launch_triton_server.py --world_size=2 --model_repo=models_dhruv --log --tensorrt_llm_model_name=meta_llama_3_1_8B_instruct_vanilla_trt
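
After launching, a quick way to confirm that both ranks actually spawned and stayed alive (process names below are a guess; with --world_size=2 the launcher runs the server under mpirun, so two server processes are expected):

# One process per GPU should be visible and holding memory.
nvidia-smi

# Both MPI ranks of the server should still be running.
ps -ef | grep -E "tritonserver|trtllm" | grep -v grep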