[Open] dhruvmullick opened this issue 3 months ago
I even tried without quantization, following the steps in the official examples:
```
python convert_checkpoint.py --model_dir meta_llama_3_8B_instruct \
                             --output_dir /tmp/tllm_checkpoint_2gpu_tp2 \
                             --dtype bfloat16 \
                             --tp_size 2

trtllm-build --checkpoint_dir /tmp/tllm_checkpoint_2gpu_tp2 \
             --output_dir meta_llama_3_1_8B_instruct/bf16/2-gpu/ \
             --max_batch_size 8 \
             --gemm_plugin auto
```
Still stuck.
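To separate an engine problem from a backend/launch problem, the TP=2 engine can be exercised directly under mpirun with the standard TensorRT-LLM example runner, outside Triton (a sketch; the engine and tokenizer paths are the ones from the build commands above, and `examples/run.py` is assumed to be available in the container):

```
mpirun -n 2 --allow-run-as-root \
  python3 examples/run.py \
    --engine_dir meta_llama_3_1_8B_instruct/bf16/2-gpu/ \
    --tokenizer_dir meta_llama_3_8B_instruct \
    --input_text "Hello, my name is" \
    --max_output_len 32
```

If this generates text on both ranks, the engine itself is fine and the hang is on the Triton/backend side.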
I also tried making the batch size consistent between the Triton model config and the built engine, but it made no difference.
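For the batch-size check, the value baked into the engine can be compared with the one in the Triton config (a sketch; the `config.json` layout is the new builder API's, and the config.pbtxt path is assumed from the model repo layout):

```
# max_batch_size recorded in the built engine (new builder API stores it under build_config)
python3 -c "import json; print(json.load(open('meta_llama_3_1_8B_instruct/bf16/2-gpu/config.json'))['build_config']['max_batch_size'])"

# max_batch_size in the Triton model config
grep max_batch_size models_dhruv/meta_llama_3_1_8B_instruct_vanilla_trt/config.pbtxt
```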
I tried the official image nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3, which was released two days ago, and built the TRT engines inside it.
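For anyone reproducing, the container can be started along these lines (a sketch; the mount path is an assumption, and `--shm-size`/`--ipc=host` are included because both MPI ranks run inside one container):

```
docker run --rm -it --gpus all --net host \
  --shm-size=2g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$(pwd)":/workspace \
  nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3 bash
```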
The problem remains, though, even with reduce_fusion enabled. Logs below:
```
root@763cf08503e3:/workspace/dhruv_artificial_agency/inference_service# python3 launch_triton_server.py --world_size=2 --model_repo=models_dhruv --log --tensorrt_llm_model_name=meta_llama_3_1_8B_instruct_vanilla_trt
root@763cf08503e3:/workspace/dhruv_artificial_agency/inference_service#
I0830 02:11:11.780256 2914 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7fe13e000000' with size 268435456"
I0830 02:11:11.797627 2914 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0830 02:11:11.797656 2914 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0830 02:11:12.083192 2914 model_lifecycle.cc:472] "loading: meta_llama_3_1_8B_instruct_vanilla_trt:1"
I0830 02:11:12.265538 2914 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I0830 02:11:12.265587 2914 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I0830 02:11:12.265592 2914 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I0830 02:11:12.265597 2914 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0830 02:11:12.281716 2914 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: meta_llama_3_1_8B_instruct_vanilla_trt (version 1)"
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 0
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 1
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1048576
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1048576
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1048576
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1048576
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 8231 MiB
[TensorRT-LLM][INFO] Loaded engine size: 8231 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 480.04 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 480.04 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8223 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8223 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 195.95 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 195.95 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 148.07 MB GPU memory for decoder.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 148.07 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 68.94 GiB
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 68.94 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 15885
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 15885
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 1016640
[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 1016640
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 15885
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 15885
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 62.05 GiB for max tokens in paged KV cache (1016640).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 62.05 GiB for max tokens in paged KV cache (1016640).
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
```
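Since both ranks get through engine load and KV-cache allocation before the hang, one thing worth trying is relaunching with NCCL debug output enabled, and optionally with peer-to-peer transport disabled as an experiment (a sketch; these are standard NCCL environment variables, not something taken from the logs above):

```
export NCCL_DEBUG=INFO        # print NCCL init/transport details for both ranks
export NCCL_P2P_DISABLE=1     # experiment only: rule out P2P transport issues between the two GPUs
python3 launch_triton_server.py --world_size=2 --model_repo=models_dhruv --log \
  --tensorrt_llm_model_name=meta_llama_3_1_8B_instruct_vanilla_trt
```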
Way to recreate:

Triton model config (config.pbtxt):

```
name: "meta_llama_3_1_8B_instruct_vanilla_trt"
backend: "tensorrtllm"
max_batch_size: 8
model_transaction_policy {
decoupled: False
}
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
allow_ragged_batch: true
},
{
name: "input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "draft_input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "decoder_input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "decoder_input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
reshape: { shape: [ ] }
},
{
name: "draft_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "draft_acceptance_threshold"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ] # TRTLLM only supports a single end id
reshape: { shape: [ ] }
optional: true
},
{
name: "pad_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "embedding_bias"
data_type: TYPE_FP32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_k"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_min"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_decay"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_reset_ids"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "early_stopping"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "min_length"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_search_diversity_rate"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "presence_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "frequency_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_context_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_generation_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "streaming"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "prompt_embedding_table"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "prompt_vocab_size"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
# the unique task ID for the given LoRA.
# To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
# The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
# If the cache is full the oldest LoRA will be evicted to make space for new ones. An error is returned if `lora_task_id` is not cached.
{
name: "lora_task_id"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
# weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
# where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
# each of the in / out tensors are first flattened and then concatenated together in the format above.
# D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
{
name: "lora_weights"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
# module identifier (same size as the first dimension of lora_weights)
# See LoraModule::ModuleType for model id mapping
#
# "attn_qkv": 0 # compbined qkv adapter
# "attn_q": 1 # q adapter
# "attn_k": 2 # k adapter
# "attn_v": 3 # v adapter
# "attn_dense": 4 # adapter for the dense layer in attention
# "mlp_h_to_4h": 5 # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
# "mlp_4h_to_h": 6 # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
# "mlp_gate": 7 # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
#
# last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
{
name: "lora_config"
data_type: TYPE_INT32
dims: [ -1, 3 ]
optional: true
allow_ragged_batch: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "context_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "generation_logits"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "batch_index"
data_type: TYPE_INT32
dims: [ 1 ]
}
]
instance_group [
{
count: 1
kind : KIND_GPU
gpus: [ 0, 1 ]
}
]
parameters: {
key: "max_beam_width"
value: {
string_value: "1"
}
}
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value: "no"
}
}
parameters: {
key: "gpt_model_type"
value: {
string_value: "inflight_fused_batching"
}
}
parameters: {
key: "gpt_model_path"
value: {
string_value: "data/trt_models/meta_llama_3_1_8B_instruct/vanilla_06_08_24_4bit"
}
}
parameters: {
key: "encoder_model_path"
value: {
}
}
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {
}
}
parameters: {
key: "max_attention_window_size"
value: {
string_value: "max_sequence_length"
}
}
parameters: {
key: "sink_token_length"
value: {
}
}
parameters: {
key: "batch_scheduler_policy"
value: {
string_value: "max_utilization"
}
}
parameters: {
key: "kv_cache_free_gpu_mem_fraction"
value: {
string_value: "0.9"
}
}
parameters: {
key: "kv_cache_host_memory_bytes"
value: {
string_value: "45000000000"
}
}
parameters: {
key: "exclude_input_in_output"
value: {
string_value: "true"
}
}
parameters: {
key: "cancellation_check_period_ms"
value: {
}
}
parameters: {
key: "stats_check_period_ms"
value: {
}
}
parameters: {
key: "iter_stats_max_iterations"
value: {
}
}
parameters: {
key: "request_stats_max_iterations"
value: {
}
}
parameters: {
key: "enable_kv_cache_reuse"
value: {
string_value: "false"
}
}
parameters: {
key: "normalize_log_probs"
value: {
string_value: "true"
}
}
parameters: {
key: "enable_chunked_context"
value: {
string_value: "false"
}
}
parameters: {
key: "gpu_device_ids"
value: {
string_value: "0, 1"
}
}
parameters: {
key: "lora_cache_optimal_adapter_size"
value: {
}
}
parameters: {
key: "lora_cache_max_adapter_size"
value: {
}
}
parameters: {
key: "lora_cache_gpu_memory_fraction"
value: {
}
}
parameters: {
key: "lora_cache_host_memory_bytes"
value: {
}
}
parameters: {
key: "decoding_mode"
value: {
string_value: "top_k"
}
}
parameters: {
key: "executor_worker_path"
value: {
string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
}
}
parameters: {
key: "medusa_choices"
value: {
}
}
parameters: {
key: "gpu_weights_percent"
value: {
}
}
parameters: {
key: "enable_context_fmha_fp32_acc"
value: {
string_value: "false"
}
}
parameters: {
key: "multi_block_mode"
value: {
string_value: "false"
}
}
parameters: {
key: "max_num_tokens"
value: {
string_value: "16384"
}
}
```
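For reference, configs like the one above are normally produced by filling the backend's template rather than by hand-editing; the flow looks roughly like this (a sketch; the variable names follow the standard tensorrtllm_backend template and may differ by version, and the values are taken from the config above):

```
python3 tools/fill_template.py -i models_dhruv/meta_llama_3_1_8B_instruct_vanilla_trt/config.pbtxt \
  "triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,\
engine_dir:data/trt_models/meta_llama_3_1_8B_instruct/vanilla_06_08_24_4bit,\
batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:0.9,\
exclude_input_in_output:True"
```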
Spawn the Triton server using: `python3 launch_triton_server.py --world_size=2 --model_repo=models_dhruv --log --tensorrt_llm_model_name=meta_llama_3_1_8B_instruct_vanilla_trt`
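Under the hood, launch_triton_server.py assembles an mpirun command with one tritonserver process per rank, roughly like the following (a sketch, not the exact generated command):

```
mpirun --allow-run-as-root \
  -n 1 /opt/tritonserver/bin/tritonserver --model-repository=models_dhruv \
       --disable-auto-complete-config \
       --backend-config=python,shm-region-prefix-name=prefix0_ : \
  -n 1 /opt/tritonserver/bin/tritonserver --model-repository=models_dhruv \
       --disable-auto-complete-config \
       --backend-config=python,shm-region-prefix-name=prefix1_
```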
@dhruvmullick I'm facing the same problem on my multi-GPU server with 4x L40S. Have you managed to solve it?
@imihic, after spending a week on this, I pivoted to vLLM.
System Info
Built tensorrtllm_backend from source using dockerfile/Dockerfile.trt_llm_backend
- tensorrt_llm 0.13.0.dev2024081300
- tritonserver 2.48.0
- Triton image: 24.07
- CUDA 12.5
Who can help?
@Tracin @kaiyux @schetlur-nv
Information

Tasks

- examples folder (such as GLUE/SQuAD, ...)

Reproduction
I've built a TRT-LLM engine for Meta Llama 3 8B, and the Triton server gets stuck during startup when tensor parallelism > 1 is used.
Things work if I don't use TP when building the engine and spawning the server.
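For comparison, the working single-GPU build is the same set of commands as above without the tensor-parallel settings (a sketch derived from the commands above; the output directory names are placeholders):

```
python convert_checkpoint.py --model_dir meta_llama_3_8B_instruct \
                             --output_dir /tmp/tllm_checkpoint_1gpu \
                             --dtype bfloat16

trtllm-build --checkpoint_dir /tmp/tllm_checkpoint_1gpu \
             --output_dir meta_llama_3_1_8B_instruct/bf16/1-gpu/ \
             --max_batch_size 8 \
             --gemm_plugin auto
```

That engine then serves fine when launched with --world_size=1.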
Build the engine: see the convert_checkpoint.py / trtllm-build commands above.
Command used to launch the server: see the launch_triton_server.py command above.
Expected behavior
The server should spawn and start serving requests on localhost.
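For example, once the server is up, these basic readiness checks should succeed (default HTTP port 8000; model name from the config above):

```
curl -sf localhost:8000/v2/health/ready && echo "server ready"
curl -sf localhost:8000/v2/models/meta_llama_3_1_8B_instruct_vanilla_trt/ready && echo "model ready"
```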
Actual behavior
I see the logs on the console (posted above).
In the /tmp/logs.txt file, the last output I see is the startup log above, and nothing after it.
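While it hangs, it may help to capture what the two ranks are doing (a sketch using standard tools; <tritonserver_pid> is a placeholder, and gdb may need to be installed in the container):

```
pgrep -af tritonserver                                      # both ranks should still be running
nvidia-smi                                                  # check that memory was allocated on both GPUs
gdb -p <tritonserver_pid> -batch -ex "thread apply all bt"  # see where a rank is blocked
```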
Additional notes
NA