lkm2835 opened this issue 3 months ago
Who can help?

No response

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
```shell
python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /app/models \
    --output_dir /app/models/tensorrt \
    --dtype float16 \
    --tp_size 2
```

```shell
trtllm-build --checkpoint_dir /app/models/tensorrt \
    --remove_input_padding enable \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --gemm_plugin float16 \
    --output_dir /app/models/tensorrt_llm/context_fmha \
    --paged_kv_cache disable \
    --enable_xqa disable \
    --multi_block_mode disable \
    --tp_size 2 \
    --max_batch_size 1 \
    --max_input_len 4096 \
    --max_output_len 2048
```

```shell
mkdir /app/models/triton_model
cp -r /app/all_models/inflight_batcher_llm/* /app/models/triton_model
python3 /app/tools/fill_template.py -i /app/models/triton_model/preprocessing/config.pbtxt tokenizer_dir:/app/models/,triton_max_batch_size:1,preprocessing_instance_count:1
python3 /app/tools/fill_template.py -i /app/models/triton_model/postprocessing/config.pbtxt tokenizer_dir:/app/models/,triton_max_batch_size:1,postprocessing_instance_count:1
python3 /app/tools/fill_template.py -i /app/models/triton_model/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:1,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 /app/tools/fill_template.py -i /app/models/triton_model/ensemble/config.pbtxt triton_max_batch_size:1
python3 /app/tools/fill_template.py -i /app/models/triton_model/tensorrt_llm/config.pbtxt triton_max_batch_size:1,decoupled_mode:False,max_beam_width:1,engine_dir:/app/models/tensorrt_llm/context_fmha,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:v1,max_queue_delay_microseconds:0
```
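The report does not show how the server was launched or queried. A plausible sketch of the remaining steps, assuming the standard tensorrtllm_backend container layout (`launch_triton_server.py` path, default HTTP port 8000, and the `ensemble` generate endpoint are assumptions, not taken from the report):

```shell
# Launch Triton across both GPUs; world_size must match tp_size=2.
# Script path assumed from the standard tensorrtllm_backend container layout.
python3 /app/scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo /app/models/triton_model

# Send a request through the ensemble model. With context_fmha enabled
# and tp_size=2, this is presumably where the hang is observed.
curl -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```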
Expected behavior

It works well without hanging.

actual behavior
```
+-----------------------------------------+----------------------+----------------------+
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:00:06.0 Off |                    0 |
| N/A   35C    P0             76W / 400W  |  11138MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:00:07.0 Off |                    0 |
| N/A   39C    P0             82W / 400W  |  11106MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
```
TensorRT-LLM often hangs when using both `tp_size 2` and `enable_context_fmha`.

additional notes

NA
@lkm2835 do you see this issue when using the trt-llm examples directly, without the triton backends?
@PerkzZheng I solved it temporarily. My workaround is to disable `use_custom_all_reduce` in `trtllm-build`.
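For reference, that workaround corresponds to rebuilding the engine with the custom all-reduce kernel turned off. The flag spelling below matches older `trtllm-build` releases and should be checked against `trtllm-build --help` for the version in use; it is a sketch, not a command taken from the report:

```shell
# Same convert_checkpoint step as in the reproduction; only the build changes.
# All other flags (plugins, max_input_len, etc.) stay as in the original command.
trtllm-build --checkpoint_dir /app/models/tensorrt \
    --context_fmha enable \
    --output_dir /app/models/tensorrt_llm/context_fmha \
    --use_custom_all_reduce disable
```

Disabling the custom all-reduce falls back to NCCL for the tensor-parallel reduction, which sidesteps the hang at some cost in latency.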