triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

TensorRT-LLM often hangs using both `tp_size 2` and `enable_context_fmha`. #390

Open · lkm2835 opened this issue 3 months ago

lkm2835 commented 3 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
       --model_dir /app/models \
       --output_dir /app/models/tensorrt \
       --dtype float16 \
       --tp_size 2
trtllm-build --checkpoint_dir /app/models/tensorrt \
             --remove_input_padding enable \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --gemm_plugin float16 \
             --output_dir /app/models/tensorrt_llm/context_fmha \
             --paged_kv_cache disable \
             --enable_xqa disable \
             --multi_block_mode disable \
             --tp_size 2 \
             --max_batch_size 1 \
             --max_input_len 4096 \
             --max_output_len 2048
mkdir /app/models/triton_model
cp -r /app/all_models/inflight_batcher_llm/* /app/models/triton_model

python3 /app/tools/fill_template.py -i /app/models/triton_model/preprocessing/config.pbtxt tokenizer_dir:/app/models/,triton_max_batch_size:1,preprocessing_instance_count:1
python3 /app/tools/fill_template.py -i /app/models/triton_model/postprocessing/config.pbtxt tokenizer_dir:/app/models/,triton_max_batch_size:1,postprocessing_instance_count:1
python3 /app/tools/fill_template.py -i /app/models/triton_model/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:1,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 /app/tools/fill_template.py -i /app/models/triton_model/ensemble/config.pbtxt triton_max_batch_size:1
python3 /app/tools/fill_template.py -i /app/models/triton_model/tensorrt_llm/config.pbtxt triton_max_batch_size:1,decoupled_mode:False,max_beam_width:1,engine_dir:/app/models/tensorrt_llm/context_fmha,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:v1,max_queue_delay_microseconds:0
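With `tp_size 2`, Triton has to be started with a world size of 2. As a rough sketch of the final launch step (assuming the `launch_triton_server.py` helper shipped with this backend is available at /app/scripts/ inside the same container, and using the model repository created above):

python3 /app/scripts/launch_triton_server.py \
        --world_size 2 \
        --model_repo /app/models/triton_model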

Expected behavior

The server runs and completes requests without hanging.

Actual behavior

+-----------------------------------------+----------------------+----------------------+
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:00:06.0 Off |                    0 |
| N/A   35C    P0              76W / 400W |  11138MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:00:07.0 Off |                    0 |
| N/A   39C    P0              82W / 400W |  11106MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

TensorRT-LLM often hangs when both `tp_size 2` and `enable_context_fmha` are used; the nvidia-smi snapshot above shows both GPUs stuck at 100% utilization while the server makes no progress.

Additional notes

N/A

PerkzZheng commented 2 months ago

@lkm2835 Do you see this issue when running the TensorRT-LLM examples directly, without the Triton backend?

lkm2835 commented 2 months ago

@PerkzZheng I worked around it for now: my solution is to disable `use_custom_all_reduce` in `trtllm-build`.
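For reference, a minimal sketch of the workaround build command, assuming a TensorRT-LLM release whose `trtllm-build` still exposes the `--use_custom_all_reduce` flag (newer releases may have removed or renamed it); it is the original build command from the reproduction with that single flag added:

trtllm-build --checkpoint_dir /app/models/tensorrt \
             --remove_input_padding enable \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --gemm_plugin float16 \
             --use_custom_all_reduce disable \
             --output_dir /app/models/tensorrt_llm/context_fmha \
             --paged_kv_cache disable \
             --enable_xqa disable \
             --multi_block_mode disable \
             --tp_size 2 \
             --max_batch_size 1 \
             --max_input_len 4096 \
             --max_output_len 2048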