jayakommuru opened this issue 3 months ago
@byshiue @schetlur-nv can you help with this? I am not able to deploy the basic t5-small model following the instructions given in https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md
System Info
GPU: L4
GPU memory: 24 GB
TensorRT-LLM version: v0.10.0
Container used: tritonserver:24.06-trtllm-python-py3
Who can help?
@byshiue @schetlur-nv
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
docker run -v /home/jaya_kommuru/:/home/jaya_kommuru/ -it --gpus=all --net=host --ipc=host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout tags/v0.10.0
cd examples/enc_dec/
git clone https://huggingface.co/google-t5/t5-small /tmp/hf_models/t5-small
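Side note: the t5-small repository stores its weight files in Git LFS, so if the clone above only brings down pointer files, re-fetching them with git-lfs should help (a hedged suggestion, not part of the documented steps):

git lfs install
cd /tmp/hf_models/t5-small && git lfs pull && cd -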
export MODEL_NAME=t5-small
export MODEL_TYPE=t5 # or bart
export HF_MODEL_PATH=/tmp/hf_models/${MODEL_NAME}
export UNIFIED_CKPT_PATH=/tmp/ckpt/${MODEL_NAME}
export ENGINE_PATH=/tmp/engines/${MODEL_NAME}
python convert_checkpoint.py --model_type ${MODEL_TYPE} --model_dir ${HF_MODEL_PATH} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16
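If the conversion succeeds, the unified checkpoint should contain separate encoder and decoder directories (the layout assumed here follows the paths used in the build commands below); a quick sanity check:

ls ${UNIFIED_CKPT_PATH}/tp1/pp1
# expected: decoder  encoder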
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH}/tp1/pp1/encoder --output_dir ${ENGINE_PATH}/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --max_batch_size 64 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --context_fmha disable
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH}/tp1/pp1/decoder --output_dir ${ENGINE_PATH}/decoder --moe_plugin disable --enable_xqa disable --max_batch_size 64 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --context_fmha disable --max_input_len 1 --max_encoder_input_len 2048
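For context, --max_input_len 1 on the decoder build appears intentional for encoder-decoder models: decoding starts from a single start token, and the text prompt is instead bounded by --max_encoder_input_len. Assuming trtllm-build writes a config.json into each engine directory (field names assumed from the build flags above), the built limits can be double-checked with:

grep -o '"max_input_len": *[0-9]*' ${ENGINE_PATH}/decoder/config.json
grep -o '"max_encoder_input_len": *[0-9]*' ${ENGINE_PATH}/decoder/config.json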
cd ../../../
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git checkout tags/v0.10.0
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH}/decoder,encoder_engine_dir:${ENGINE_PATH}/encoder,max_tokens_in_paged_kv_cache:4096,max_attention_window_size:4096,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,enable_chunked_context:False,max_queue_size:0
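As a hedged sanity check, the filled config can be grepped to confirm the substitution landed, in particular that both the decoder and encoder engine paths point where intended:

grep -n "/tmp/engines" all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt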
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_MODEL_PATH},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_MODEL_PATH},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
pip install SentencePiece
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=all_models/inflight_batcher_llm/
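Once the server is up, Triton's standard readiness endpoint can be used to confirm all models loaded before sending generate requests:

curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
# 200 means the server and all models are ready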
curl -X POST localhost:8000/v2/models/ensemble/generate -d "{\"text_input\": \"Summarize the following news article: (CNN)Following last year's successful U.K. tour, Prince and 3rdEyeGirl are bringing the Hit & Run Tour to the U.S. for the first time. The first -- and so far only -- scheduled show will take place in Louisville, Kentucky, the hometown of 3rdEyeGirl drummer Hannah Welton. Slated for March 14, tickets will go on sale Monday, March 9 at 10 a.m. local time. Prince crowns dual rock charts . A venue has yet to be announced. When the Hit & Run worked its way through the U.K. in 2014, concert venues were revealed via Twitter prior to each show. Portions of the ticket sales will be donated to various Louisville charities. See the original story at Billboard.com. ©2015 Billboard. All Rights Reserved.\", \"max_tokens\": 1024, \"bad_words\": \"\", \"stop_words\": \"\"}"
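If the same error (shown below) also appears with a much shorter prompt, that would point at the decoder engine's max_input_len limit rather than the article length; a minimal request for testing this (prompt text is arbitrary):

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "translate English to German: hello world", "max_tokens": 32, "bad_words": "", "stop_words": ""}'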
Expected behavior
The curl request should return a generated summary of the article.
actual behavior
Instead, it fails with the following error:
{"error":"in ensemble 'ensemble', Executor failed process requestId 4 due to the following error: Encountered an error when fetching new request: Prompt length (200) exceeds maximum input length (1). (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:201)\n1 0x7f15ca33587f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6c387f) [0x7f15ca33587f]\n2 0x7f15cc2dcae2 tensorrt_llm::executor::Executor::Impl::executionLoop() + 722\n3 0x7f16bd1d8253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f16bd1d8253]\n4 0x7f16bcf67ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f16bcf67ac3]\n5 0x7f16bcff8a04 clone + 68"}
additional notes
I have been referring to this example: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md