triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Llama 7B model can't get longer output text after using triton server #434

Closed XiaobingSuper closed 1 month ago

XiaobingSuper commented 2 months ago

System Info

Who can help?

@byshiue @schetlur-nv

Information

Tasks

Reproduction

export TARGET_HF_LLAMA_MODEL=llama-7b-chat-hf/
export TARGET_UNIFIED_CKPT_PATH=/tmp/ckpt/llama/7b/
export TARGET_ENGINE_PATH=/tmp/engines/llama/7b/
python convert_checkpoint.py --model_dir ${TARGET_HF_LLAMA_MODEL} \
                             --output_dir ${TARGET_UNIFIED_CKPT_PATH} \
                             --dtype float16

trtllm-build --checkpoint_dir ${TARGET_UNIFIED_CKPT_PATH} \
             --remove_input_padding enable \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --gemm_plugin float16 \
             --output_dir ${TARGET_ENGINE_PATH} \
             --paged_kv_cache enable \
             --max_batch_size 64 \
             --max_output_len 2048 \
             --use_paged_context_fmha enable
cp -r all_models/inflight_batcher_llm llama

python3 tools/fill_template.py -i llama/preprocessing/config.pbtxt tokenizer_dir:${TARGET_HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama/postprocessing/config.pbtxt tokenizer_dir:${TARGET_HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i llama/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${TARGET_ENGINE_PATH},max_tokens_in_paged_kv_cache:25600,max_attention_window_size:25600,kv_cache_free_gpu_mem_fraction:0.9,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
pip install SentencePiece
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=llama/
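
The issue does not show the client command that sets --output-len. One way to send a request with a chosen output length to the launched server, assuming the default HTTP port 8000 and the ensemble model from the repository layout above, is the generate endpoint:

# hypothetical request for illustration; max_tokens controls the requested output length
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 1024, "bad_words": "", "stop_words": ""}'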

Expected behavior

The output text should keep getting longer when the --output-len value is increased beyond 711.
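
For comparison, here is a sketch of the same request sent through the repository's example gRPC client; the flag names (--tokenizer-dir, --text, --request-output-len) are assumptions based on the client shipped in inflight_batcher_llm/client and may differ between versions:

# hypothetical invocation; --request-output-len plays the role of --output-len above
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --tokenizer-dir ${TARGET_HF_LLAMA_MODEL} \
    --text "What is machine learning?" \
    --request-output-len 1024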

Actual behavior

When --output-len is bigger than 711, the output text length does not increase any further. However, if I use this Python script https://github.com/NVIDIA/TensorRT-LLM/blob/v0.9.0/examples/run.py, the output text length still increases even when I raise --output-len to 1024 (the output text length is 5065):

python run.py --tokenizer_dir /lpai/volumes/cloudmodel-muses/lt/models/Llama-2-7b-chat-hf/ --engine_dir=/lpai/trt_engines/llama/7B/trt_engines/fp16/1-gpu/ --max_output_len=1024 --input_text='What is machine learning?'

additional notes

no