triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Llama 7B model can't get longer output text after using triton server #434

Closed XiaobingSuper closed 1 month ago

XiaobingSuper commented 2 months ago

System Info

Who can help?

@byshiue @schetlur-nv

Information

Tasks

Reproduction

export TARGET_HF_LLAMA_MODEL=llama-7b-chat-hf/
export TARGET_UNIFIED_CKPT_PATH=/tmp/ckpt/llama/7b/
export TARGET_ENGINE_PATH=/tmp/engines/llama/7b/
python convert_checkpoint.py --model_dir ${TARGET_HF_LLAMA_MODEL} \
                             --output_dir ${TARGET_UNIFIED_CKPT_PATH} \
                             --dtype float16

trtllm-build --checkpoint_dir ${TARGET_UNIFIED_CKPT_PATH} \
             --remove_input_padding enable \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --gemm_plugin float16 \
             --output_dir ${TARGET_ENGINE_PATH} \
             --paged_kv_cache enable \
             --max_batch_size 64 \
             --max_output_len 2048 \
             --use_paged_context_fmha enable
cp -r all_models/inflight_batcher_llm llama

python3 tools/fill_template.py -i llama/preprocessing/config.pbtxt tokenizer_dir:${TARGET_HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama/postprocessing/config.pbtxt tokenizer_dir:${TARGET_HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i llama/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${TARGET_ENGINE_PATH},max_tokens_in_paged_kv_cache:25600,max_attention_window_size:25600,kv_cache_free_gpu_mem_fraction:0.9,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
pip install SentencePiece
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=llama/
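
The issue does not show the client command that sets --output-len. One way to send a request with a chosen output length to the launched server, assuming the default HTTP port 8000 and the ensemble model from the repository layout above, is the generate endpoint:

# hypothetical request for illustration; max_tokens controls the requested output length
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 1024, "bad_words": "", "stop_words": ""}'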

Expected behavior

The output text should keep getting longer when the --output-len value is increased beyond 711.
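
For comparison, here is a sketch of the same request sent through the repository's example gRPC client; the flag names (--tokenizer-dir, --text, --request-output-len) are assumptions based on the client shipped in inflight_batcher_llm/client and may differ between versions:

# hypothetical invocation; --request-output-len plays the role of --output-len above
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --tokenizer-dir ${TARGET_HF_LLAMA_MODEL} \
    --text "What is machine learning?" \
    --request-output-len 1024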

Actual behavior

When --output-len is bigger than 711, the output text length does not increase any further. However, if I use this Python script https://github.com/NVIDIA/TensorRT-LLM/blob/v0.9.0/examples/run.py, the output text length still increases even when I raise --output-len to 1024 (the output text length is 5065):

python run.py --tokenizer_dir /lpai/volumes/cloudmodel-muses/lt/models/Llama-2-7b-chat-hf/ --engine_dir=/lpai/trt_engines/llama/7B/trt_engines/fp16/1-gpu/ --max_output_len=1024 --input_text='What is machine learning?'

additional notes

no