triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

There is a problem with llama 7B model pre-processing after using triton server #445

Closed Graham1025 closed 1 month ago

Graham1025 commented 1 month ago

System Info

TensorRT-LLM: v0.9.0
TensorRT-LLM backend: v0.9.0
GPU: A100

Who can help?

@schetlur-nv

Reproduction

  1. Build the engine:

python3 convert_checkpoint.py --workers 8 --tp_size 1 --dtype float16 --model_dir /mnt/volumes/cloudmodel-muses/lt/models/Llama-2-7b-hf/ --output_dir /lpai/trtllm_checkpoints/Llama-2-7b-hf-w16a16-no-int8kvcache

trtllm-build --checkpoint_dir /lpai/trtllm_checkpoints/Llama-2-7b-hf-w16a16-no-int8kvcache --output_dir /lpai/trt_engines/llama/7B/trt_engines/fp16-fp16-hf/1-gpu --gemm_plugin float16 --max_batch_size 64 --max_input_len 3000 --max_output_len 2048 --max_draft_len 10 --use_paged_context_fmha enable
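After the build, the engine directory contains a config.json with the resolved build parameters; a quick sanity check is sketched below (the "build_config" key layout follows the v0.9.0 builder output and may differ in other versions):

```python
import json
import pathlib

# Engine directory produced by the trtllm-build command above.
engine_dir = pathlib.Path("/lpai/trt_engines/llama/7B/trt_engines/fp16-fp16-hf/1-gpu")
cfg = json.loads((engine_dir / "config.json").read_text())

# The limits baked into the engine; the Triton configs below must stay within them.
print(cfg["build_config"]["max_batch_size"])  # expect 64
print(cfg["build_config"]["max_input_len"])   # expect 3000
```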

  2. Set the config:

export DRAFT_HF_LLAMA_MODEL=/mnt/volumes/cloudmodel-muses/lt/models/llama-68m
export TARGET_HF_LLAMA_MODEL=/mnt/volumes/cloudmodel-muses/lt/models/Llama-2-7b-hf/
export TARGET_ENGINE_PATH=/lpai/trt_engines/llama/7B/trt_engines/fp16/1-gpu
export DRAFT_ENGINE_PATH=/lpai/trt_engines/llama/68m/trt_engines/fp16/1-gpu
cp all_models/spec_decode llama_7b -r

python3 tools/fill_template.py -i llama_7b/preprocessing/config.pbtxt tokenizer_dir:${DRAFT_HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_7b/postprocessing/config.pbtxt tokenizer_dir:${TARGET_HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_7b/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_7b/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i llama_7b/tensorrt_llm_target/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${TARGET_ENGINE_PATH},max_tokens_in_paged_kv_cache:25600,max_attention_window_size:25600,kv_cache_free_gpu_mem_fraction:0.9,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
python3 tools/fill_template.py -i llama_7b/tensorrt_llm_draft/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${DRAFT_ENGINE_PATH},max_tokens_in_paged_kv_cache:25600,max_attention_window_size:25600,kv_cache_free_gpu_mem_fraction:0.9,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
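To confirm the substitutions took effect before starting the server, a quick check is sketched below (a minimal sketch, assuming the template files use the same ${...} placeholder names as the fill_template keys):

```python
import pathlib

# Generated config from the fill_template.py commands above.
cfg = pathlib.Path("llama_7b/preprocessing/config.pbtxt").read_text()

# fill_template.py substitutes ${...} placeholders in place, so the
# batch-size placeholder should no longer appear after the commands above.
print("${triton_max_batch_size}" not in cfg)  # expect True
```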

  3. Start the server:

python3 scripts/launch_triton_server.py --world_size=1 --model_repo /lpai/tensorrtllm_backend/llama_7b/
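Once launched, readiness can be polled on Triton's standard health endpoint (a minimal sketch; it assumes the requests package is installed and that the HTTP port is mapped to 7000, matching the curl request in the next step, rather than Triton's default 8000):

```python
import requests  # third-party: pip install requests

# Returns HTTP 200 once every model in the repository has loaded.
resp = requests.get("http://localhost:7000/v2/health/ready")
print(resp.status_code)
```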

  4. Test the result with different methods:

triton server:

curl -X POST localhost:7000/v2/models/ensemble/generate -d '{"text_input": "Write a short blog post (500 words) about the best dog toys for new dog owners.", "max_tokens": 512, "bad_words": "", "stop_words": "", "temperature":1.0, "top_k":1, "top_p":0.0, "length_penalty":1.0, "repetition_penalty": 1.0, "presence_penalty":0.0, "frequency_penalty":0.0}'

{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"Write a short blog post (500 words) about the best dog toys for new dog owners.\nWrite a short blog post (500 words) about the best dog toys for new dog owners.\nWrite a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. 
Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog"}
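For completeness, the same generate request can be issued from Python (a sketch equivalent to the curl above; assumes the requests package is installed):

```python
import requests  # third-party: pip install requests

# Mirrors the curl request to the ensemble model's generate endpoint.
payload = {
    "text_input": "Write a short blog post (500 words) about the best dog "
                  "toys for new dog owners.",
    "max_tokens": 512,
    "bad_words": "",
    "stop_words": "",
    "temperature": 1.0,
    "top_k": 1,
    "top_p": 0.0,
}
resp = requests.post("http://localhost:7000/v2/models/ensemble/generate",
                     json=payload)
print(resp.json()["text_output"])
```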

run.py engine:

python3 run.py --max_output_len=512 --tokenizer_dir /mnt/volumes/cloudmodel-muses/lt/models/Llama-2-7b-hf/ --engine_dir=/lpai/trt_engines/llama/7B/trt_engines/fp16-fp16-hf/1-gpu --input_text "Write a short blog post (500 words) about the best dog toys for new dog owners."

[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Loaded engine size: 12855 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13009, GPU 14255 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 13011, GPU 14265 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13054, GPU 29849 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13054, GPU 29857 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 92544. Allocating 48519708672 bytes.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 5
Input [Text 0]: " Write a short blog post (500 words) about the best dog toys for new dog owners."
Output [Text 0 Beam 0]: " Write a short blog post (500 words) about the best dog toys for new dog owners. The post should include a list of the best dog toys for new dog owners, along with a brief description of each toy. The best dog toys for new dog owners are those that are durable, safe, and fun. Some of the best dog toys for new dog owners include: -A Kong toy: A Kong toy is a durable, safe, and fun toy for new dog owners. Kong toys come in a variety of shapes and sizes, and they are perfect for dogs of all sizes. -A Nylabone: A Nylabone is a durable, safe, and fun toy for new dog owners. Nylabones come in a variety of shapes and sizes, and they are perfect for dogs of all sizes. -A Frisbee: A Frisbee is a durable, safe, and fun toy for new dog owners. Frisbees come in a variety of colors and sizes, and they are perfect for dogs of all sizes. -A ball: A ball is a durable, safe, and fun toy for new dog owners. Balls come in a variety of colors and sizes, and they are perfect for dogs of all sizes. -A rope toy: A rope toy is a durable, safe, and fun toy for new dog owners. Rope toys come in a variety of colors and sizes, and they are perfect for dogs of all sizes. 
-A chew toy: A chew toy is a durable, safe, and fun toy for new dog owners. Chew toys come in a variety of shapes and sizes, and they are perfect for dogs of all sizes. -A bone: A bone is a durable, safe, and fun toy for new dog owners. Bones come in a variety of shapes and sizes, and they are perfect for dogs of all sizes. -A toy that is made of a durable material: A toy that is made of a durable material is a durable, safe, and fun toy for new dog owners. Toys that are made of a durable material come in a variety of shapes and sizes, and they are perfect for dogs of all"

Expected behavior

For the same prompt, these two methods should produce the same output.

Actual behavior

The outputs differ for the same prompt.

Additional notes

None.

byshiue commented 1 month ago

Is it related to add_special_tokens? Could you check the input IDs in the two cases?
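One quick way to compare them (a minimal sketch using the Hugging Face tokenizer, with the target model path from the reproduction steps above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/mnt/volumes/cloudmodel-muses/lt/models/Llama-2-7b-hf/")

prompt = ("Write a short blog post (500 words) about the best dog toys "
          "for new dog owners.")

# run.py encodes with special tokens by default, so Llama's BOS token is
# prepended; if the Triton preprocessor encodes without it, the two paths
# feed different input IDs to the same engine.
print(tokenizer.encode(prompt, add_special_tokens=True))
print(tokenizer.encode(prompt, add_special_tokens=False))
```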

Graham1025 commented 1 month ago

Yes, it is an add_special_tokens problem. After setting add_special_tokens = True on the Triton server, the Triton server output is the same as the run.py engine output! Thanks.

Graham1025 commented 1 month ago

By the way, could you tell me how to enable the generate_stream function in the self._spec_generate method? Thanks!

byshiue commented 1 month ago

The issue is fixed in the latest update of the main branch. Closing this issue.

For generate_stream, I am not sure what you are referring to. To avoid confusion, we should not discuss different topics in one issue. Could you create another issue for your question?

Graham1025 commented 1 month ago

Got it