triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

There is a problem with llama 7B model pre-processing after using triton server #445

Closed Graham1025 closed 1 month ago

Graham1025 commented 1 month ago

System Info

TensorRT-LLM: v0.9.0
TensorRT-LLM backend: v0.9.0
GPU: A100

Who can help?

@schetlur-nv

Reproduction

  1. Build the engine:

python3 convert_checkpoint.py --workers 8 --tp_size 1 --dtype float16 --model_dir /mnt/volumes/cloudmodel-muses/lt/models/Llama-2-7b-hf/ --output_dir /lpai/trtllm_checkpoints/Llama-2-7b-hf-w16a16-no-int8kvcache

trtllm-build --checkpoint_dir /lpai/trtllm_checkpoints/Llama-2-7b-hf-w16a16-no-int8kvcache --output_dir /lpai/trt_engines/llama/7B/trt_engines/fp16-fp16-hf/1-gpu --gemm_plugin float16 --max_batch_size 64 --max_input_len 3000 --max_output_len 2048 --max_draft_len 10 --use_paged_context_fmha enable
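After the build, the engine directory contains a config.json with the resolved build parameters; a quick sanity check is sketched below (the "build_config" key layout follows the v0.9.0 builder output and may differ in other versions):

```python
import json
import pathlib

# Engine directory produced by the trtllm-build command above.
engine_dir = pathlib.Path("/lpai/trt_engines/llama/7B/trt_engines/fp16-fp16-hf/1-gpu")
cfg = json.loads((engine_dir / "config.json").read_text())

# The limits baked into the engine; the Triton configs below must stay within them.
print(cfg["build_config"]["max_batch_size"])  # expect 64
print(cfg["build_config"]["max_input_len"])   # expect 3000
```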

  2. Set the config:

export DRAFT_HF_LLAMA_MODEL=/mnt/volumes/cloudmodel-muses/lt/models/llama-68m
export TARGET_HF_LLAMA_MODEL=/mnt/volumes/cloudmodel-muses/lt/models/Llama-2-7b-hf/
export TARGET_ENGINE_PATH=/lpai/trt_engines/llama/7B/trt_engines/fp16/1-gpu
export DRAFT_ENGINE_PATH=/lpai/trt_engines/llama/68m/trt_engines/fp16/1-gpu
cp all_models/spec_decode llama_7b -r

python3 tools/fill_template.py -i llama_7b/preprocessing/config.pbtxt tokenizer_dir:${DRAFT_HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_7b/postprocessing/config.pbtxt tokenizer_dir:${TARGET_HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_7b/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_7b/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i llama_7b/tensorrt_llm_target/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${TARGET_ENGINE_PATH},max_tokens_in_paged_kv_cache:25600,max_attention_window_size:25600,kv_cache_free_gpu_mem_fraction:0.9,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
python3 tools/fill_template.py -i llama_7b/tensorrt_llm_draft/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${DRAFT_ENGINE_PATH},max_tokens_in_paged_kv_cache:25600,max_attention_window_size:25600,kv_cache_free_gpu_mem_fraction:0.9,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
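To confirm the substitutions took effect before starting the server, a quick check is sketched below (a minimal sketch, assuming the template files use the same ${...} placeholder names as the fill_template keys):

```python
import pathlib

# Generated config from the fill_template.py commands above.
cfg = pathlib.Path("llama_7b/preprocessing/config.pbtxt").read_text()

# fill_template.py substitutes ${...} placeholders in place, so the
# batch-size placeholder should no longer appear after the commands above.
print("${triton_max_batch_size}" not in cfg)  # expect True
```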

  3. Start the server:

python3 scripts/launch_triton_server.py --world_size=1 --model_repo /lpai/tensorrtllm_backend/llama_7b/
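Once launched, readiness can be polled on Triton's standard health endpoint (a minimal sketch; it assumes the requests package is installed and that the HTTP port is mapped to 7000, matching the curl request in the next step, rather than Triton's default 8000):

```python
import requests  # third-party: pip install requests

# Returns HTTP 200 once every model in the repository has loaded.
resp = requests.get("http://localhost:7000/v2/health/ready")
print(resp.status_code)
```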

  4. Test the result with different methods:

triton server:

curl -X POST localhost:7000/v2/models/ensemble/generate -d '{"text_input": "Write a short blog post (500 words) about the best dog toys for new dog owners.", "max_tokens": 512, "bad_words": "", "stop_words": "", "temperature":1.0, "top_k":1, "top_p":0.0, "length_penalty":1.0, "repetition_penalty": 1.0, "presence_penalty":0.0, "frequency_penalty":0.0}'

{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"Write a short blog post (500 words) about the best dog toys for new dog owners.\nWrite a short blog post (500 words) about the best dog toys for new dog owners.\nWrite a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. 
Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog post (500 words) about the best dog toys for new dog owners. Write a short blog"}
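For completeness, the same generate request can be issued from Python (a sketch equivalent to the curl above; assumes the requests package is installed):

```python
import requests  # third-party: pip install requests

# Mirrors the curl request to the ensemble model's generate endpoint.
payload = {
    "text_input": "Write a short blog post (500 words) about the best dog "
                  "toys for new dog owners.",
    "max_tokens": 512,
    "bad_words": "",
    "stop_words": "",
    "temperature": 1.0,
    "top_k": 1,
    "top_p": 0.0,
}
resp = requests.post("http://localhost:7000/v2/models/ensemble/generate",
                     json=payload)
print(resp.json()["text_output"])
```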

run.py engine:

python3 run.py --max_output_len=512 --tokenizer_dir /mnt/volumes/cloudmodel-muses/lt/models/Llama-2-7b-hf/ --engine_dir=/lpai/trt_engines/llama/7B/trt_engines/fp16-fp16-hf/1-gpu --input_text "Write a short blog post (500 words) about the best dog toys for new dog owners."

[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Loaded engine size: 12855 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13009, GPU 14255 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 13011, GPU 14265 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13054, GPU 29849 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13054, GPU 29857 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 92544. Allocating 48519708672 bytes.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 5
Input [Text 0]: " Write a short blog post (500 words) about the best dog toys for new dog owners."
Output [Text 0 Beam 0]: " Write a short blog post (500 words) about the best dog toys for new dog owners. The post should include a list of the best dog toys for new dog owners, along with a brief description of each toy. The best dog toys for new dog owners are those that are durable, safe, and fun. Some of the best dog toys for new dog owners include: -A Kong toy: A Kong toy is a durable, safe, and fun toy for new dog owners. Kong toys come in a variety of shapes and sizes, and they are perfect for dogs of all sizes. -A Nylabone: A Nylabone is a durable, safe, and fun toy for new dog owners. Nylabones come in a variety of shapes and sizes, and they are perfect for dogs of all sizes. -A Frisbee: A Frisbee is a durable, safe, and fun toy for new dog owners. Frisbees come in a variety of colors and sizes, and they are perfect for dogs of all sizes. -A ball: A ball is a durable, safe, and fun toy for new dog owners. Balls come in a variety of colors and sizes, and they are perfect for dogs of all sizes. -A rope toy: A rope toy is a durable, safe, and fun toy for new dog owners. Rope toys come in a variety of colors and sizes, and they are perfect for dogs of all sizes. 
-A chew toy: A chew toy is a durable, safe, and fun toy for new dog owners. Chew toys come in a variety of shapes and sizes, and they are perfect for dogs of all sizes. -A bone: A bone is a durable, safe, and fun toy for new dog owners. Bones come in a variety of shapes and sizes, and they are perfect for dogs of all sizes. -A toy that is made of a durable material: A toy that is made of a durable material is a durable, safe, and fun toy for new dog owners. Toys that are made of a durable material come in a variety of shapes and sizes, and they are perfect for dogs of all"

Expected behavior

For the same prompt, these two methods should produce the same output.

Actual behavior

The outputs differ for the same prompt.

Additional notes

None.

byshiue commented 1 month ago

Is it related to add_special_tokens? Could you check the input IDs in the two cases?
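One quick way to compare them (a minimal sketch using the Hugging Face tokenizer, with the target model path from the reproduction steps above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/mnt/volumes/cloudmodel-muses/lt/models/Llama-2-7b-hf/")

prompt = ("Write a short blog post (500 words) about the best dog toys "
          "for new dog owners.")

# run.py encodes with special tokens by default, so Llama's BOS token is
# prepended; if the Triton preprocessor encodes without it, the two paths
# feed different input IDs to the same engine.
print(tokenizer.encode(prompt, add_special_tokens=True))
print(tokenizer.encode(prompt, add_special_tokens=False))
```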

Graham1025 commented 1 month ago

Yes, it is an add_special_tokens problem. After setting add_special_tokens = True on the Triton server, the Triton server output is the same as the run.py engine output! Thanks.

Graham1025 commented 1 month ago

By the way, could you tell me how to enable the generate_stream function in the self._spec_generate method? Thanks!

byshiue commented 1 month ago

The issue is fixed in the latest update of the main branch. Closing this issue.

For generate_stream, I am not sure what you are referring to. To avoid confusion, we should not discuss different topics in one issue. Could you create another issue for your question?

Graham1025 commented 1 month ago

Got it