triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Got repeated answer while deploying LLaMA3-Instruct-8B model in triton server #487

Closed. AndyZZt closed this issue 3 weeks ago.

AndyZZt commented 3 weeks ago

System Info

CPU architecture: x86_64
CPU/Host memory size: 32 GB
GPU properties: SM86
GPU name: NVIDIA A10
GPU memory size: 24 GB
Clock frequencies used: 1695 MHz

Libraries

TensorRT-LLM: v0.9.0
TensorRT: 9.3.0.post12.dev1 (dpkg -l | grep nvinfer reports 8.6.3)
CUDA: 12.3
Container used: 24.04-trtllm-python-py3
NVIDIA driver version: 535.161.08
OS: Ubuntu 22.04

Who can help?

No response

Reproduction

docker exec -it trtllm1 /bin/bash
mamba deactivate
mamba deactivate

# git from correct branch
git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git  
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git  

# build trt engines
cd TensorRT-LLM
trtllm-build --checkpoint_dir ../Work/TensorRT-LLM/examples/llama/tllm_checkpoint_1gpu_tp1 \
            --output_dir ./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
            --remove_input_padding enable \
            --gpt_attention_plugin float16 --gemm_plugin float16 \
            --context_fmha enable --paged_kv_cache enable \
            --streamingllm enable \
            --use_paged_context_fmha enable --enable_chunked_context \
            --use_context_fmha_for_generation enable \
            --max_input_len 512 --max_output_len 512 \
            --max_batch_size 64

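If the build succeeds, the output directory should already contain the files the next step copies; a quick sanity check (same paths as above):

# the build output should contain rank0.engine and config.json
ls ./tmp/llama/8B/trt_engines/fp16/1-gpu/
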
# copy rank0.engine & config.json
cd ../tensorrtllm_backend
cp ../TensorRT-LLM/tmp/llama/8B/trt_engines/fp16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/

# model configuration
export HF_LLAMA_MODEL=/path/to/llama3-8B-Instruct-hf
export ENGINE_PATH=/path/to/tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,repetition_penalty:0.9,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,decoding_mode:top_p,enable_chunked_context:True,batch_scheduler_policy:max_utilization,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:v1,enable_trt_overlap:True,max_queue_delay_microseconds:0

# launch triton-server
python3 scripts/launch_triton_server.py --model_repo=all_models/inflight_batcher_llm --world_size 1

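Before sending requests, it can be worth confirming that the server actually came up. A minimal readiness check, assuming the default HTTP port 8000 (the same port the curl request below uses):

# Triton's standard readiness endpoint; prints 200 once all models are loaded
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
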
# send request via curl
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "what are flowers","max_tokens": 100,"bad_words":[""],"stop_words":["<|eot_id|>"]}'

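Side note: the generate endpoint maps extra JSON fields to the ensemble's optional inputs, so sampling parameters can also be passed explicitly if needed. A hedged sketch, assuming the deployed ensemble config defines inputs named top_p, temperature, and repetition_penalty:

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "what are flowers", "max_tokens": 100, "bad_words": [""], "stop_words": ["<|eot_id|>"], "top_p": 0.9, "temperature": 0.7, "repetition_penalty": 1.1}'
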
Expected behavior

The model answers the question normally.

Actual behavior

The model returns a repetitive answer, usually a phrase of fewer than three words repeated over and over (see the attached screenshot).

Additional notes

When I simply run TensorRT-LLM inference locally, as the example in the TensorRT-LLM repository shows:

python3 ../run.py --tokenizer_dir ./tmp/llama/8B/ \
                  --engine_dir=./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
                  --input_text "How to build tensorrt engine?" \
                  --max_output_len 100

the model answers normally.

hijkzzz commented 3 weeks ago

Link: https://github.com/NVIDIA/TensorRT-LLM/issues/1713

byshiue commented 3 weeks ago

It looks like this issue is not related to the backend, but to bugs in the TRT-LLM core. So, closing this bug; please continue the discussion in https://github.com/NVIDIA/TensorRT-LLM/issues/1713.