Got repeated answers that are usually fewer than three words
System Info
CPU architecture: x86_64
CPU/Host memory size: 32G
GPU properties: SM86
GPU name: NVIDIA A10
GPU memory size: 24G
Clock frequencies used: 1695 MHz
Libraries
TensorRT-LLM: v0.9.0
TensorRT: 9.3.0.post12.dev1 ("dpkg -l | grep nvinfer" reports 8.6.3)
CUDA: 12.3
Container used: 24.04-trtllm-python-py3
NVIDIA driver version: 535.161.08
OS: Ubuntu 22.04
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
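The reproduction steps did not survive in this report. For orientation only, a typical way to serve a TensorRT-LLM engine with this backend (an assumption based on the repository's documented workflow, not the reporter's exact commands; all paths and parameter values are placeholders):

cd tensorrtllm_backend
# Start from the inflight-batcher model repository template shipped with the repo.
cp -r all_models/inflight_batcher_llm/ triton_model_repo/
# Point the tensorrt_llm model at the engine directory (placeholder path);
# fill_template.py and these parameter names follow the backend's documentation.
# The preprocessing/postprocessing configs also need the tokenizer path filled in.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    engine_dir:/path/to/trt_engines/fp16/1-gpu/,triton_max_batch_size:8,decoupled_mode:False,batching_strategy:inflight_fused_batching
# Launch Triton on a single GPU.
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=triton_model_repo/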
Expected behavior
The model answers the question normally.
Actual behavior
Got repeated answers that are usually fewer than three words:

![image](https://github.com/triton-inference-server/tensorrtllm_backend/assets/29043558/021465cb-b728-4c93-93fc-0e852f010c42)
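The exact client call is not shown in the report; assuming the engine was deployed through this backend, a request like the following (a hypothetical sketch using the backend's documented generate endpoint and the ensemble model name from its examples) is the kind of call that produces the truncated, repeated output:

curl -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "How to build tensorrt engine?", "max_tokens": 100, "bad_words": "", "stop_words": ""}'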
Additional notes
When I simply run TensorRT-LLM locally for inference, as shown by the example in the TensorRT-LLM repository:

python3 ../run.py --tokenizer_dir ./tmp/llama/8B/ \
    --engine_dir=./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
    --input_text "How to build tensorrt engine?" \
    --max_output_len 100

the model can answer normally.