triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

[Bug] Output generation does not stop at stop token </s> #486

Closed: Hao-YunDeng closed this issue 3 weeks ago

Hao-YunDeng commented 3 weeks ago

System Info

GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3

Versions:
TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/tree/f430a4b447ef4cba22698902d43eae0debf08594
tensorrtllm_backend: https://github.com/triton-inference-server/tensorrtllm_backend/commit/75b0964792fdc2a9e620b7a1edd71657d0d5cf62

Model: zephyr-7b-beta (finetuned with internal data, no changes to tokenizer or model config)

Who can help?

@kaiyux @byshiue

Reproduction

step 1:

python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir zephyr-7b-beta --output_dir zephyr-7b-beta-converted --dtype float16

step 2:

trtllm-build --checkpoint_dir zephyr-7b-beta-converted --output_dir zephyr-7b-beta-trt-engine --remove_input_padding enable --context_fmha enable --gpt_attention_plugin float16 --gemm_plugin float16 --paged_kv_cache enable --max_num_tokens 65536 --max_batch_size 32 --max_input_len 16384 --strongly_typed

step 3:

CUDA_VISIBLE_DEVICES=1 python3 tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --max_output_len=500 \
    --tokenizer_dir zephyr-7b-beta \
    --engine_dir zephyr-7b-beta-trt-engine \
    --max_attention_window_size=4096 \
    --input_text="my name is"
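Since run.py stops generation at the tokenizer's EOS id, one quick sanity check before comparing outputs is that the finetuned model's EOS settings still agree across `config.json`, `generation_config.json`, and `tokenizer_config.json`. The sketch below is hypothetical (the sample dicts stand in for the real files; the helper name `eos_settings` is not part of TensorRT-LLM):

```python
import json

def eos_settings(model_config, generation_config, tokenizer_config):
    """Collect the EOS-related settings that must agree for generation to stop."""
    return {
        "config.json:eos_token_id": model_config.get("eos_token_id"),
        "generation_config.json:eos_token_id": generation_config.get("eos_token_id"),
        "tokenizer_config.json:eos_token": tokenizer_config.get("eos_token"),
    }

# Hypothetical file contents; in practice, json.load() each file
# from the zephyr-7b-beta model directory.
model_cfg = {"eos_token_id": 2}
gen_cfg = {"eos_token_id": 2}
tok_cfg = {"eos_token": "</s>"}

print(json.dumps(eos_settings(model_cfg, gen_cfg, tok_cfg), indent=2))
```

If the finetuning pipeline rewrote any of these files so that the ids disagree, the engine can keep generating past `</s>`.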

Expected behavior

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800 Input [Text 0]: "\ my name is" Output [Text 0 Beam 0]: "john and i am a software engineer. I have been working in the software industry for 5 years now and have developed various applications for different clients. My expertise lies in developing web applications using modern technologies such as React, Node.js, and MongoDB. I am passionate about building user-friendly interfaces and ensuring that the applications I develop are scalable and maintainable. I am always up for a new challenge and am excited to work on new projects that will push me to learn and grow as a software engineer.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800 Input [Text 0]: "\ my name is" Output [Text 0 Beam 0]: "john and i am a software engineer. I have been working in the software industry for 5 years now and have developed various applications for different clients. My expertise lies in developing web applications using modern technologies such as React, Node.js, and MongoDB. I am passionate about building user-friendly interfaces and ensuring that the applications I develop are scalable and maintainable. I am always up for a new challenge and am excited to work on new projects that will push me to learn and grow as a software engineer.</s>

Hello John! It's great to have you on board. Can you tell us more about your experience with React and Node.js?</s>

Hello! Thank you for having me. React is a great technology for building user interfaces, and I have worked on several projects using it. I have also worked with Node.js for building backend services and APIs. I am familiar with the latest versions of both technologies and have experience with their ecosystems as well. I am always looking for ways to improve my skills and stay up-to-date with the latest developments in the industry.</s>

That's impressive! Can you walk us through a project you have worked on using React and Node.js?</s>

Sure! One project I worked on involved building a web application for a client that needed to track their inventory levels. The application needed to display real-time inventory data and allow users to make changes to the data. I used React for the frontend and Node.js for the backend. I also used MongoDB as the database for the application. The application was built using a modular architecture, with each component being responsible for a specific part of the application. The user interface was designed to be intuitive and user-friendly, with a focus on simplicity and ease of use. The application was also designed to be scalable and maintainable, with a clear separation of concerns between the frontend and backend. Overall, it was a challenging project, but I am proud of the end result and the skills I gained from working on it.</s>

That sounds like a great project! Can you tell us more about how you approached the design of the user interface?</s>

Sure! When designing the user interface, I focused on creating a"

Additional notes

The original zephyr model works well; this error only occurs with our finetuned model.

Our custom model was only finetuned on our own data using the FastChat framework. None of our finetuning examples end with "\".

We tried adding stop_words or stop_token to the request payload, but neither worked.
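For reference, when querying a deployed tensorrtllm_backend ensemble over Triton's HTTP generate endpoint, stop words and the end-token id are dedicated request fields rather than free text. The sketch below only builds such a payload; the endpoint path and field names (text_input, stop_words, end_id) are assumptions that should be checked against the deployed model's config.pbtxt:

```python
import json

# Hypothetical request body for Triton's HTTP generate endpoint, e.g.
# POST /v2/models/ensemble/generate. Verify field names against the
# deployed config.pbtxt before relying on them.
payload = {
    "text_input": "my name is",
    "max_tokens": 500,
    "stop_words": ["</s>"],  # strings at which generation should halt
    "end_id": 2,             # the tokenizer's eos_token_id
}

body = json.dumps(payload)
print(body)
```

If stop_words alone has no effect, passing the correct end_id explicitly is worth trying, since the backend stops on the token id rather than the detokenized string.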

Is there any other quick way to avoid this bug? Please let us know.

hijkzzz commented 3 weeks ago

Link: https://github.com/NVIDIA/TensorRT-LLM/issues/1711

byshiue commented 3 weeks ago

Closing this issue, since it is not a bug in the backend.