Open forrestjgq opened 11 months ago
I am not sure what you mean by "seems the input text is inserted before input text and after an extra bos". In general, TRT-LLM's responses have the form <input_text> <output_text> <padding>, and the result you shared looks reasonable to me.
In TensorRT-LLM's run.py, we added a slice so that only the <output_text> is printed.
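To make the <input_text> <output_text> <padding> layout concrete, here is a minimal sketch of that kind of slice. It is not the actual run.py code; the function name, the token ids, and the assumption that padding reuses a single pad id are all illustrative.

```python
def extract_output_ids(sequence, input_len, pad_id):
    """Return only the generated tokens from [input | output | padding].

    Hypothetical helper: `sequence` is a flat list of token ids,
    `input_len` is the number of prompt tokens at the front, and
    `pad_id` is the id used to fill the tail of the buffer.
    """
    # Drop the echoed prompt tokens at the front.
    generated = sequence[input_len:]
    # Drop the trailing padding tokens at the back.
    end = len(generated)
    while end > 0 and generated[end - 1] == pad_id:
        end -= 1
    return generated[:end]


# 3 prompt tokens, 2 generated tokens, 2 padding tokens (pad id 2).
print(extract_output_ids([1, 5, 6, 7, 8, 2, 2], input_len=3, pad_id=2))
```

If the pad id is the same as the end-of-sequence id, this also removes the end-of-sequence token, which is usually what you want when printing text.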
@byshiue But I cannot find a config in Triton server to respond with only the
Sorry for the typo; it should be "input text is inserted before output text".
It is more reasonable to respond with only the output text, which is the only information we need. Or, as @Lzhang-hub says, a configuration option for this would be nice!
Could you try setting exclude_input_in_output (https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L261-L265) to True on the latest main branch?
Like this?
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "True"
  }
}
@byshiue
Yes.
@byshiue I updated the backend to main along with tensorrt-llm, rebuilt the Docker image myself using dockerfile/Dockerfile.trt_llm_backend, and ran Triton as the README describes (which works for the 0.5 release). Triton server now reports errors when parsing the pbtxt files in the model repo.
I checked and found that a few changes have been made to the pbtxt configs, such as max_batch_size, max_beam_width, and max_kv_cache_length. I assume they are not being replaced with the correct values at runtime.
How do I fix it?
You could set them to the values of max_batch_size, max_beam_width, and max_input_len + max_output_len that you used when building your engine.
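One way to check whether a config still contains values that were never filled in is to scan it for leftover placeholders. The sketch below assumes the templates use ${name}-style placeholders; the helper name and the sample text are made up for illustration and are not part of the backend.

```python
import re

# Assumption: unfilled template values look like ${max_batch_size}.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def unfilled_placeholders(pbtxt_text):
    """Return the sorted names of placeholders still present in the text."""
    return sorted(set(PLACEHOLDER.findall(pbtxt_text)))


# Illustrative fragment, not a complete config.pbtxt.
sample = """
max_batch_size: ${max_batch_size}
parameters: {
  key: "max_beam_width"
  value: { string_value: "${max_beam_width}" }
}
parameters: {
  key: "exclude_input_in_output"
  value: { string_value: "True" }
}
"""

print(unfilled_placeholders(sample))
```

An empty list means every placeholder of that form has been replaced; any name that is printed still needs a concrete value before Triton can parse the file.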
When I run this query in tensorrt-llm with llama-2-7b-chat:
<s>[INST] <<SYS>>\n\n You are a helpful assistant.\n\n <</SYS>>\n\n Who won the world series in 2020? [/INST] The Los Angeles Dodgers won the World Series in 2020. </s><s>[INST] Where was it played? [/INST]
it responds with:
The 2020 World Series was played at Globe Life Park in Arlington, Texas, and Minute Maid Park in Houston, Texas.</s></s></s></s>....
But if I query Triton over HTTP through the generate API:
{ "text_input": "<s>[INST] <<SYS>>\n\n You are a helpful assistant.\n\n <</SYS>>\n\n Who won the world series in 2020? [/INST] The Los Angeles Dodgers won the World Series in 2020. </s><s>[INST] Where was it played? [/INST] ", "max_tokens": 1000, "bad_words": "", "stop_words": "", "top_p": 1, "temperature": 1, "presence_penalty": 0 }
it responds:
{ "choices": [ { "finish_reason": "stop", "index": 0, "message": { "content": "<s><s> [INST] <<SYS>>\n\n You are a helpful assistant.\n\n <</SYS>>\n\n Who won the world series in 2020? [/INST] The Los Angeles Dodgers won the World Series in 2020. </s><s> [INST] Where was it played? [/INST] The 2020 World Series was played between the Tampa Bay Rays and the Los Angeles Dodgers. The series was held at Globe Life Field in Arlington, Texas, and the Dodgers won the series 4 games to 2.</s><s> geometry-help\nA regular hexagon has a perimeter of 64 cm. If the length of one side is 8 cm, what is the length of the other five sides? </s></s> ....garbage...", "role": "" } } ], "created": 1700447258, "id": "", "model": "gpt-3.5-turbo", "object": "chat.completion" }
It seems the input text is inserted before the output text, after an extra BOS. How can I fix this?
P.S. Please ignore the text after
<s> geometry-help\nA regular hexagon
which should be garbage data and may be a bug in TRT-LLM.
Has the problem been solved?
The outputs are correct and expected. You can get only the output text by setting exclude_input_in_output as mentioned above. What's your question?
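If changing the server config is not an option, the echoed prompt can also be removed on the client. This is only a rough sketch under stated assumptions: the helper name is hypothetical, the BOS string is taken from the llama-2 example in this thread, and the match ignores whitespace because the detokenized echo does not reproduce the prompt's spacing exactly (for example "<s>[INST]" comes back as "<s><s> [INST]").

```python
def strip_echoed_prompt(text_output, prompt, bos="<s>"):
    """Remove an echoed prompt and leading BOS markers from text_output.

    Returns text_output unchanged if the prompt is not found at the start.
    """
    def skip_bos(s):
        # Drop any number of leading BOS markers and surrounding whitespace.
        s = s.lstrip()
        while bos and s.startswith(bos):
            s = s[len(bos):].lstrip()
        return s

    text = skip_bos(text_output)
    wanted = "".join(skip_bos(prompt).split())  # prompt with whitespace removed
    i = 0  # position in `text`
    for ch in wanted:
        # Skip whitespace in the echoed text before comparing characters.
        while i < len(text) and text[i].isspace():
            i += 1
        if i >= len(text) or text[i] != ch:
            return text_output  # prompt was not echoed; leave as-is
        i += 1
    return text[i:].lstrip()


prompt = "<s>[INST] Where was it played? [/INST] "
echoed = "<s><s> [INST] Where was it played? [/INST] The 2020 World Series</s>"
print(strip_echoed_prompt(echoed, prompt))
```

Setting exclude_input_in_output on the server remains the cleaner fix, since it avoids sending the prompt back at all.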
Version: 0.6.1
Server start command:
CUDA_VISIBLE_DEVICES=1 python3 scripts/launch_triton_server.py --world_size 1 --model_repo /tensorrtllm_backend/triton_model_repo/
Client request command:
curl -X POST localhost:8035/v2/models/ensemble/generate -d '{"text_input": "import numpy", "max_tokens": 50, "bad_words": "", "stop_words":"\n"}'
Response:
{"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"<|begin▁of▁sentence|>import numpy as np\n\n"}
I want to format the returned content like OpenAI's, but I don't know what to change.
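One option is to reshape the response on the client side, or in a small proxy in front of Triton. The sketch below wraps the generate response in the same chat.completion shape that appears earlier in this thread. The function name, the default model name, the fixed "stop" finish reason, and the "assistant" role are assumptions for illustration; the BOS string is copied from the response above.

```python
import time


def to_chat_completion(triton_response, model="my-model",
                       bos="<|begin▁of▁sentence|>"):
    """Wrap a Triton generate response in an OpenAI-style chat.completion dict."""
    text = triton_response["text_output"]
    # Drop a leading BOS marker if the server included one.
    if bos and text.startswith(bos):
        text = text[len(bos):]
    return {
        "id": "",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,  # hypothetical label, not reported by Triton
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",  # assumed; not derived from the response
            }
        ],
    }


response = {"model_name": "ensemble",
            "text_output": "<|begin▁of▁sentence|>import numpy as np\n\n"}
print(to_chat_completion(response)["choices"][0]["message"]["content"])
```

A fuller version would also need the prompt removed from text_output (for example via exclude_input_in_output) and a real finish reason.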