triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Triton prefixes the input text to the output #145

Open forrestjgq opened 11 months ago

forrestjgq commented 11 months ago

When I run this query in tensorrt-llm with llama-2-7b-chat:

<s>[INST] <<SYS>>\n\n You are a helpful assistant.\n\n <</SYS>>\n\n Who won the world series in 2020? [/INST] The Los Angeles Dodgers won the World Series in 2020. </s><s>[INST] Where was it played? [/INST]

it responds with:

 The 2020 World Series was played at Globe Life Park in Arlington, Texas, and Minute Maid Park in Houston, Texas.</s></s></s></s>....

but if I query Triton over HTTP through the generate API:

{
  "text_input": "<s>[INST] <<SYS>>\n\n You are a helpful assistant.\n\n <</SYS>>\n\n Who won the world series in 2020? [/INST] The Los Angeles Dodgers won the World Series in 2020. </s><s>[INST] Where was it played? [/INST] ",
  "max_tokens": 1000,
  "bad_words": "",
  "stop_words": "",
  "top_p": 1,
  "temperature": 1,
  "presence_penalty": 0
}

it responds:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "<s><s> [INST] <<SYS>>\n\n You are a helpful assistant.\n\n <</SYS>>\n\n Who won the world series in 2020? [/INST] The Los Angeles Dodgers won the World Series in 2020. </s><s> [INST] Where was it played? [/INST]  The 2020 World Series was played between the Tampa Bay Rays and the Los Angeles Dodgers. The series was held at Globe Life Field in Arlington, Texas, and the Dodgers won the series 4 games to 2.</s><s> geometry-help\nA regular hexagon has a perimeter of 64 cm. If the length of one side is 8 cm, what is the length of the other five sides? </s></s> ....garbage...",
        "role": ""
      }
    }
  ],
  "created": 1700447258,
  "id": "",
  "model": "gpt-3.5-turbo",
  "object": "chat.completion"
}

It seems the input text is inserted before the output text, after an extra BOS. How can I fix this?

PS: Please ignore the text after "<s> geometry-help\nA regular hexagon"; it should be garbage data and may be a bug in trt-llm.

byshiue commented 11 months ago

I am not sure what you mean by "seems the input text is inserted before input text and after an extra bos". In general, though, TRT-LLM's responses have the form <input_text> <output_text> <padding>, and the result you shared looks reasonable to me.

In TensorRT-LLM's run.py, we added a slice so that only the <output_text> is printed.
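
The slicing idea can be sketched in Python roughly as follows. This is only an illustration, not the actual code from run.py: the helper name slice_generated_tokens, the flat list of token ids, and the end_id value are all assumptions made for the example.

```python
from typing import List


def slice_generated_tokens(output_ids: List[int], input_len: int, end_id: int) -> List[int]:
    """Return only the generated tokens from an <input> <output> <padding> sequence.

    Hypothetical helper: assumes output_ids holds the prompt tokens first,
    then the generated tokens, then repeated end_id tokens as padding.
    """
    generated = output_ids[input_len:]  # drop the echoed prompt tokens
    # Cut at the first end-of-sequence token; everything after it is padding.
    if end_id in generated:
        generated = generated[: generated.index(end_id)]
    return generated


# Made-up token ids: 3 prompt tokens, 2 generated tokens, then padding (end_id = 2).
sliced = slice_generated_tokens([1, 15, 27, 40, 41, 2, 2, 2], input_len=3, end_id=2)
print(sliced)
```

Slicing by token count rather than by string prefix avoids a mismatch when the detokenized prompt differs slightly from the text that was sent (for example, an extra BOS or added spaces).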

Lzhang-hub commented 11 months ago

"In TensorRT-LLM run.py, we have added a slice to only print the <output_text>." @byshiue But I cannot find a config option in Triton server to respond with only the <output_text>. Could you give some advice or point to the docs? Thanks.

forrestjgq commented 11 months ago

Sorry for the typo; it should be "the input text is inserted before the output text".

It would be more reasonable to respond with only the output text, since that is the only information we need. Or, as @Lzhang-hub says, a configuration option for this would be nice!

byshiue commented 11 months ago

Could you try setting exclude_input_in_output (https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L261-L265) to True on the latest main branch?

forrestjgq commented 11 months ago

like this?

parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "True"
  }
}

@byshiue

byshiue commented 11 months ago

Yes.

forrestjgq commented 11 months ago

@byshiue I updated the backend to main along with tensorrt-llm, rebuilt the Docker image myself using dockerfile/Dockerfile.trt_llm_backend, and ran Triton as described in the README (which works for the 0.5 release). The Triton server now reports errors when parsing the pbtxt files in the model repo.

I checked and found that a few changes have been made to the pbtxt configs, such as max_batch_size, max_beam_width, and max_kv_cache_length. I assume they are not being replaced with the correct values at runtime.

How do I fix it?

byshiue commented 11 months ago

You could set these to the values used when building your engine: max_batch_size, max_beam_width, and max_input_len + max_output_len.
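
As a rough sketch of filling those values in, the snippet below replaces placeholders in a config fragment with Python's string.Template. It assumes the pbtxt files contain ${name}-style placeholders; the fragment, the fill_config helper, and the values 8 and 1 are made up for illustration and should be replaced with whatever your engine was built with.

```python
from string import Template

# Hypothetical fragment of a config.pbtxt template; the real file has more fields.
CONFIG_TEMPLATE = """\
max_batch_size: ${max_batch_size}
parameters: {
  key: "max_beam_width"
  value: {
    string_value: "${max_beam_width}"
  }
}
"""


def fill_config(template_text: str, values: dict) -> str:
    """Replace ${name} placeholders; placeholders without a value are left untouched."""
    return Template(template_text).safe_substitute(values)


# Example values only; they must match the engine build settings.
filled = fill_config(CONFIG_TEMPLATE, {"max_batch_size": 8, "max_beam_width": 1})
print(filled)
```

Using safe_substitute means any placeholder that was missed stays visible in the output as ${...}, which makes it easy to spot before starting the server.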

shatealaboxiaowang commented 10 months ago

Has the problem been solved?

byshiue commented 9 months ago

The outputs are correct and expected, and you can get only the output text by setting exclude_input_in_output as mentioned above. What's your question?

shatealaboxiaowang commented 9 months ago

Version: 0.6.1

The server start command is:

CUDA_VISIBLE_DEVICES=1 python3 scripts/launch_triton_server.py --world_size 1 --model_repo /tensorrtllm_backend/triton_model_repo/

The client request command is:

curl -X POST localhost:8035/v2/models/ensemble/generate -d '{"text_input": "import numpy", "max_tokens": 50, "bad_words": "", "stop_words":"\n"}'

The response is:

{"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"<|begin▁of▁sentence|>import numpy as np\n\n"}

I want to format the returned content like OpenAI does, but I don't know what to change.
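
One possible approach is to reshape the response on the client side. The sketch below wraps a generate response in a dict shaped like the chat.completion example earlier in this thread. The helper name to_chat_completion is hypothetical, the "assistant" role and the fixed "stop" finish_reason are assumptions, and the BOS string has to match the tokenizer actually in use.

```python
import time


def to_chat_completion(triton_response: dict, model: str, bos_token: str = "<s>") -> dict:
    """Wrap a Triton generate response in a chat.completion-shaped dict.

    Hypothetical helper: the output field names follow the response shown
    earlier in this thread; finish_reason is hard-coded as an assumption.
    """
    text = triton_response.get("text_output", "")
    # Strip a leading BOS marker if the backend echoed one.
    if text.startswith(bos_token):
        text = text[len(bos_token):]
    return {
        "id": "",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
    }


# Response fields copied from the example above (trimmed to the ones used here).
response = {"model_name": "ensemble", "text_output": "<|begin▁of▁sentence|>import numpy as np\n\n"}
result = to_chat_completion(response, model="ensemble", bos_token="<|begin▁of▁sentence|>")
print(result["choices"][0]["message"]["content"])
```

This only changes the shape of the reply; whether the prompt is echoed in text_output is still controlled by exclude_input_in_output on the server side.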