triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Filtering beam_search output tensors results in a string output vs list #418

Open nikhilshandilya opened 5 months ago

nikhilshandilya commented 5 months ago

We have implemented a custom postprocessing step for beam search decoding that filters some candidates out of the final beam output. When only one candidate remains, we expect the response to be a list of strings of length 1, but the response is instead a bare string. This causes parsing problems in the client.

For example, output with more than one suggestion:

curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"input":"nik"}'

{"model_name":"tensorrt_llm_bls","model_version":"1","output":["nike socks","nike sweatpants","nike sweatshirt","nike hoodie","nike womens sneakers","nike sweatpants for men","nike womens sweatpants","nike sweatshirt men","nike shoe laces","nike air max 270 men"]}

Here the output is a list of strings. However, when filtering leaves only one output, we get a string instead:

curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"input":"adult t shirts with dogs on them"}'

{"model_name":"tensorrt_llm_bls","model_version":"1","output":"adult t shirts with dogs on them"}

In this case the output is a string rather than a list of strings, i.e. we expect ["adult t shirts with dogs on them"] and not "adult t shirts with dogs on them".

How do we ensure the output is always a list?
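
Until the server-side behavior is clarified, one possible client-side workaround is to normalize the field before parsing it. The following is a minimal Python sketch, not part of the backend: `normalize_output` is a hypothetical helper name, and it assumes only the response shape shown above, where the `output` field holds either a single string or a list of strings.

```python
import json
from typing import Any, List


def normalize_output(response_body: str, field: str = "output") -> List[str]:
    """Return the named response field as a list of strings.

    Hypothetical helper: assumes the field is either a bare string or a
    list of strings, as in the two responses shown in this issue.
    """
    payload: Any = json.loads(response_body)
    value = payload.get(field)
    if value is None:
        # Field missing or null: treat as no suggestions.
        return []
    if isinstance(value, str):
        # Single result returned as a bare string: wrap it in a list.
        return [value]
    # Already a list: copy it, coercing each item to str.
    return [str(item) for item in value]


if __name__ == "__main__":
    single = (
        '{"model_name":"tensorrt_llm_bls","model_version":"1",'
        '"output":"adult t shirts with dogs on them"}'
    )
    multiple = (
        '{"model_name":"tensorrt_llm_bls","model_version":"1",'
        '"output":["nike socks","nike sweatpants"]}'
    )
    print(normalize_output(single))
    print(normalize_output(multiple))
```

With this in place the client always receives a list, whether the server returned one result or several. It does not change what the server sends, so it is a stopgap rather than a fix.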

config.pbtxt for tensorrt_llm_bls

name: "tensorrt_llm_bls"
backend: "python"
max_batch_size: 1

model_transaction_policy {
  decoupled: false
}

input [
  {
    name: "input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "context"
    data_type: TYPE_STRING
    dims: [ -1 ]
    optional: true
  },
  {
    name: "task"
    data_type: TYPE_STRING
    dims: [ -1 ]
    optional: true
  }
]
output [
  {
    name: "output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

parameters: {
  key: "accumulate_tokens"
  value: {
    string_value: "${accumulate_tokens}"
  }
}

instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]

byshiue commented 5 months ago

Could you explain in more detail how you filter the outputs?