triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Add usage in response like openai? #202

npuichigo opened this issue 10 months ago (status: Open)

npuichigo commented 10 months ago

https://platform.openai.com/docs/api-reference/completions/object#completions/object-usage What about adding usage to the trt ensemble models so they return token usage like OpenAI does? At least the prompt and output token lengths. It would make it easier to provide an OpenAI-compatible API.
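For reference, the usage field in an OpenAI completions response looks roughly like this (values are illustrative); the ask here is to surface the same counts from the Triton ensemble:

```python
# Illustrative shape of OpenAI's "usage" field (values made up).
usage = {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21,
}
```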

shatealaboxiaowang commented 9 months ago

> https://platform.openai.com/docs/api-reference/completions/object#completions/object-usage What about adding usage to the trt ensemble models so they return token usage like OpenAI does? At least the prompt and output token lengths. It would make it easier to provide an OpenAI-compatible API.

Have you solved the problem?

npuichigo commented 9 months ago

not yet

shatealaboxiaowang commented 9 months ago

> not yet

Do you know how to do it? Any ideas?

npuichigo commented 9 months ago

I think you could customize the logic in the postprocessing and preprocessing models to do the calculation.
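A minimal sketch of what that calculation could look like (illustrative names, not the repo's actual code): the preprocessing model sees the tokenized prompt and the postprocessing model sees the generated ids, so each side can report its own length.

```python
# Sketch only: where the two counts could be computed in the Python backend
# models. `input_ids` and `sequence_lengths` are illustrative names.
import numpy as np

def prompt_token_count(input_ids):
    # Preprocessing side: one prompt-token count per request in the batch.
    return np.array([[len(ids)] for ids in input_ids], dtype=np.int32)

def completion_token_count(sequence_lengths):
    # Postprocessing side: the generation lengths reported by the runtime
    # already are the completion-token counts.
    return np.asarray(sequence_lengths, dtype=np.int32)
```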

shatealaboxiaowang commented 9 months ago

> I think you could customize the logic in the postprocessing and preprocessing models to do the calculation.

Thank you. I tried, but it didn't work.

michaelnny commented 5 months ago

I managed to get output_token_len into the output, but I can't add input_token_len, since that information is not directly passed down through the pipeline to the postprocessing model.

Here's how to do it:

We need to create a new output field in the postprocessing model, and make small changes to the code to handle the information retrieval and output.

The first step is to modify postprocessing/config.pbtxt and add the following to its output section:


```
output [
  {
    name: "OUTPUT_TOKEN_LEN"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },

  ...

]
```

Then we need to change postprocessing/1/model.py to add the logic that builds and appends the tensor for the new output field.


```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def initialize(self, args):

        ...

        # Parse model output configs for both the decoded text and the new
        # token-length output.
        output_names = ["OUTPUT", "OUTPUT_TOKEN_LEN"]
        for output_name in output_names:
            setattr(
                self,
                output_name.lower() + "_dtype",
                pb_utils.triton_string_to_numpy(
                    pb_utils.get_output_config_by_name(
                        model_config, output_name)['data_type']))

    def execute(self, requests):

        ...

        # Number of generated tokens, taken from the sequence lengths
        # reported by the tensorrt_llm model.
        output_token_len_tensor = pb_utils.Tensor(
            'OUTPUT_TOKEN_LEN',
            np.array(sequence_lengths).astype(self.output_token_len_dtype))
        outputs.append(output_token_len_tensor)
```
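For reference, `sequence_lengths` in the snippet above is assumed to already be available inside `execute()`; recent versions of the backend pass the generation lengths from the `tensorrt_llm` model to the postprocessing model as a `SEQUENCE_LENGTH` input tensor (verify the name against your own config.pbtxt). Reading it per request looks roughly like this:

```python
# Sketch only: read the per-request generation lengths inside execute(),
# assuming the postprocessing model declares a SEQUENCE_LENGTH input tensor.
import triton_python_backend_utils as pb_utils

def read_sequence_lengths(request):
    tensor = pb_utils.get_input_tensor_by_name(request, "SEQUENCE_LENGTH")
    return tensor.as_numpy()  # typically shaped [batch, beam_width]
```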

Then, we can modify ensemble/config.pbtxt, adding the new field both to the output section and to the ensemble scheduling step for the postprocessing model, as shown below:

```
output [
  {
    name: "output_token_len"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },

  ...

]

ensemble_scheduling {
  step [
    {
      model_name: "postprocessing"
      model_version: -1

      ...

      output_map {
        key: "OUTPUT_TOKEN_LEN"
        value: "output_token_len"
      }
    }
  ]
}
```
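A possible workaround for the missing input_token_len (untested sketch; the tensor name, dtype, and dims are assumptions, so check them against your preprocessing model's config.pbtxt) is to leave the postprocessing model alone and map the preprocessing model's REQUEST_INPUT_LEN output straight to a new ensemble output:

```
output [
  {
    # name/dtype/dims are assumptions; match them to your preprocessing config
    name: "input_token_len"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1

      ...

      output_map {
        key: "REQUEST_INPUT_LEN"
        value: "input_token_len"
      }
    }
  ]
}
```
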
MrD005 commented 3 months ago

You can use https://github.com/npuichigo/openai_trtllm; it is a wrapper that exposes an OpenAI-compatible API on top of TensorRT-LLM.