triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

The input dimensions received by subsequent nodes in ensemble mode are incorrect #7383

Closed SeibertronSS closed 1 month ago

SeibertronSS commented 2 months ago

I built an LLM inference ensemble topology consisting of preprocessing, inference, and postprocessing. On each step the inference node outputs only the latest token_id to the postprocessing node, but sometimes the postprocessing node receives many token_ids at once, for example:

[8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 99662, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 99662, 99808, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 99662, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 99662, 99808, 99219, 9909]

When I request the inference node alone, I don't see a similar response. It looks very much like memory being duplicated: the dimension of the token_id tensor received by postprocessing doubles with each iteration of the model, so it would eventually grow into a token_id tensor with billions of elements.
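The issue does not include the actual model configuration, but a three-stage ensemble of this shape is typically wired together in a config.pbtxt like the sketch below. All model and tensor names here are hypothetical, and a streaming LLM pipeline would normally also mark the inference model itself as decoupled (model_transaction_policy { decoupled: true } in that model's own config).

```
name: "llm_ensemble"
platform: "ensemble"
max_batch_size: 1
input [
  { name: "TEXT", data_type: TYPE_STRING, dims: [ 1 ] }
]
output [
  { name: "OUTPUT_TEXT", data_type: TYPE_STRING, dims: [ 1 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map { key: "INPUT_TEXT" value: "TEXT" }
      output_map { key: "INPUT_IDS" value: "token_ids_in" }
    },
    {
      model_name: "inference"
      model_version: -1
      input_map { key: "INPUT_IDS" value: "token_ids_in" }
      output_map { key: "OUTPUT_ID" value: "token_id_out" }
    },
    {
      model_name: "postprocessing"
      model_version: -1
      input_map { key: "TOKEN_ID" value: "token_id_out" }
      output_map { key: "TEXT_OUT" value: "OUTPUT_TEXT" }
    }
  ]
}
```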

statiraju commented 1 month ago

Kindly provide the following information:

Description: A clear and concise description of what the bug is.

Triton Information: What version of Triton are you using? Are you using the Triton container or did you build it yourself?

To Reproduce: Steps to reproduce the behavior. Describe the models (framework, inputs, outputs), and ideally include the model configuration file (if using an ensemble, include the model configuration file for that as well).

Expected behavior: A clear and concise description of what you expected to happen.

SeibertronSS commented 1 month ago

I found out that the cause was a bug in my custom backend.
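The issue does not say what the bug was, but the reported symptom (each downstream response carrying the whole, ever-growing token history) is characteristic of a backend that appends each new token to a persistent buffer and ships the entire buffer every step. The following is a hypothetical sketch of that failure mode only, written against Triton's Python backend for brevity; the actual backend, its language, and the helper `_generate_next_id` are all assumptions, and a real per-token inference node would typically also be decoupled.

```python
# Hypothetical illustration only: the issue does not show the real backend.
# This sketches an accumulation pattern that would produce the symptom
# described above, using Triton's Python backend for brevity.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # BUG: state that persists across execute() calls and is never
        # cleared, so every response carries all previously sent token_ids.
        self.generated_ids = []

    def execute(self, requests):
        responses = []
        for request in requests:
            new_id = self._generate_next_id(request)
            self.generated_ids.append(new_id)

            # Buggy: ships the entire accumulated history every step.
            out = pb_utils.Tensor(
                "OUTPUT_ID", np.array(self.generated_ids, dtype=np.int64))
            # Fix: send only the newly generated token instead, e.g.
            #   np.array([new_id], dtype=np.int64)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def _generate_next_id(self, request):
        # Stand-in for the real decoding step (hypothetical).
        ids = pb_utils.get_input_tensor_by_name(
            request, "INPUT_IDS").as_numpy()
        return int(ids.reshape(-1)[-1]) + 1
```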