Description
The nv_inference_pending_request_count metric exported by tritonserver is incorrect in ensemble_stream mode. The ensemble_stream pipeline contains 3 steps: preprocess, fastertransformer, and postprocess.
I set up a single Triton backend instance to receive concurrent requests, that is, instance_group { count: 1 } on one GPU device. The nv_inference_pending_request_count metric obtained from the metrics endpoint shows that the pending_count of the fastertransformer_decouple model is always 0, while the pending_count of the postprocessing model exceeds the total number of returned responses.

Triton Information
I use version 23.09 of the Triton image to serve requests. The fastertransformer_backend is slightly modified from the open-source framework (https://github.com/triton-inference-server/fastertransformer_backend).
To Reproduce
Construct three models representing the preprocess, model-inference, and postprocess steps in ensemble mode. Preprocess and postprocess take very little time, while model inference takes a long time.
The scheduling-related config for each model is shown below.
ensemble_stream/config.pbtxt:
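A minimal sketch of what the scheduling-relevant part of this config looks like; the three step model names come from this report, but the tensor names, types, and dims are placeholders:

```
name: "ensemble_stream"
platform: "ensemble"
max_batch_size: 1
# Placeholder I/O; the real tensor names, types, and dims are assumptions.
input [
  { name: "INPUT", data_type: TYPE_STRING, dims: [ 1 ] }
]
output [
  { name: "OUTPUT", data_type: TYPE_STRING, dims: [ 1 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map { key: "QUERY" value: "INPUT" }
      output_map { key: "TOKENS" value: "_tokens" }
    },
    {
      model_name: "fastertransformer_decouple"
      model_version: -1
      input_map { key: "INPUT_IDS" value: "_tokens" }
      output_map { key: "OUTPUT_IDS" value: "_gen_ids" }
    },
    {
      model_name: "postprocessing"
      model_version: -1
      input_map { key: "GEN_IDS" value: "_gen_ids" }
      output_map { key: "RESULT" value: "OUTPUT" }
    }
  ]
}
```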
preprocessing/config.pbtxt:
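A sketch of the preprocessing config; the only point that matters for the reproduction is a single fast instance with no special scheduling. The backend choice and tensor names are assumptions:

```
name: "preprocessing"
backend: "python"   # assumption; any fast backend reproduces the issue
max_batch_size: 1
input [ { name: "QUERY", data_type: TYPE_STRING, dims: [ 1 ] } ]
output [ { name: "TOKENS", data_type: TYPE_INT32, dims: [ -1 ] } ]
instance_group [ { count: 1, kind: KIND_CPU } ]
```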
fastertransformer_decouple/config.pbtxt:
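This is the scheduling-critical config: a decoupled transaction policy (the model streams multiple responses per request) and a single GPU instance, so concurrent requests must queue behind it. The decoupled mode and count: 1 come from this report; the tensor names are placeholders:

```
name: "fastertransformer_decouple"
backend: "fastertransformer"
max_batch_size: 1
# Decoupled mode: one request may produce many streamed responses.
model_transaction_policy { decoupled: true }
input [ { name: "INPUT_IDS", data_type: TYPE_INT32, dims: [ -1 ] } ]
output [ { name: "OUTPUT_IDS", data_type: TYPE_INT32, dims: [ -1 ] } ]
# A single instance on one GPU device, as described above.
instance_group [ { count: 1, kind: KIND_GPU, gpus: [ 0 ] } ]
```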
postprocessing/config.pbtxt:
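And a sketch of the postprocessing config, again a fast single instance (backend and tensor names assumed):

```
name: "postprocessing"
backend: "python"   # assumption
max_batch_size: 1
input [ { name: "GEN_IDS", data_type: TYPE_INT32, dims: [ -1 ] } ]
output [ { name: "RESULT", data_type: TYPE_STRING, dims: [ 1 ] } ]
instance_group [ { count: 1, kind: KIND_CPU } ]
```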
Expected behavior
With a single Triton backend instance, the pending count of the time-consuming fastertransformer_decouple model should be 1 less than the number of concurrent requests (one request executing, the rest queued), while the pending counts of the fast preprocessing and postprocessing models should stay close to 0.