triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

nv_inference_pending_request_count metric exported in 23.09 container is incorrect #6494

Open hxer7963 opened 11 months ago

hxer7963 commented 11 months ago

Description The nv_inference_pending_request_count metric exported by tritonserver is incorrect in ensemble_stream mode.

The ensemble_stream pipeline contains 3 steps: preprocess, fastertransformer, and postprocess.

I set up a single Triton backend instance to receive concurrent requests, i.e. instance_group { count: 1 } on one GPU device. The nv_inference_pending_request_count metric obtained from the metrics endpoint shows that the pending count of the fastertransformer_decouple model is always 0, while the pending count of the postprocessing model exceeds the total number of returned responses.
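For reference, the metric was read from the Prometheus-format metrics endpoint. A minimal sketch of how to do that, assuming the default metrics port 8002 on localhost (host, port, and output formatting are not from the original report):

# Scrape nv_inference_pending_request_count from the Triton metrics
# endpoint (default port 8002) and print the value per model.
import re
import requests

METRICS_URL = "http://localhost:8002/metrics"  # assumed default metrics port

def pending_request_counts():
    counts = {}
    for line in requests.get(METRICS_URL, timeout=5).text.splitlines():
        if not line.startswith("nv_inference_pending_request_count{"):
            continue
        model = re.search(r'model="([^"]+)"', line).group(1)
        counts[model] = float(line.rsplit(" ", 1)[-1])
    return counts

if __name__ == "__main__":
    for model, value in pending_request_counts().items():
        print(f"{model}: pending={value}")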

Triton Information I use version 23.09 of the Triton image to serve requests. fastertransformer_backend is slightly modified from the open-source framework (https://github.com/triton-inference-server/fastertransformer_backend).

To Reproduce You can construct three models to represent the three steps of preprocessing, model inference, and postprocessing in ensemble mode. Preprocessing and postprocessing take very little time, while model inference takes a long time.

The following are the scheduling-related parts of each model's config.pbtxt.

ensemble_stream/config.pbtxt:

name: "ensemble_stream"
platform: "ensemble"
max_batch_size: 2
...
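(The ensemble_scheduling section is elided above. The sketch below only illustrates its general shape; the step order matches the report, but all tensor and key names are made up for illustration and are not taken from the original configuration.)

ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map { key: "TEXT" value: "TEXT" }
      output_map { key: "INPUT_IDS" value: "preprocessed_ids" }
    },
    {
      model_name: "fastertransformer_decouple"
      model_version: -1
      input_map { key: "INPUT_IDS" value: "preprocessed_ids" }
      output_map { key: "OUTPUT_IDS" value: "generated_ids" }
    },
    {
      model_name: "postprocessing"
      model_version: -1
      input_map { key: "OUTPUT_IDS" value: "generated_ids" }
      output_map { key: "OUTPUT_TEXT" value: "OUTPUT_TEXT" }
    }
  ]
}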

preprocessing/config.pbtxt

name: "preprocessing"
backend: "python"
max_batch_size: 2
...
instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]
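A fast pass-through Python backend is enough to stand in for the preprocessing and postprocessing steps when reproducing. A minimal model.py sketch, assuming the tensors are simply named "INPUT" and "OUTPUT" (the real pipeline's tensor names and logic will differ):

# model.py -- minimal pass-through Python backend used only to reproduce a
# fast pre/post step; tensor names must match the model's config.pbtxt.
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            out_tensor = pb_utils.Tensor("OUTPUT", in_tensor.as_numpy())
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses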

fastertransformer_decouple/config.pbtxt

name: "fastertransformer_decouple"
backend: "fastertransformer"
default_model_filename: "llama"
max_batch_size: 2

model_transaction_policy {
  decoupled: True
}
...
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

postprocessing/config.pbtxt

name: "postprocessing"
backend: "python"
max_batch_size: 2
...
instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]

Expected behavior The expected result is that, with a single backend instance per model, the pending count of the time-consuming fastertransformer_decouple model is 1 less than the number of concurrent requests, while the pending counts of the fast preprocessing and postprocessing models stay close to 0.
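To generate the concurrent load described above, a gRPC streaming client is needed because the ensemble contains a decoupled model. A rough sketch, where the server URL, input tensor name/shape/dtype, and concurrency level are assumptions for illustration:

# Submit several concurrent requests to the ensemble over a gRPC stream so
# they queue behind the single fastertransformer_decouple instance.
import queue
import numpy as np
import tritonclient.grpc as grpcclient

CONCURRENCY = 8
completions = queue.Queue()

def callback(result, error):
    completions.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=callback)

for i in range(CONCURRENCY):
    data = np.array([["hello"]], dtype=object)  # shape [1, 1], TYPE_STRING input
    inp = grpcclient.InferInput("TEXT", list(data.shape), "BYTES")
    inp.set_data_from_numpy(data)
    client.async_stream_infer("ensemble_stream", [inp], request_id=str(i))

# While these requests are in flight, the expectation is that
# nv_inference_pending_request_count for fastertransformer_decouple
# reports CONCURRENCY - 1 and the pre/post models report ~0.
client.stop_stream()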

dafu-wu commented 4 months ago

Hi @rmccorm4, I encountered the same problem: even with very high stress-test concurrency, nv_inference_pending_request_count was still 0.

dwq370 commented 2 months ago

I meet the same problem with tensorrtllm_backend. Any solutions?