triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

A Confusion about prefetch #7282

Closed SunnyGhj closed 3 weeks ago

SunnyGhj commented 4 months ago

Description (screenshot of the linked line): https://github.com/triton-inference-server/core/blob/bbcd7816997046821f9d1a22e418acb84ca5364b/src/rate_limiter.cc#L208

I don't understand why the number of payloads that can be pre-fetched is limited to twice the number of model instances. I think this will block the processing of inference requests when many requests arrive at the same time.

Triton Information: nvcr.io/nvidia/tritonserver:23.10-py3

Are you using the Triton container or did you build it yourself? Triton container from nvcr.io/nvidia/tritonserver:23.10-py3

rmccorm4 commented 4 months ago

Hi @SunnyGhj, thanks for filing an issue.

Do you have any experiments or data showing this as a bottleneck? And have you tried modifying the code to see if it improves the throughput?

CC @tanmayv25 if you have any thoughts

tanmayv25 commented 4 months ago

> I don't understand why the number of payloads that can be pre-fetched is limited to twice the number of model instances. I think this will block the processing of inference requests when many requests arrive at the same time.

The number of model instances controls how many requests can be inside a TRITONBACKEND_ModelInstanceExecute call at once. There is a single backend thread per model instance, and each thread picks up one of the pre-fetched payloads. Payloads are pre-fetched so that the batching logic overlaps with inference execution on the backend threads: when a backend thread returns from executing its requests, a payload of already-batched requests is waiting for it. Limiting the payload count to 2 * num_model_instances ensures that, given the number of in-flight requests, enough payloads are ready that the backend threads do not block, and the factor of 2 adds headroom to make that more likely.
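For intuition, here is a minimal C++ sketch of that bound. It is not the Triton core code (the real logic is in the rate_limiter.cc line linked above); it only illustrates a staging queue capped at 2 * instance_count payloads, with the batcher blocking once the budget is full and each backend thread popping a ready payload.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Stand-in for a batched set of requests.
struct Payload {};

// Bounded staging queue: the batcher may keep at most 2 * instance_count
// payloads pre-fetched; backend threads pop one payload per execution.
class BoundedPrefetchQueue {
 public:
  explicit BoundedPrefetchQueue(std::size_t instance_count)
      : max_prefetched_(2 * instance_count) {}

  // Batcher thread: blocks while the prefetch budget is exhausted.
  void Push(Payload payload) {
    std::unique_lock<std::mutex> lk(mu_);
    cv_not_full_.wait(lk, [this] { return staged_.size() < max_prefetched_; });
    staged_.push_back(std::move(payload));
    cv_not_empty_.notify_one();
  }

  // Backend thread: returns a ready payload as soon as one is staged.
  Payload Pop() {
    std::unique_lock<std::mutex> lk(mu_);
    cv_not_empty_.wait(lk, [this] { return !staged_.empty(); });
    Payload p = std::move(staged_.front());
    staged_.pop_front();
    cv_not_full_.notify_one();  // lets the batcher stage the next payload
    return p;
  }

 private:
  const std::size_t max_prefetched_;  // 2 * number of model instances
  std::deque<Payload> staged_;
  std::mutex mu_;
  std::condition_variable cv_not_full_;
  std::condition_variable cv_not_empty_;
};
```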

We don't form payloads right away upon receiving requests, for the following reasons (a rough sketch of this deferred batching follows the list):

  1. Waiting to form payloads until they are needed increases the chances of forming a larger batch of requests.
  2. The queue policies can still be applied to requests that have not yet been pre-fetched.
  3. Having a limit ensures that we don't add CPU overhead from a continuously running batcher thread that keeps updating payloads based on the batching logic.
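As a rough illustration (again a sketch with assumed names, not Triton's implementation), the program below combines the BoundedPrefetchQueue sketch above with a batcher thread and one backend thread per model instance. Push() blocks once 2 * instance_count payloads are staged, so payload formation is deferred and the batcher sleeps rather than continuously rebuilding payloads, while each backend thread always finds a pre-fetched payload waiting when it finishes a fake execution.

```cpp
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

int main() {
  constexpr std::size_t kInstanceCount = 2;
  BoundedPrefetchQueue queue(kInstanceCount);  // budget = 2 * 2 = 4 payloads

  // Batcher thread: Push() blocks once 4 payloads are staged, so payload
  // formation is deferred and later-arriving requests could be folded into
  // larger batches instead of being turned into payloads immediately.
  std::thread batcher([&queue] {
    for (int i = 0; i < 16; ++i) {
      queue.Push(Payload{});
    }
  });

  // One backend thread per model instance, each standing in for
  // TRITONBACKEND_ModelInstanceExecute by sleeping briefly per payload.
  std::vector<std::thread> instances;
  for (std::size_t i = 0; i < kInstanceCount; ++i) {
    instances.emplace_back([&queue, i] {
      for (int n = 0; n < 8; ++n) {
        queue.Pop();  // a pre-fetched payload is already waiting
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        std::cout << "instance " << i << " finished payload " << n << "\n";
      }
    });
  }

  batcher.join();
  for (auto& t : instances) {
    t.join();
  }
  return 0;
}
```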

Let us know if you are observing otherwise and the backend threads are still starving for requests in your experiments/use case.

Tabrizian commented 3 weeks ago

Closing due to inactivity.