Hi @SunnyGhj, thanks for filing an issue.
Do you have any experiments or data showing this as a bottleneck? And have you tried modifying the code to see if it improves the throughput?
CC @tanmayv25 if you have any thoughts
I don't understand why the number of payloads that can be pre-fetched is limited to twice the number of model instances; I think it will block the processing of inference requests when many requests arrive at the same time.
The number of model instances controls how many requests can be inside a TRITONBACKEND_ModelInstanceExecute call at a time. We have a single backend thread per model instance which picks up one of the pre-fetched payloads. The payloads are pre-fetched to overlap the batching logic with the inference execution on the backend thread, so the thread can receive a payload with already-batched requests as soon as it returns from executing the previous one. The limit of (2 * num_model_instance) payloads ensures that, given the number of in-flight requests, there are enough payloads ready so the backend threads do not block; the factor of 2 adds headroom to make that more likely.
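To make the bounded prefetch concrete, here is a minimal, self-contained C++ sketch. This is not Triton's actual rate limiter; `Payload`, `BoundedPayloadQueue`, and `kNumInstances` are hypothetical names. A batcher thread forms payloads into a queue capped at 2 * num_model_instances, while one consumer thread per model instance pops payloads to "execute" them, so batching overlaps with execution without forming a payload for every queued request.

```cpp
// Sketch only (NOT Triton's implementation): a batcher pre-forms payloads
// into a queue bounded at 2 * kNumInstances, and one thread per model
// instance consumes them. All names here are hypothetical.
#include <chrono>
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

struct Payload {
  int id;  // stand-in for a batch of inference requests
};

class BoundedPayloadQueue {
 public:
  explicit BoundedPayloadQueue(size_t max_size) : max_size_(max_size) {}

  // Batcher blocks once the prefetch cap is reached.
  void Push(Payload p) {
    std::unique_lock<std::mutex> lk(mu_);
    not_full_.wait(lk, [&] { return queue_.size() < max_size_ || done_; });
    if (done_) return;
    queue_.push_back(std::move(p));
    not_empty_.notify_one();
  }

  // Backend-instance thread picks up the next ready payload.
  bool Pop(Payload* out) {
    std::unique_lock<std::mutex> lk(mu_);
    not_empty_.wait(lk, [&] { return !queue_.empty() || done_; });
    if (queue_.empty()) return false;
    *out = std::move(queue_.front());
    queue_.pop_front();
    not_full_.notify_one();
    return true;
  }

  void Shutdown() {
    std::lock_guard<std::mutex> lk(mu_);
    done_ = true;
    not_empty_.notify_all();
    not_full_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable not_empty_, not_full_;
  std::deque<Payload> queue_;
  size_t max_size_;
  bool done_ = false;
};

int main() {
  constexpr int kNumInstances = 2;
  // The prefetch cap discussed above: twice the model-instance count.
  BoundedPayloadQueue queue(2 * kNumInstances);

  std::vector<std::thread> instances;
  for (int i = 0; i < kNumInstances; ++i) {
    instances.emplace_back([&queue, i] {
      Payload p;
      while (queue.Pop(&p)) {
        // Stand-in for executing the batched requests in this payload.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        std::cout << "instance " << i << " executed payload " << p.id << "\n";
      }
    });
  }

  // Batcher: forms payloads ahead of time, but never more than the cap.
  for (int id = 0; id < 10; ++id) {
    queue.Push(Payload{id});
  }
  queue.Shutdown();
  for (auto& t : instances) t.join();
  return 0;
}
```

The real rate limiter is considerably more involved (payload lifecycle states, resource-based scheduling, etc.); this sketch only illustrates why a bound of 2 * num_model_instances keeps the backend threads fed without building payloads for every pending request.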
We don't form payloads right away upon receiving the requests, for the following reasons:
Let us know if you observe otherwise and the backend threads are still starving for requests in your experiments/use.
Closing due to inactivity.
Description: https://github.com/triton-inference-server/core/blob/bbcd7816997046821f9d1a22e418acb84ca5364b/src/rate_limiter.cc#L208
I don't understand why the number of payloads that can be pre-fetched is limited to twice the number of model instances; I think it will block the processing of inference requests when many requests arrive at the same time.
Triton Information: nvcr.io/nvidia/tritonserver:23.10-py3
Are you using the Triton container or did you build it yourself? Triton container from nvcr.io/nvidia/tritonserver:23.10-py3