triton-inference-server / client

Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.
BSD 3-Clause "New" or "Revised" License

Update default behavior for max threads PA async mode #737

Closed lkomali closed 1 week ago

lkomali commented 2 weeks ago

Currently, max_threads in PA defaults to 16 in async mode. When using genai-perf, the concurrency value can be passed as a CLI argument, but PA still uses a max_threads of 16 even when concurrency is set to a higher value. This change sets max_threads to the concurrency value if concurrency > 16. If concurrency <= 16, max_threads stays at the default of 16.
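The rule described above can be sketched in a few lines of Python. This is only an illustration of the intended behavior, not the code from this PR: the helper name resolve_max_threads and the constant DEFAULT_MAX_THREADS are hypothetical, and the sketch does not model a max_threads value passed explicitly by the user.

```python
# Hypothetical sketch of the default described in this PR; not the actual PA or genai-perf code.
DEFAULT_MAX_THREADS = 16  # async-mode default stated in the PR description


def resolve_max_threads(concurrency: int, default: int = DEFAULT_MAX_THREADS) -> int:
    """Return the thread cap to use for a given concurrency in async mode.

    If concurrency exceeds the default, the cap follows concurrency;
    otherwise the default is kept.
    """
    if concurrency > default:
        return concurrency
    return default


if __name__ == "__main__":
    for c in (1, 16, 17, 64):
        print(f"concurrency={c} -> max_threads={resolve_max_threads(c)}")
```

The rule is equivalent to max(concurrency, 16); the explicit branch is kept here only to mirror the wording of the description.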

dyastremsky commented 2 weeks ago

Based on a comment from @nv-hwoo, it sounds like this fixes async mode, whereas sync mode already had similar behavior. If so, can you please specify that in the PR title and description?