triton-inference-server / client

Triton Python, C++ and Java client libraries, and gRPC-generated client examples for Go, Java and Scala.
BSD 3-Clause "New" or "Revised" License

fix: Fix Client Http Async Code for Request Rate #684

Closed. nnshah1 closed this 3 months ago.

nnshah1 commented 4 months ago

Changed to use curl_multi_poll and curl_multi_wakeup so that the AsyncTransferThread can pick up new requests while continuing to service requests that are already in flight.
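libcurl's documentation describes curl_multi_wakeup as waking a curl_multi_poll call that is currently waiting or about to wait, so a wakeup sent before the transfer thread reaches its poll is not lost. The sketch below models only that latching wakeup with a condition variable. `PollWaker`, `Poll`, and `Wakeup` are hypothetical names, and unlike the real curl_multi_poll it does not watch any transfer sockets.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the curl_multi_poll / curl_multi_wakeup pair.
// Only the wakeup path is modelled; no sockets are watched.
class PollWaker {
 public:
  // Analogue of curl_multi_wakeup: safe to call from any thread.
  void Wakeup() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      woken_ = true;  // latch, so a wakeup sent before Poll() is not lost
    }
    cv_.notify_one();
  }

  // Analogue of curl_multi_poll: blocks until Wakeup() is called or the
  // timeout expires. Returns true if woken, false on timeout.
  bool Poll(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    bool woken = cv_.wait_for(lock, timeout, [this] { return woken_; });
    woken_ = false;  // consume the wakeup
    return woken;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool woken_ = false;
};
```

If `Wakeup()` is called before `Poll()`, the first `Poll()` returns true immediately because the flag is latched; a second `Poll()` with no new wakeup waits for its timeout and returns false.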

1) The load generation thread adds requests to a mutex-protected map.
2) The load generation thread calls curl_multi_wakeup.
3) The AsyncTransferThread adds the new requests to multi_handle.
4) The AsyncTransferThread performs reads and writes for active requests via curl_multi_perform.
5) The AsyncTransferThread processes completed requests via curl_multi_info_read.
6) The AsyncTransferThread waits for new data, new requests, or an exit signal using curl_multi_poll.
7) When exiting, the AsyncTransferThread removes the requests it added to multi_handle and releases them.

Note: the load generation thread only initializes multi_handle and calls curl_multi_wakeup on it; all other operations on multi_handle are done in the AsyncTransferThread.
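The seven steps and the thread-ownership rule can be modelled without libcurl. This is a minimal sketch under stated assumptions, not the client's code: `AsyncTransferModel`, `Request`, `Submit`, and `Shutdown` are hypothetical names; a condition variable with a latched `woken_` flag stands in for curl_multi_wakeup and curl_multi_poll; a countdown stands in for curl_multi_perform; and the retire loop stands in for curl_multi_info_read. Only the worker thread touches `active`, mirroring the rule that only the AsyncTransferThread operates on multi_handle.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical request record; the real client would hold a curl easy handle.
struct Request {
  uint64_t id;
  int bytes_left;  // simulated work, counted down by the transfer loop
};

class AsyncTransferModel {
 public:
  AsyncTransferModel() : worker_([this] { AsyncTransfer(); }) {}
  ~AsyncTransferModel() { Shutdown(); }

  // Steps 1 and 2, called from the load generation thread: stage the request
  // in the mutex-protected map, then wake the transfer thread.
  void Submit(uint64_t id, int bytes) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      pending_[id] = Request{id, bytes};
      woken_ = true;  // stands in for curl_multi_wakeup
    }
    cv_.notify_one();
  }

  // Raises the exit signal and waits for the transfer thread to finish.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      exiting_ = true;
      woken_ = true;
    }
    cv_.notify_one();
    if (worker_.joinable()) worker_.join();
  }

  std::vector<uint64_t> Completed() {
    std::lock_guard<std::mutex> lock(mu_);
    return completed_;
  }

 private:
  void AsyncTransfer() {
    // Plays the role of multi_handle: only this thread ever touches it.
    std::map<uint64_t, Request> active;
    while (true) {
      {
        std::lock_guard<std::mutex> lock(mu_);
        if (exiting_) break;
        // Step 3: move newly staged requests into the active set.
        active.insert(pending_.begin(), pending_.end());
        pending_.clear();
      }
      // Step 4: advance every active transfer (stands in for curl_multi_perform).
      for (auto& entry : active) --entry.second.bytes_left;
      // Step 5: retire finished transfers (stands in for curl_multi_info_read).
      for (auto it = active.begin(); it != active.end();) {
        if (it->second.bytes_left <= 0) {
          std::lock_guard<std::mutex> lock(mu_);
          completed_.push_back(it->first);
          it = active.erase(it);
        } else {
          ++it;
        }
      }
      // Step 6: sleep until woken, or only briefly while transfers are in
      // flight (stands in for curl_multi_poll). The latched flag means a
      // wakeup sent since step 3 makes this return immediately.
      std::unique_lock<std::mutex> lock(mu_);
      const auto timeout = active.empty() ? std::chrono::milliseconds(1000)
                                          : std::chrono::milliseconds(1);
      cv_.wait_for(lock, timeout, [this] { return woken_; });
      woken_ = false;
    }
    // Step 7: release anything still in flight.
    active.clear();
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::map<uint64_t, Request> pending_;
  std::vector<uint64_t> completed_;
  bool woken_ = false;
  bool exiting_ = false;
  std::thread worker_;  // declared last so the members above exist before it starts
};
```

A caller would construct the model, call `Submit` from the load generation thread, and call `Shutdown` (or let the destructor run) to trigger step 7. Keeping `active` local to the worker is what removes the need to lock around the per-transfer work; the mutex only guards the hand-off map and the completion list.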

--

Similar changes were added to the HTTP client (#686 and #691). Because async requests have more variable latency, especially at higher request rates, stabilization no longer requires latency to stabilize; instead, PA (perf_analyzer) now prints an error if latency does not stabilize (#688).
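The stabilization change can be pictured as deciding stability from the remaining criteria and only flagging unstable latency for reporting. This is a hypothetical sketch, not PA's implementation: `CheckStability`, `WithinPercent`, `StabilityResult`, the "every value within ±pct of the window mean" criterion, and the assumption that throughput is the remaining criterion are all illustrative choices.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical criterion: every value lies within `pct` percent of the mean.
bool WithinPercent(const std::vector<double>& values, double pct) {
  if (values.empty()) return false;
  double mean = 0.0;
  for (double v : values) mean += v;
  mean /= static_cast<double>(values.size());
  for (double v : values) {
    if (std::fabs(v - mean) > mean * pct / 100.0) return false;
  }
  return true;
}

struct StabilityResult {
  bool stable;            // decided by throughput alone
  bool latency_unstable;  // reported as an error, but no longer blocks stability
};

StabilityResult CheckStability(const std::vector<double>& throughput,
                               const std::vector<double>& latency_us,
                               double pct) {
  StabilityResult result{WithinPercent(throughput, pct), false};
  result.latency_unstable = !WithinPercent(latency_us, pct);
  return result;
}
```

Under these assumptions, a window with steady throughput but jittery latency is reported as stable with `latency_unstable` set, and the caller would print the error instead of continuing to wait for latency to settle.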