Improve async request rate for OpenAI HTTP client

triton-inference-server / client

Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.

BSD 3-Clause "New" or "Revised" License

551 stars 227 forks source link

Improve async request rate for OpenAI HTTP client #680

Closed dyastremsky closed 4 months ago

dyastremsky commented 4 months ago

Resolve resource contention among the mutexes in the OpenAI HTTP client. This speeds up the request rate so that it can asynchronously add requests to the queue with less blocking related to resources. Key changes:

Move from curl_multi_wait_ to _curl_multi_poll and _curl_multi_wakeup to avoid blocking call
Set timeout to polling to help scheduler prioritize transferring new requests if none have been picked up in a while
Add yield + sleep to help scheduler when responses are processed

Note that there is still some delay due to processing times and that can be eliminated by setting a really small timeout for polling for new requests, but that can waste cycles and reduce throughput. For now, I just set a max delay of 1 second to avoid edge cases and slightly reduce the delay.

Results:

Before changes

After changes

dyastremsky commented 4 months ago

I also tried using an atomic boolean to track if there are new requests rather than checking the data structure, and the speed seems similar. There is also a tricky hang there.

Using a lock-free structure (e.g. TBB's concurrent hashmap) might be useful for speeding these up further. However, those structures add complexity and new dependencies. I believe the closest we already use within Triton is Boost.

It might be worth looking at properly redesigning our code to work with Boost to use structures (like unordered flat map) for better performance. That would add the dependency for a very recent version of Boost (not available via apt, so it needs to be installed and downloaded... similar to the requirements of server). I tried to run a quick test to see how fast it would run, but it was taking a while to redesign the code to concurrently work with Boost, which uses a slightly different paradigm. I think it's worth the investigation but seems out of bounds for this ticket.

debermudez commented 4 months ago

well those results looks really nice. crazy how fast you turned this around. Nice job.

I am curious, but not gating this ticket, if you have a sense of how high PA can now push its requests/sec before we start to see requests getting delayed.

dyastremsky commented 4 months ago

Thanks! Going to respond tomorrow.

I think I need to re-test it on a fresh mind. I got the above results pretty consistently, and now I'm seeing the old (worst) results, so this might need to be revisited. Hopefully just some weird compilation or system usage thing.

dyastremsky commented 4 months ago

Spoke with Tim. I missed that I only fixed the OpenAI client so far, so of course the densenet results don't differ. It's also apparent in the screenshots above, as the OpenAI results have the warning before the changes but not after.

Going to modify this to a draft for now to gauge effort for fixing all of the clients in one PR versus multiple.

matthewkotila commented 4 months ago

@dyastremsky Spoke with Tim. I missed that I only fixed the OpenAI client so far, so of course the densenet results don't differ. It's also apparent in the screenshots above, as the OpenAI results have the warning before the changes but not after.

Going to modify this to a draft for now to gauge effort for fixing all of the clients in one PR versus multiple.

Thanks David, this sounds aligned with my comment here: https://github.com/triton-inference-server/client/commit/ac6e5f7b883dde76c705b0dce09091b20cf84cf8#r142537021

dyastremsky commented 4 months ago

@dyastremsky Spoke with Tim. I missed that I only fixed the OpenAI client so far, so of course the densenet results don't differ. It's also apparent in the screenshots above, as the OpenAI results have the warning before the changes but not after. Going to modify this to a draft for now to gauge effort for fixing all of the clients in one PR versus multiple.

Thanks David, this sounds aligned with my comment here: ac6e5f7#r142537021

Thanks, Matt! I appreciated you calling out that the scope of work needs to be increased earlier. I missed that the two repros were testing different clients.

I don't know whether this ticket will end up being just OpenAI or whether I'll create a separate PR for the larger HTTP client ticket. It depends on what the fix ends up being and how much similarity there is in the code. The code is very close, so it's possible it will just be one PR. Tim pointed out that this would be good in that we could use the HTTP client CI tests, which are abundant and have been around for longer than the newer OpenAI ones.

dyastremsky commented 4 months ago

Closing in favor of consolidating updates in this PR: https://github.com/triton-inference-server/client/pull/684