triton-inference-server / client

Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.
BSD 3-Clause "New" or "Revised" License

Fast PA teardown #674

Open tgerdesnv opened 4 months ago

tgerdesnv commented 4 months ago

Adds the ability for PA to exit more quickly.

The old behavior was that PA always waited for all in-flight requests to finish before exiting. With async request-rate mode and a slow model, this could add up to many minutes of waiting.

The new behavior is that, in non-shared-memory cases, PA exits immediately and drops any remaining requests on the floor.

When PA is sweeping through multiple values (e.g. request-rate 10:20:10), it WILL still wait for all requests from one experiment to finish before moving to the next step.
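The exit rules above can be sketched as follows. This is a minimal illustration, not Perf Analyzer's actual internals; all names (`run_sweep`, `wait_for_inflight`, etc.) are hypothetical.

```python
# Hypothetical sketch of the teardown rules described above.
# All names here are illustrative, not PA's real API.

def run_sweep(request_rates, uses_shared_memory, run_experiment, wait_for_inflight):
    """Run one experiment per request rate.

    Between sweep steps PA still waits for all outstanding requests, so one
    step's stragglers cannot skew the next step's measurements. Only after
    the final step may PA exit immediately, and only when no shared memory
    is in use, since registered regions must drain before teardown.
    """
    for i, rate in enumerate(request_rates):
        run_experiment(rate)
        is_last = (i == len(request_rates) - 1)
        if not is_last or uses_shared_memory:
            wait_for_inflight()
        # else: drop remaining requests on the floor and exit fast


waits = []
run_sweep(
    request_rates=[10, 20],  # e.g. a request-rate 10:20:10 sweep
    uses_shared_memory=False,
    run_experiment=lambda rate: None,
    wait_for_inflight=lambda: waits.append(True),
)
# Only the non-final step waited; the final step exited immediately.
assert waits == [True]
```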

One messy aspect is that the LoadManager now needs to remember whether shared memory is in use. It was already using that information, so this isn't the end of the world.
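A minimal sketch of what that bookkeeping might look like; the class and method names are assumed for illustration and do not mirror PA's C++ LoadManager.

```python
# Hypothetical sketch: a load manager that records its shared-memory mode
# so teardown can decide whether it must wait for in-flight requests.

class LoadManager:
    def __init__(self, shared_memory_type="none"):
        # Mirrors the idea of PA's shared-memory modes: none, system, or cuda.
        self._shared_memory_type = shared_memory_type

    def using_shared_memory(self):
        return self._shared_memory_type != "none"

    def must_wait_on_teardown(self):
        # Outstanding requests reference registered shared-memory regions,
        # so they must finish before those regions can be torn down.
        return self.using_shared_memory()


# Non-shared-memory runs can exit immediately; shared-memory runs cannot.
assert not LoadManager().must_wait_on_teardown()
assert LoadManager("cuda").must_wait_on_teardown()
```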

One important note: running PA twice back to back may produce lower numbers on the second run, because the server may still be draining the abandoned requests from the first run.

Here is a before/after comparison. Note that both runs include a change (which won't be part of this story) to make sure that request-rate can actually issue the requested rate.

Before: (screenshot: faster_teardown_without_changes)

After: (screenshot: faster_teardown_with_changes)