Torchserve Workflow Fails at Medium QPS

pytorch / serve

Serve, optimize and scale PyTorch models in production

https://pytorch.org/serve/

Apache License 2.0

Torchserve Workflow Fails at Medium QPS #1581

Open mossaab0 opened 2 years ago

mossaab0 commented 2 years ago

We have 2 ONNX models deployed on a GPU machine, built on top of the nightly Docker image.

The first model runs with 0 failures at 500 QPS (p99 latency < 8ms) during a 2-hour perf test.
The second model runs with 0 failures at 500 QPS (p99 latency < 11ms) during a 2-hour perf test, with some improvement in p99 latency (< 9ms) at a reduced QPS of 400.
When I try a sequential workflow that starts with the first model and, in ~1% of cases, triggers the second model, the machine becomes unresponsive after a few minutes at 100 QPS, causing the perf test to fail. After a few hours, I accidentally discovered that the machine had become responsive again (I don't know exactly when, though).
Running this same workflow at only 20 QPS, the perf test succeeds for a duration of 24 hours (with only 52 failures).

I suspect there is a delay in releasing resources that becomes an issue only at higher QPS (these resources are eventually released, bringing the machine back to life).

maaquib commented 2 years ago

@mossaab0 What version of TS are you using? Can you try building TS from source and let me know if it still fails? I suspect this is the same issue for which I pushed fix #1552 (will be added to the next release).

mossaab0 commented 2 years ago

@maaquib This is based on torchserve-nightly:gpu-2022.04.13 which already includes the #1552 fix. Before the fix, even 20 QPS was failing.

maaquib commented 2 years ago

@mossaab0 If you can provide some reproduction steps, I can try to root-cause this.

mossaab0 commented 2 years ago

@maaquib It is a bit difficult to provide more reproduction steps, as that would basically mean sharing the models. But here is something you can try (which I haven't tried myself). Figure out the maximum QPS that a GPU node can handle for the cat / dog classifier (for a couple of hours). Then run a perf test at half of that QPS using the sequential workflow (i.e., including the dog breeds model) for a couple of hours. I expect the second perf test to fail.
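The two-step perf test described above could be driven by a small fixed-rate load generator. The following is only a minimal sketch: `run_load` and its `send` argument are hypothetical names, and `send` stands for whatever function issues one inference request against the model or the workflow and raises an exception on failure.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send, qps, duration_s, max_workers=32):
    """Call send() at a roughly constant rate and count failed calls.

    `send` is a caller-supplied (hypothetical) function that issues one
    request and raises an exception if the request fails.
    """
    total = round(qps * duration_s)
    interval = 1.0 / qps
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for i in range(total):
            # Sleep until this request's scheduled send time.
            delay = start + i * interval - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            futures.append(pool.submit(send))
        # exception() waits for each call to finish before inspecting it.
        failures = sum(1 for f in futures if f.exception() is not None)
    return total, failures
```

Running it once against the single classifier to find the highest rate with zero failures, and then against the sequential workflow at half that rate, would follow the procedure suggested here. If requests are slow, `max_workers` caps the number in flight, so the achieved rate can fall below `qps`.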

msaroufim commented 2 years ago

Hi @mossaab0, we've discussed this internally. We're in the process of redesigning how workflows work, to make it possible to define a DAG within your handler file in Python.

It should be possible to take an existing sequential or parallel workflow and refactor it into a new nn.Module or handler.py. Please ping me if you need any advice on how to do this.
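As a rough illustration of what such a refactor could look like, here is a minimal sketch that chains the two models inside one handler-style class instead of a workflow. Everything in it is an assumption made for illustration: the class name, the `should_escalate` predicate, and the stand-in models are hypothetical. A real handler.py would typically subclass TorchServe's `BaseHandler` and load both models in `initialize`, which is omitted here so the sketch runs on its own.

```python
class SequentialPipelineHandler:
    """Hypothetical handler that runs two models sequentially in one process."""

    def __init__(self, first_model, second_model, should_escalate):
        # All three arguments are plain callables (assumed for this sketch).
        self.first_model = first_model
        self.second_model = second_model
        self.should_escalate = should_escalate

    def handle(self, data, context=None):
        """Return one result per input item, in the same order as the batch."""
        results = []
        for item in data:
            first_out = self.first_model(item)
            second_out = None
            # Only the small fraction of items that need it reach the second model.
            if self.should_escalate(first_out):
                second_out = self.second_model(item)
            results.append({"first": first_out, "second": second_out})
        return results


# Stand-in models for the cat / dog and dog breeds example (not real models).
handler = SequentialPipelineHandler(
    first_model=lambda x: "dog" if x > 0 else "cat",
    second_model=lambda x: f"breed-{x}",
    should_escalate=lambda label: label == "dog",
)
outputs = handler.handle([1, -1])
```

Because both models are called from the same handler, the conditional second step no longer goes through the workflow layer. Whether that avoids the failures reported in this issue has not been verified here.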

jonhilgart22 commented 2 years ago

> Hi @mossaab0, we've discussed this internally. We're in the process of redesigning how workflows work, to make it possible to define a DAG within your handler file in Python.
>
> It should be possible to take an existing sequential or parallel workflow and refactor it into a new nn.Module or handler.py. Please ping me if you need any advice on how to do this.

I'm also running into this. Any pointers to what the refactor would look like?