pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Achieving high throughput without increasing number of workers #2071

Open goha111 opened 1 year ago

goha111 commented 1 year ago

🚀 The feature

This is probably more like a discussion.

In a production PyTorch serving system, one machine will serve many models (say, at least 10 heavy models). The models are very large, so ideally each should be loaded into (CPU/GPU) memory only once.

Now we have the problem:

So I have two questions related to this:

Motivation, pitch

Achieving high throughput in a production serving scenario.

Alternatives

No response

Additional context

No response

msaroufim commented 1 year ago

OK, so this question doesn't have a short answer, so here we go.

The easiest ways to increase the throughput of a model are to increase the number of workers and to increase the batch size.

Both of those come with diminishing returns: at some point your throughput will actually decrease, so you need to benchmark and see what works. That's why we have a benchmarking tool, https://github.com/pytorch/serve/tree/master/benchmarks, which we also run daily in CI ourselves.
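For reference, both knobs can be set through the management API when registering a model. Here's a minimal sketch (model name, archive name, and values are placeholders; benchmark to find what actually works for your workload):

```python
# Hypothetical example: tune batch size and worker count for a model served
# by TorchServe's management API (default port 8081). All values below are
# illustrative -- both knobs hit diminishing returns, so measure throughput.
import requests

MGMT = "http://localhost:8081"

# Register the archive with server-side batching enabled.
resp = requests.post(
    f"{MGMT}/models",
    params={
        "url": "my_model.mar",     # model archive in the model store
        "batch_size": 8,           # requests aggregated per inference call
        "max_batch_delay": 50,     # ms to wait while filling a batch
        "initial_workers": 2,      # worker processes started immediately
    },
)
print(resp.status_code, resp.text)

# Scale workers up or down later without re-registering.
resp = requests.put(f"{MGMT}/models/my_model", params={"min_worker": 4})
print(resp.status_code, resp.text)
```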

You also specifically asked about sharing models; this is something @mreso worked on, with prototypes for a C++ frontend and a torch.deploy integration, so I'm tagging him to share more insight on this point.

The next set of optimizations comes from using an optimized ML runtime. For this we have options for TensorRT, ORT, FasterTransformer, Better Transformer, and torch.compile; you can see what works for you here: https://github.com/pytorch/serve/releases. Typically all of these runtimes optimize models by fusing operations, so you're minimizing data transfers to the GPU.
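As one concrete sketch of that, you can try torch.compile inside a custom handler's initialize. This is illustrative only (the handler name is made up, and torch.compile requires PyTorch 2.x); measure before and after, since fused kernels mostly help when you're GPU-bound:

```python
# Minimal sketch of wrapping the loaded model with torch.compile in a
# custom TorchServe handler. Assumes PyTorch 2.x; class name is illustrative.
import torch
from ts.torch_handler.base_handler import BaseHandler

class CompiledHandler(BaseHandler):
    def initialize(self, context):
        super().initialize(context)              # loads self.model onto self.device
        self.model.eval()
        self.model = torch.compile(self.model)   # graph capture + kernel fusion

    def inference(self, data, *args, **kwargs):
        with torch.inference_mode():
            return self.model(data.to(self.device))
```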

Another way of minimizing data transfers to the GPU is to debug your preprocessing pipeline or to leverage an optimized solution like DALI: https://github.com/pytorch/serve/pull/1958
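Even without DALI, one common fix is to keep image batches as uint8 on the host and only convert to float after the copy, so you transfer a quarter of the bytes. A small sketch, assuming a CUDA device and placeholder shapes/normalization constants:

```python
# Illustrative preprocess step: copy the compact uint8 tensor to the GPU,
# then normalize on-device. Values and shapes are placeholders.
import torch

MEAN = torch.tensor([0.485, 0.456, 0.406], device="cuda").view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225], device="cuda").view(1, 3, 1, 1)

def preprocess(batch_uint8: torch.Tensor) -> torch.Tensor:
    # batch_uint8: (N, 3, H, W) uint8 on CPU -- 4x smaller than float32
    x = batch_uint8.to("cuda", non_blocking=True)
    x = x.float().div_(255.0)
    return (x - MEAN) / STD
```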

Then of course there are the more sciency optimizations, like using a smaller model, quantization, distillation, etc., that are easy but would require some back and forth with your science team.
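For example, post-training dynamic quantization is a one-liner to try (the toy model below is a placeholder; you'd still want an accuracy check with your science team before shipping):

```python
# Sketch of dynamic quantization: Linear weights stored as int8, activations
# quantized on the fly at runtime (CPU inference). Model is a placeholder.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# Package the quantized weights into a .mar as usual before serving.
```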

I will say I wish there were an easy switch to get ideal inference performance. There are a lot of tools, but you can go a long way by running some simple benchmarks and profiling your model for actual bottlenecks. It might also help to scan the top posts here, https://github.com/pytorch/serve#-news, and see if anything seems particularly relevant to your workload.

Also, if you'd like to meet and discuss your workload, I'm happy to arrange a quick call where we can give you more tailored advice.

mreso commented 1 year ago

Hi, Mark already made some great points. Just want to chime in and stress the point that it is very important to first figure out what your bottleneck actually is before you go and apply optimizations.

For starters, you could look at your GPU utilization. nvidia-smi is sometimes not the best tool for this, as it only gives you a rough measure of the fraction of time a kernel was running on one of the SMs, not how many cores the kernel actually consumed. Better tools for this are usually profilers like Nsight Compute.
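A quick first pass can also be done with the built-in PyTorch profiler to see whether time goes to CPU-side work or GPU kernels. A minimal sketch with a placeholder model and input, assuming a CUDA device:

```python
# Profile a few forward passes and compare CPU vs CUDA time per op.
# Model and input are placeholders for your actual workload.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda().eval()
x = torch.randn(64, 1024, device="cuda")

with torch.inference_mode():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(20):
            model(x)

# Sort by CUDA time to see which kernels dominate, then compare against CPU time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```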