pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Achieving high throughput without increasing number of workers #2071

Open goha111 opened 1 year ago

goha111 commented 1 year ago

🚀 The feature

This is probably more like a discussion.

In a production PyTorch serving system, one machine will serve many models (say, at least 10 heavy models). The models are very large, so ideally each should be loaded into (CPU/GPU) memory only once.

Now we have the problem:

So I have two questions related to this:

Motivation, pitch

Achieving high throughput in a production serving scenario.

Alternatives

No response

Additional context

No response

msaroufim commented 1 year ago

OK, so this question doesn't have a short answer, so here we go.

The easiest ways to increase the throughput of a model are to increase the number of workers and to increase the batch size.

Both of those come with diminishing returns: at some point your throughput will actually decrease, so you need to benchmark and see what works. That's why we have a benchmarking tool, https://github.com/pytorch/serve/tree/master/benchmarks, which we also run daily in CI ourselves.
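For reference, both knobs can be set through the management API when registering a model. Here's a minimal sketch (model name, archive name, and values are placeholders; benchmark to find what actually works for your workload):

```python
# Hypothetical example: tune batch size and worker count for a model served
# by TorchServe's management API (default port 8081). All values below are
# illustrative -- both knobs hit diminishing returns, so measure throughput.
import requests

MGMT = "http://localhost:8081"

# Register the archive with server-side batching enabled.
resp = requests.post(
    f"{MGMT}/models",
    params={
        "url": "my_model.mar",     # model archive in the model store
        "batch_size": 8,           # requests aggregated per inference call
        "max_batch_delay": 50,     # ms to wait while filling a batch
        "initial_workers": 2,      # worker processes started immediately
    },
)
print(resp.status_code, resp.text)

# Scale workers up or down later without re-registering.
resp = requests.put(f"{MGMT}/models/my_model", params={"min_worker": 4})
print(resp.status_code, resp.text)
```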

You also specifically asked about sharing models; this is something @mreso worked on, with prototypes for a C++ frontend and a torch.deploy integration, so I'm tagging him to share more insight on this point.

The next set of optimizations comes from using an optimized ML runtime. For this we have options for TensorRT, ORT, FasterTransformer, Better Transformer, and torch.compile; you can see what works for you here: https://github.com/pytorch/serve/releases. Typically all of these runtimes optimize models by fusing operations, so you're minimizing data transfers to the GPU.
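As one concrete sketch of that, you can try torch.compile inside a custom handler's initialize. This is illustrative only (the handler name is made up, and torch.compile requires PyTorch 2.x); measure before and after, since fused kernels mostly help when you're GPU-bound:

```python
# Minimal sketch of wrapping the loaded model with torch.compile in a
# custom TorchServe handler. Assumes PyTorch 2.x; class name is illustrative.
import torch
from ts.torch_handler.base_handler import BaseHandler

class CompiledHandler(BaseHandler):
    def initialize(self, context):
        super().initialize(context)              # loads self.model onto self.device
        self.model.eval()
        self.model = torch.compile(self.model)   # graph capture + kernel fusion

    def inference(self, data, *args, **kwargs):
        with torch.inference_mode():
            return self.model(data.to(self.device))
```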

Another way of minimizing data transfers to the GPU is to debug your preprocessing pipeline or to leverage an optimized solution like DALI: https://github.com/pytorch/serve/pull/1958
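Even without DALI, one common fix is to keep image batches as uint8 on the host and only convert to float after the copy, so you transfer a quarter of the bytes. A small sketch, assuming a CUDA device and placeholder shapes/normalization constants:

```python
# Illustrative preprocess step: copy the compact uint8 tensor to the GPU,
# then normalize on-device. Values and shapes are placeholders.
import torch

MEAN = torch.tensor([0.485, 0.456, 0.406], device="cuda").view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225], device="cuda").view(1, 3, 1, 1)

def preprocess(batch_uint8: torch.Tensor) -> torch.Tensor:
    # batch_uint8: (N, 3, H, W) uint8 on CPU -- 4x smaller than float32
    x = batch_uint8.to("cuda", non_blocking=True)
    x = x.float().div_(255.0)
    return (x - MEAN) / STD
```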

Then of course there are the more sciency optimizations, like using a smaller model, quantization, distillation, etc., that are easy but would require some back and forth with your science team.
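For example, post-training dynamic quantization is a one-liner to try (the toy model below is a placeholder; you'd still want an accuracy check with your science team before shipping):

```python
# Sketch of dynamic quantization: Linear weights stored as int8, activations
# quantized on the fly at runtime (CPU inference). Model is a placeholder.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# Package the quantized weights into a .mar as usual before serving.
```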

I will say I wish there were an easy switch to get ideal inference performance. There are a lot of tools, but you can go a long way by running some simple benchmarks and profiling your model for actual bottlenecks. It might also help to scan the top posts here, https://github.com/pytorch/serve#-news, and see if anything seems particularly relevant to your workload.

Also, if you'd like to meet and discuss your workload, I'm happy to arrange a quick call where we can give you more tailored advice.

mreso commented 1 year ago

Hi, Mark already made some great points. Just want to chime in and stress the point that it is very important to first figure out what your bottleneck actually is before you go and apply optimizations.

For starters, you could look at your GPU utilization. nvidia-smi is sometimes not the best tool for this, as it only gives you a rough measure of the fraction of time a kernel was running on one of the SMs, not how many cores the kernel actually consumed. Better tools for this are usually profilers like Nsight Compute.
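A quick first pass can also be done with the built-in PyTorch profiler to see whether time goes to CPU-side work or GPU kernels. A minimal sketch with a placeholder model and input, assuming a CUDA device:

```python
# Profile a few forward passes and compare CPU vs CUDA time per op.
# Model and input are placeholders for your actual workload.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda().eval()
x = torch.randn(64, 1024, device="cuda")

with torch.inference_mode():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(20):
            model(x)

# Sort by CUDA time to see which kernels dominate, then compare against CPU time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```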