goha111 opened 1 year ago
OK so this question doesn't have a short answer, so here we go.
The easiest way to increase the throughput of a model is to increase the number of workers and the batch size.
Both of those come with diminishing returns, and at some point your throughput will actually decrease, so you need to benchmark and see what works. That's why we have a benchmarking tool https://github.com/pytorch/serve/tree/master/benchmarks which we also run daily in CI ourselves.
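For concreteness, setting those two knobs usually happens when you register the model through the management API; a rough sketch (the host, archive name and all the numbers are placeholders you'd tune with the benchmark tool):

```python
import requests

MGMT = "http://localhost:8081"  # default TorchServe management port, adjust to your deployment

# Register a model with server-side batching and a couple of workers.
resp = requests.post(
    f"{MGMT}/models",
    params={
        "url": "my_model.mar",    # hypothetical archive already in the model store
        "batch_size": 8,          # requests aggregated into one forward pass
        "max_batch_delay": 50,    # ms to wait while filling a batch
        "initial_workers": 2,     # worker processes to start right away
    },
)
print(resp.status_code, resp.text)

# Workers can be scaled later without re-registering the model.
resp = requests.put(f"{MGMT}/models/my_model", params={"min_worker": 4})
print(resp.status_code, resp.text)
```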
You also specifically asked about sharing models, and this is something @mreso worked on with prototypes for a C++ frontend and a torch.deploy integration, so tagging him to share more insight on this point.
The next set of optimizations comes from using an optimized ML runtime. For this we have options for TensorRT, ORT, Faster and Better Transformer, and torch.compile, and you can see what works for you here https://github.com/pytorch/serve/releases. Typically all these runtimes optimize models by fusing operations, so you're minimizing data transfers to the GPU.
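As a minimal sketch, assuming a recent TorchServe where `BaseHandler.initialize` loads the eager model into `self.model` and you're on PyTorch 2.x, wrapping the model with torch.compile in a custom handler could look like this:

```python
import torch
from ts.torch_handler.base_handler import BaseHandler


class CompiledHandler(BaseHandler):
    """Hypothetical custom handler that adds torch.compile on top of the default loading."""

    def initialize(self, context):
        super().initialize(context)  # BaseHandler loads the eager model into self.model
        # The first request after this will be slow while graphs compile;
        # subsequent requests should benefit from the fused kernels.
        self.model = torch.compile(self.model)
```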
Another way of minimizing data transfers to the GPU is to debug your preprocessing pipeline, or to leverage an optimized solution like DALI https://github.com/pytorch/serve/pull/1958
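Even without DALI, the basic idea is to keep decoding on the CPU but push the per-pixel work onto the GPU and overlap the host-to-device copy. A rough sketch (shapes and normalization stats are just illustrative):

```python
import torch


def preprocess_batch(uint8_images: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """uint8_images: (N, H, W, 3) uint8 tensor decoded on the CPU."""
    batch = uint8_images.pin_memory()            # page-locked memory enables async copies
    batch = batch.to(device, non_blocking=True)  # overlap the copy with other host work
    batch = batch.permute(0, 3, 1, 2).float().div_(255)  # NHWC -> NCHW, scale on the GPU
    mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)
    return (batch - mean) / std
```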
Then of course there are the more sciency optimizations like using a smaller model, quantization, distillation, etc., which are easy to apply but would require some back and forth with your science team.
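For instance, post-training dynamic quantization is close to a one-liner to try for Linear/LSTM-heavy models served on CPU (the toy model below is just a stand-in for yours):

```python
import torch
import torch.nn as nn

# Stand-in model; replace with your own eager-mode model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Weights of nn.Linear modules are quantized to int8, activations stay float.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)
```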
I will say I wish there were an easy switch to get ideal inference performance. There are a lot of tools, but you can go a long way by running some simple benchmarks and profiling your model for actual bottlenecks. It might also help to scan the top posts here https://github.com/pytorch/serve#-news and see if anything seems particularly relevant to your workload.
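A simple benchmark can be as small as a batch-size sweep like the sketch below (resnet18 and the synthetic inputs are only placeholders for your own model and data, so treat the numbers as a rough guide):

```python
import time
import torch
import torchvision  # resnet18 is just a stand-in for your model

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet18().eval().to(device)


@torch.inference_mode()
def throughput(batch_size: int, iters: int = 50) -> float:
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    for _ in range(5):  # warmup
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return batch_size * iters / (time.perf_counter() - start)


for bs in (1, 4, 8, 16, 32, 64):
    print(f"batch_size={bs:3d}  ~{throughput(bs):8.1f} images/s")
```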
Also if you'd like to quickly meet and discuss your workload, happy to arrange a quick call where we can give you more tailored advice
Hi, Mark already made some great points. Just want to chime in and stress the point that it is very important to first figure out what your bottleneck actually is before you go and apply optimizations.
For starters you could look at your GPU utilization. nvidia-smi is sometimes not the best tool for this, as it only gives you a rough measure of the fraction of time a kernel was running on one of the SMs, not how many cores the kernel actually consumed. Better tools for this are usually profilers like Nsight Compute.
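A quick first pass before reaching for Nsight Compute could be torch.profiler, which already shows whether the time is going into CPU-side ops (preprocessing, copies) or CUDA kernels (again, resnet18 is just a placeholder for your model):

```python
import torch
import torchvision  # stand-in model
from torch.profiler import ProfilerActivity, profile

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet18().eval().to(device)
x = torch.randn(8, 3, 224, 224, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with torch.inference_mode(), profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        model(x)

# If most of the time shows up in CPU ops rather than CUDA kernels,
# faster GPU kernels won't help much and the pipeline is the bottleneck.
sort_key = "self_cuda_time_total" if device == "cuda" else "self_cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=15))
```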
🚀 The feature
This is probably more of a discussion.
In a production PyTorch serving system, one machine will serve many models (say, at least 10 heavy models). The models are very large, so ideally they should be loaded into (CPU/GPU) memory only once.
Now we have the problem:
So I have two questions related to this:
Motivation, pitch
Achieving high throughput in a production serving scenario
Alternatives
No response
Additional context
No response