Are most of the parameters that users want to tune available from the RPC endpoint, or do you need to change the server config for each set? It might be feasible to bundle up the torchserve benchmark suite as a torchx component and run it via Ax for proper Bayesian HPO.
That sounds like it may require a lot of knobs to configure correctly, so it might be too hard for the average user to get started with.
> Are most of the parameters that users want to tune available from the RPC endpoint, or do you need to change the server config for each set?
You can change a lot of torchserve parameters dynamically using the management API, and you can also swap in a new, more optimized model in the same running instance, so it should be possible to do quite a bit without ever having to stop torchserve.
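For example, a minimal sketch assuming the default management port 8081 and an already-registered `resnet152` model (both assumptions):

```python
import requests

MGMT = "http://localhost:8081"  # default TorchServe management address (assumption)

# Scale the number of workers for a model that is already registered.
requests.put(f"{MGMT}/models/resnet152", params={"min_worker": 4, "synchronous": "true"})

# Register a newer, more optimized archive without stopping the server
# (assumes resnet152_optimized.mar is already in the model store).
requests.post(f"{MGMT}/models", params={"url": "resnet152_optimized.mar", "initial_workers": 2})

# Inspect the current configuration of the model.
print(requests.get(f"{MGMT}/models/resnet152").json())
```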
> That sounds like it may require a lot of knobs to configure correctly, so it might be too hard for the average user to get started with.
Maybe, I think we can do quite a bit to streamline comparisons.
I think there are a lot of opportunities for automatic tuning while the server is running, based on memory, QPS and latency.
We might be able to mitigate a lot of these pain points at the service level so you don't even need to run a parameter sweep at all.
Advanced users would likely want more control, but I bet smarter defaults/auto-tuning are good enough for 80% of users.
TorchServe Model Analyzer
Tasks
Milestone 1
Milestone 2
Problem Statement
Preparing a model for inference is becoming an increasingly important part of shipping models to production. There's an overwhelming amount of choice: which hardware to use, which configs to set, which optimizations to apply, what the tradeoffs are, and how to benchmark everything properly.
All of this has immediate impact because it helps anyone run PyTorch models more quickly and more cheaply.
This is too many problems to delegate to users in an unstructured way, so what does an end-to-end solution look like?
Subproblems
There are a few components to the solution.
The good thing is we've already built most of these tools in isolation, but we haven't yet strung them together into a cohesive story.
Finally, we need to add support for benchmarking on specific Docker images so users can modularize their benchmark runs and anyone can run them without a complex machine setup.
Solutions
Benchmarking models
The current `torchserve` benchmarking story relies on `apache-bench`, where users package up a model into a `.mar` file, set up a `config.json`, and then run `python benchmark-ab.py --config config.json`, which provides lots of useful information like throughput, latency at X, and number of errors. There's also a lot of nuance that benchmarking tools need to think about, like process isolation and cold starts, which will throw off people building their own benchmarking tools from scratch.
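For reference, the `config.json` passed to `benchmark-ab.py` looks roughly like the following; the keys shown here are illustrative and from memory, so treat the benchmark README in pytorch/serve as the authoritative schema:

```json
{
  "url": "https://example.com/model-store/resnet152.mar",
  "exec_env": "local",
  "batch_size": 1,
  "batch_delay": 200,
  "workers": 1,
  "concurrency": 10,
  "requests": 1000,
  "input": "../examples/image_classifier/kitten.jpg",
  "content_type": "application/jpg"
}
```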
And this approach is now being improved in #1442 by @lxning to
A major benefit of the approach in #1442 is that using a standard format makes it easy to compare, sort, and filter models, e.g.
The only thing #1442 is missing is letting anyone also run comprehensive benchmarks on real infrastructure. There are two options here:
AWS is convenient because it's the most flexible for setting up the environment, yet it won't work for community members who may be using another cloud. Making our work multi-cloud will also be very time consuming unless we move to something like Terraform templates.
GitHub Actions needs work to set up custom runners that allow GPU profiling, BUT its big benefit is that artifacts can be made available directly in the GitHub Actions tab, so anyone can inspect them without needing permissions to a special S3 bucket. Also, because it's all on GitHub, if community members want to run their own benchmarks, all they need to do is fork the repo and run them.
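As a rough sketch of the GitHub Actions route (the workflow name, runner labels, and report path below are all placeholders, not an existing workflow):

```yaml
# Hypothetical benchmark workflow; runner labels and report path are assumptions.
name: torchserve-benchmark
on: workflow_dispatch
jobs:
  benchmark:
    runs-on: [self-hosted, gpu]          # assumes a GPU self-hosted runner is registered
    steps:
      - uses: actions/checkout@v2
      - name: Run benchmark suite
        run: python benchmarks/benchmark-ab.py --config benchmarks/config.json
      - name: Publish report as a build artifact
        uses: actions/upload-artifact@v2
        with:
          name: benchmark-report
          path: /tmp/benchmark            # wherever benchmark-ab.py writes its report (assumption)
```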
Profiling
We've recently added support in `torchserve` for the PyTorch profiler, gated behind a simple environment variable (https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#profiling):
`export ENABLE_TORCH_PROFILER=TRUE`
This provides some useful insight when it comes to debugging problems with the PyTorch model, but not so much for problems with configuring `torchserve`. There is an extensive set of profilers that can be run in a separate process without affecting performance, which we could either recommend users run or run out of the box gated behind some other environment variable.
Options for macro profilers
Exploring various optimizations
Optimizations fall into a few very different categories
Optimizations to the model
When optimizing a model there are a few commonly used tricks, from quantization to pruning to distillation to simply using a smaller model.
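As one concrete illustration of the quantization bucket (a minimal sketch using a toy model, not how any of the tools below do it internally), dynamic quantization in PyTorch is essentially a one-liner:

```python
import torch

# Toy model standing in for a real one; dynamic quantization targets Linear/LSTM layers.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).eval()

# Replace Linear weights with int8 versions; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = quantized(torch.randn(8, 256))
print(out.shape)  # torch.Size([8, 10]) -- same interface, smaller weights
```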
We've attempted to unify many of these tools behind a single CLI interface called `torchprep`, a still very much experimental tool which needs lots of work. `torchprep` unfortunately has 3 weaknesses.
Input data format
An example of how to use `torchprep` is this:
`torchprep quantize models/resnet152.pt int8 --input-shape 64,3,7,7`
The input shape is used to generate a random tensor `torch.randn(64,3,7,7)`, run it through the resnet152 model, and calculate the latency. In this case resnet152 expects a single input with shape `[64,3,7,7]`; however, this doesn't work quite as well for something like BERT, which requires 2 inputs, the tokens and the masks. The current data format also doesn't make it easy to deal with arbitrarily sized data like batches, which can range from 1 to n. Instead, we could design a YAML-based data format that would support multiple inputs and dtypes, something along the lines of what @jamesr66a has suggested.
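Purely as a sketch of what such a format could look like (every field name below is made up for illustration, not an agreed spec), a BERT-style model with two inputs might be described as:

```yaml
# Hypothetical torchprep input spec; field names are illustrative only.
inputs:
  - name: input_ids
    dtype: int64
    shape: [batch, 128]        # symbolic dims instead of hard-coded sizes
  - name: attention_mask
    dtype: int64
    shape: [batch, 128]
batch_sizes: [1, 8, 32]        # sweep over batch sizes from 1 to n
```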
Training aware optimizations
Training-aware optimizations generally retain better model performance and are used in libraries like `huggingface/optimum`. `torchprep` currently works only with saved model weights, but a natural extension would be adding support for users to plug in their own training loop or data loader.
Runtime exports
A lot of `torchserve` users have been looking to export their models to an optimized runtime like TensorRT/IPEX/ORT for accelerated inference. All of these runtimes call an `inference` within the context of a `session`, and won't work in an offline manner or be stored directly on a saved model.
Optimizations to the serving framework
Optimizations to the serving framework are even more opaque but include notable things like `num_workers`, `num_threads`, number of models per GPU, `queue_size`, and `batch_size`.
Out of all of these configurations only `batch_size` has a clear tradeoff, whereas for the others the tradeoff isn't so clear and the expectation is to run a grid search, which depending on the model can take days of experiments and still not lead to a conclusive answer. The goal should not be a comprehensive grid search but just enough to be able to detect performance tradeoffs.
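As a rough sketch of what "just enough" could mean (assuming a model named `resnet152` is already registered, the default 8080/8081 ports, and a `kitten.jpg` payload on disk; this is not an existing feature, just the management API driven from a script):

```python
import time
import requests

MGMT, INFER = "http://localhost:8081", "http://localhost:8080"  # default ports (assumption)
payload = open("kitten.jpg", "rb").read()                       # sample input, just a stand-in

for workers in (1, 2, 4, 8):
    # Rescale workers for the already-registered model and wait for it to take effect.
    requests.put(f"{MGMT}/models/resnet152",
                 params={"min_worker": workers, "synchronous": "true"})

    latencies = []
    for _ in range(100):
        start = time.perf_counter()
        requests.post(f"{INFER}/predictions/resnet152", data=payload)
        latencies.append(time.perf_counter() - start)

    latencies.sort()
    p99 = latencies[int(0.99 * len(latencies)) - 1]
    print(f"workers={workers} p99={p99 * 1000:.1f}ms")
```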
So there are a few options here
torchserve --start --configure_worker
which would do this.
Conclusion
Analyzing models is hard and building benchmark suites is hard, so it's worthwhile creating a streamlined experience for all of the above to make it easier for people to benchmark, profile, optimize, and analyze their models.
cc: @chauhang @HamidShojanazeri @yqhu @mreso @lxning @nskool @maaquib @ashokei @d4l3k @gchanan