Are most of the parameters that users want to tune available from the RPC endpoint, or do you need to change the server config for each set? It might be feasible to bundle up the torchserve benchmark suite as a torchx component and run it via Ax for proper Bayesian HPO.
That sounds like it may require a lot of knobs to configure correctly, so it might be too hard for the average user to get started with.
> Are most of the parameters that users want to tune available from the RPC endpoint, or do you need to change the server config for each set?
You can change a lot of torchserve parameters dynamically using the management API, and you can also swap in a new, more optimized model in the same running instance, so it should be possible to do quite a bit without ever having to stop torchserve.
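For example, a minimal sketch assuming the default management port 8081 and an already-registered `resnet152` model (both assumptions):

```python
import requests

MGMT = "http://localhost:8081"  # default TorchServe management address (assumption)

# Scale the number of workers for a model that is already registered.
requests.put(f"{MGMT}/models/resnet152", params={"min_worker": 4, "synchronous": "true"})

# Register a newer, more optimized archive without stopping the server
# (assumes resnet152_optimized.mar is already in the model store).
requests.post(f"{MGMT}/models", params={"url": "resnet152_optimized.mar", "initial_workers": 2})

# Inspect the current configuration of the model.
print(requests.get(f"{MGMT}/models/resnet152").json())
```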
> That sounds like it may require a lot of knobs to configure correctly, so it might be too hard for the average user to get started with.
Maybe, I think we can do quite a bit to streamline comparisons.
I think there are a lot of opportunities for automatic tuning while the server is running, based on memory, QPS and latency.
We might be able to mitigate a lot of these pain points at the service level so you don't even need to run a parameter sweep at all.
Advanced users would likely want more control, but I bet smarter defaults/auto-tuning are good enough for 80% of users.
TorchServe Model Analyzer
Tasks
Milestone 1
Milestone 2
Problem Statement
Preparing a model for inference is becoming an increasingly important part of shipping models to production. There's an overwhelming amount of choice: which hardware to use, which configs to set, which optimizations to apply, what the tradeoffs are, and how to benchmark everything properly.
All of this has immediate impact because it helps anyone run PyTorch models more quickly and more cheaply.
This is too many problems to delegate to users in an unstructured way, so what does an end-to-end solution look like?
Subproblems
There are a few components to the solution.
The good thing is we've already built most of these tools in isolation, but we haven't yet strung them together into a cohesive story.
Finally, we need to add support for benchmarking on specific Docker images so users can modularize their benchmark runs and anyone can run them without a complex machine setup.
Solutions
Benchmarking models
The current `torchserve` benchmarking story relies on `apache-bench`, where users package up a model into a `.mar` file, set up a `config.json`, and then run `python benchmark-ab.py --config config.json`, which provides lots of useful information like throughput, latency at X, and number of errors. There's also a lot of nuance that benchmarking tools need to think about, like process isolation and cold starts, which will throw off people building their own benchmarking tools from scratch.
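For reference, the `config.json` passed to `benchmark-ab.py` looks roughly like the following; the keys shown here are illustrative and from memory, so treat the benchmark README in pytorch/serve as the authoritative schema:

```json
{
  "url": "https://example.com/model-store/resnet152.mar",
  "exec_env": "local",
  "batch_size": 1,
  "batch_delay": 200,
  "workers": 1,
  "concurrency": 10,
  "requests": 1000,
  "input": "../examples/image_classifier/kitten.jpg",
  "content_type": "application/jpg"
}
```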
And this approach is now being improved in #1442 by @lxning to
A major benefit of the approach in #1442 is that using a standard format makes it easy to compare, sort, and filter models, e.g.
The only thing #1442 is missing is letting anyone also run comprehensive benchmarks on real infrastructure. There are two options here:
AWS is convenient because it's the most flexible for setting up the environment, yet it won't work for community members who may be using another cloud. Making our work multi-cloud will also be very time consuming unless we move to something like Terraform templates.
GitHub Actions needs work to set up custom runners that allow GPU profiling, BUT its big benefit is that artifacts can be made available directly in the GitHub Actions tab, so anyone can inspect them without needing permissions to a special S3 bucket. Also, because it's all on GitHub, if community members want to run their own benchmarks, all they need to do is fork the repo and run them.
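As a rough sketch of the GitHub Actions route (the workflow name, runner labels, and report path below are all placeholders, not an existing workflow):

```yaml
# Hypothetical benchmark workflow; runner labels and report path are assumptions.
name: torchserve-benchmark
on: workflow_dispatch
jobs:
  benchmark:
    runs-on: [self-hosted, gpu]          # assumes a GPU self-hosted runner is registered
    steps:
      - uses: actions/checkout@v2
      - name: Run benchmark suite
        run: python benchmarks/benchmark-ab.py --config benchmarks/config.json
      - name: Publish report as a build artifact
        uses: actions/upload-artifact@v2
        with:
          name: benchmark-report
          path: /tmp/benchmark            # wherever benchmark-ab.py writes its report (assumption)
```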
Profiling
We've recently added support in `torchserve` for the PyTorch profiler, gated behind a simple environment variable (https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#profiling):
`export ENABLE_TORCH_PROFILER=TRUE`
This provides some useful insight when it comes to debugging problems with the PyTorch model, but not so much for problems with configuring `torchserve`. There is an extensive set of profilers that can be run in a separate process without affecting performance, which we could either recommend users run or run out of the box gated behind some other environment variable.
Options for macro profilers
Exploring various optimizations
Optimizations fall into a few very different categories
Optimizations to the model
When optimizing a model there are a few commonly used tricks, from quantization to pruning to distillation to simply using a smaller model.
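As one concrete illustration of the quantization bucket (a minimal sketch using a toy model, not how any of the tools below do it internally), dynamic quantization in PyTorch is essentially a one-liner:

```python
import torch

# Toy model standing in for a real one; dynamic quantization targets Linear/LSTM layers.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).eval()

# Replace Linear weights with int8 versions; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = quantized(torch.randn(8, 256))
print(out.shape)  # torch.Size([8, 10]) -- same interface, smaller weights
```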
We've attempted to unify many of these tools behind a single CLI interface called `torchprep`, a still very much experimental tool which needs lots of work. `torchprep` unfortunately has 3 weaknesses.
Input data format
An example of how to use `torchprep` is this:
`torchprep quantize models/resnet152.pt int8 --input-shape 64,3,7,7`
The input shape is used to generate a random tensor `torch.randn(64,3,7,7)`, run it through the resnet152 model, and calculate the latency. In this case resnet152 expects a single input with shape `[64,3,7,7]`; however, this doesn't work quite as well for something like BERT, which requires 2 inputs, the tokens and the masks. The current data format also doesn't make it easy to deal with arbitrarily sized data like batches, which can range from 1 to n. Instead, we could design a YAML-based data format that would support multiple inputs and dtypes, something along the lines of what @jamesr66a has suggested.
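Purely as a sketch of what such a format could look like (every field name below is made up for illustration, not an agreed spec), a BERT-style model with two inputs might be described as:

```yaml
# Hypothetical torchprep input spec; field names are illustrative only.
inputs:
  - name: input_ids
    dtype: int64
    shape: [batch, 128]        # symbolic dims instead of hard-coded sizes
  - name: attention_mask
    dtype: int64
    shape: [batch, 128]
batch_sizes: [1, 8, 32]        # sweep over batch sizes from 1 to n
```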
Training aware optimizations
Training-aware optimizations generally retain better model performance and are used in libraries like `huggingface/optimum`. `torchprep` currently works only with saved model weights, but a natural extension would be adding support for users to plug in their own training loop or data loader.
Runtime exports
A lot of `torchserve` users have been looking to export their models to an optimized runtime like TensorRT/IPEX/ORT for accelerated inference. All of these runtimes call an `inference` within the context of a `session`, and won't work in an offline manner or be stored directly on a saved model.
Optimizations to the serving framework
Optimizations to the serving framework are even more opaque but include notable things like `num_workers`, `num_threads`, number of models per GPU, `queue_size`, and `batch_size`.
Out of all of these configurations only `batch_size` has a clear tradeoff, whereas for the others the tradeoff isn't so clear and the expectation is to run a grid search, which depending on the model can take days of experiments and still not lead to a conclusive answer. The goal should not be a comprehensive grid search but just enough to be able to detect performance tradeoffs.
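As a rough sketch of what "just enough" could mean (assuming a model named `resnet152` is already registered, the default 8080/8081 ports, and a `kitten.jpg` payload on disk; this is not an existing feature, just the management API driven from a script):

```python
import time
import requests

MGMT, INFER = "http://localhost:8081", "http://localhost:8080"  # default ports (assumption)
payload = open("kitten.jpg", "rb").read()                       # sample input, just a stand-in

for workers in (1, 2, 4, 8):
    # Rescale workers for the already-registered model and wait for it to take effect.
    requests.put(f"{MGMT}/models/resnet152",
                 params={"min_worker": workers, "synchronous": "true"})

    latencies = []
    for _ in range(100):
        start = time.perf_counter()
        requests.post(f"{INFER}/predictions/resnet152", data=payload)
        latencies.append(time.perf_counter() - start)

    latencies.sort()
    p99 = latencies[int(0.99 * len(latencies)) - 1]
    print(f"workers={workers} p99={p99 * 1000:.1f}ms")
```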
So there are a few options here
torchserve --start --configure_worker
which would do this.
Conclusion
Analyzing models is hard and building benchmark suites is hard, so it's worthwhile creating a streamlined experience for all of the above to make it easier for people to benchmark, profile, optimize, and analyze their models.
cc: @chauhang @HamidShojanazeri @yqhu @mreso @lxning @nskool @maaquib @ashokei @d4l3k @gchanan