pytorch / serve

Serve, optimize and scale PyTorch models in production
Apache License 2.0

Does TorchServe have better performance than calling PyTorch #1707

Open Hegelim opened 2 years ago

Hegelim commented 2 years ago

From doc here:

It says TorchServe is a tool used to serve PyTorch models in production.

I am wondering whether, in theory, we can expect better performance (in terms of speed and of GPU and system memory usage) when doing inference on PyTorch models through TorchServe compared with not using TorchServe.

I have been searching online for a satisfying answer with a detailed comparison of the two, but weirdly I couldn't find any.

For example, let's say I have a pretrained PyTorch model saved as a checkpoint. I can either load this model and do inference directly, or I can serve the model using TorchServe and do inference through the REST API. Which is faster? If I should expect TorchServe to do inference faster, what does it use under the hood to deliver the speedup? Maybe distributed computing? I am asking because my concern is that TorchServe is primarily used for serving the model and does not fundamentally change the form of the model, so we cannot expect a performance boost.

Any explanation is appreciated.

msaroufim commented 2 years ago

If you're running inference on a single model with a single worker, then not using any serving framework will likely be the fastest. The benefits of TS come in when you're managing multiple workers per model or multiple models. But TS is also about integrations with Kubernetes and Docker, plus management and metrics APIs with exports, to make your models prod ready. We also try to include reasonable defaults or collaborate with hardware providers so you can get better out-of-the-box performance.
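Since the answer depends on the setup, the single-worker case is easy to check by measurement. Below is a minimal timing sketch using only the Python standard library. `measure_latency` and `fake_predict` are hypothetical names introduced here for illustration; in practice you would pass one callable that runs the model directly and another that sends a request to the served model, and compare the numbers.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(predict: Callable[[], object], runs: int = 50, warmup: int = 5) -> Dict[str, float]:
    """Time repeated calls to `predict` and return latency statistics in milliseconds."""
    # Warm-up calls are discarded so one-time costs (lazy init, caches) do not skew results.
    for _ in range(warmup):
        predict()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        # Index of the 95th percentile, clamped to the last sample.
        "p95_ms": samples[min(len(samples) - 1, int(0.95 * len(samples)))],
    }


# Hypothetical stand-in for a real prediction call (direct model call or REST request).
def fake_predict() -> int:
    return sum(range(1000))


stats = measure_latency(fake_predict, runs=20, warmup=2)
print(stats)
```

When timing a model on a GPU, the callable should wait for the GPU work to finish before returning (for example by copying the result back to the CPU), otherwise only the kernel launch is measured.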

Hegelim commented 2 years ago

I see. For managing multiple workers to work, does the model itself need to have the capacity to support multiple workers?

msaroufim commented 2 years ago

There's usually nothing you need to do to make your model work with multiple workers. The only limitation is that if your model already does some crazy multiprocessing, it won't play too well with TS.
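To illustrate that the worker count is a serving-side setting and not something coded into the model, here is a rough sketch that builds requests for TorchServe's management and inference REST APIs. It assumes the default local addresses (management on port 8081, inference on port 8080) and a hypothetical model registered as `my_model`; the helper names are made up for this example. The requests are only constructed here, not sent.

```python
import urllib.parse
import urllib.request

MANAGEMENT_URL = "http://localhost:8081"  # assumption: default management address
INFERENCE_URL = "http://localhost:8080"   # assumption: default inference address


def build_scale_workers_request(model_name: str, min_worker: int) -> urllib.request.Request:
    """Build a PUT request asking the server to run at least `min_worker` workers for a model."""
    query = urllib.parse.urlencode({"min_worker": min_worker, "synchronous": "true"})
    url = f"{MANAGEMENT_URL}/models/{urllib.parse.quote(model_name)}?{query}"
    return urllib.request.Request(url, method="PUT")


def build_prediction_request(model_name: str, payload: bytes) -> urllib.request.Request:
    """Build a POST request that sends raw input bytes to a served model."""
    url = f"{INFERENCE_URL}/predictions/{urllib.parse.quote(model_name)}"
    return urllib.request.Request(url, data=payload, method="POST")


# "my_model" is a hypothetical model name used only for illustration.
scale_req = build_scale_workers_request("my_model", min_worker=4)
predict_req = build_prediction_request("my_model", b"example input")
print(scale_req.get_method(), scale_req.full_url)
print(predict_req.get_method(), predict_req.full_url)

# With a running server, each request could be sent with:
#   urllib.request.urlopen(scale_req)
```

The model code stays the same in both cases; only the number of workers the server keeps running for that model changes.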

Hegelim commented 2 years ago

Thanks! Just to confirm again, by multiple workers, do you mean multiple GPUs?