Open Hegelim opened 2 years ago
If you're running inference on a single model with a single worker, then not using any framework will likely be fastest. TorchServe's (TS) benefits come in when you're managing multiple workers per model, or multiple models. But TS is also about integrations with Kubernetes and Docker, plus management and metrics APIs with exporters, to make your models production ready. We also try to ship reasonable defaults and collaborate with hardware providers so you can get better out-of-the-box performance.
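For concreteness, a minimal sketch of what "managing multiple workers" looks like in TS. This assumes TorchServe is installed and the model has already been packaged into a `.mar` archive; the model name `mymodel` and the file paths are hypothetical:

```shell
# config.properties: ask TS to spawn several worker processes per model.
# Each worker holds its own copy of the model and serves requests in parallel.
echo "default_workers_per_model=4" > config.properties

# Start TorchServe with the archived model (mymodel.mar is hypothetical)
torchserve --start --model-store model_store --models mymodel.mar \
           --ts-config config.properties

# Scale workers for one model at runtime via the management API (default port 8081)
curl -X PUT "http://localhost:8081/models/mymodel?min_worker=2&max_worker=8"

# Requests go through the inference API (default port 8080)
curl -X POST http://localhost:8080/predictions/mymodel -T sample_input.json
```

The point of the worker knobs is throughput under concurrent load, not making any single request faster.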
I see. Regarding managing multiple workers: does the model itself need to have the capacity to support multiple workers for this to work?
There's usually nothing you need to do to make your model work with multiple workers. The only limitation is that if your model already does some heavy multiprocessing of its own, it won't play well with TS.
Thanks! Just to confirm: by multiple workers, do you mean multiple GPUs?
From the docs here: https://github.com/pytorch/serve
It says TorchServe is a tool for serving PyTorch models in production.
I am wondering whether, in theory, we can expect better performance (in terms of speed, and GPU and system memory usage) when doing inference on PyTorch models with TorchServe versus without it.
I have been searching online for a satisfying answer with a detailed comparison between the two, but strangely I couldn't find any.
For example, say I have a pretrained PyTorch model saved as a checkpoint. I can either load this model and run inference on it directly, or serve it with TorchServe and run inference through the REST API. Which is faster? If I should expect TorchServe to be faster, what does it use under the hood to deliver that speedup? Distributed computing, perhaps? I am asking because my concern is that TorchServe is primarily a tool for serving the model: it does not fundamentally change the model itself, so we should not expect a performance boost from it.
Any explanation is appreciated.
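The intuition behind this concern can be made concrete with a stdlib-only sketch: per single request, serving over REST adds serialization and a network hop on top of the same underlying model call, so it cannot beat calling the loaded model directly. The `predict` function below is a hypothetical stub standing in for an already-loaded PyTorch model, and the tiny HTTP server stands in for a serving layer; this is not TorchServe itself, just the overhead pattern:

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(x):
    """Stub standing in for an already-loaded PyTorch model (hypothetical)."""
    return [v * 2.0 for v in x]

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Deserialize request -> run the same predict() -> serialize response
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.dumps(predict(json.loads(body))).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep benchmark output clean
        pass

# Serve on an ephemeral localhost port in a background thread
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}"

def rest_predict(x):
    req = urllib.request.Request(url, data=json.dumps(x).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

x = [1.0, 2.0, 3.0]
same_answer = predict(x) == rest_predict(x)  # both paths compute the same thing

n = 200
t0 = time.perf_counter()
for _ in range(n):
    predict(x)
direct = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(n):
    rest_predict(x)
rest = time.perf_counter() - t0

print(f"direct: {direct:.4f}s  rest: {rest:.4f}s  same answer: {same_answer}")
```

So the win from a serving framework is not single-request latency; it is throughput under concurrent load (multiple workers), plus the operational pieces around the model.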