yurkoff-mv opened 1 year ago
@yurkoff-mv TorchServe will soon provide a new feature, "TorchServe Open Platform for Large Distributed Model Inference", which should address your questions.
@yurkoff-mv torchrun has been integrated into TorchServe; here is an example PR
Thank you! Unfortunately, I cannot view these files. When will they be publicly available? And another question: is it planned to support GPUs located on several nodes (multi-node)?
🚀 The feature
How to deploy a model service that spans multiple GPUs?
Motivation, pitch
I have a large model which I run via torchrun, using the FairScale library to distribute it across GPUs. How can I make this work with TorchServe? I need to initialize the model with weights, shard it across multiple GPUs, and have it wait for batches to infer. But torchrun can only launch a function with a fixed set of parameters. How can I write a handler in this case?
Alternatives
No response
Additional context
No response
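For reference, the handler question above can be sketched roughly as follows. This is a minimal illustration of the TorchServe handler lifecycle (initialize / preprocess / inference / postprocess / handle): the class below only mimics that contract and does not import TorchServe itself, and the model, the FairScale sharding step, and the request format are stubbed-out assumptions, not real API calls. In an actual handler you would subclass `ts.torch_handler.base_handler.BaseHandler` and do the distributed setup inside `initialize`.

```python
class DistributedModelHandler:
    """Sketch of a TorchServe-style custom handler for a sharded model.

    Hypothetical stand-in: method names follow the TorchServe handler
    contract, but all torch/FairScale calls are replaced with stubs so
    the lifecycle is visible.
    """

    def __init__(self):
        self.initialized = False
        self.model = None

    def initialize(self, context):
        # Real handler: read the weight location from
        # context.system_properties["model_dir"], build the model,
        # wrap it with FairScale sharding (e.g. FSDP), and place the
        # shards on the available GPUs. Stubbed here with a toy model.
        self.model = lambda batch: [x * 2 for x in batch]
        self.initialized = True

    def preprocess(self, requests):
        # Collect the inference batch from the incoming requests.
        # The {"data": ...} request shape is an assumption for this sketch.
        return [req["data"] for req in requests]

    def inference(self, batch):
        # Real handler: forward pass under torch.no_grad().
        return self.model(batch)

    def postprocess(self, outputs):
        # One response item per request, as TorchServe expects.
        return outputs

    def handle(self, requests, context):
        # Entry point TorchServe calls for each batch.
        if not self.initialized:
            self.initialize(context)
        return self.postprocess(self.inference(self.preprocess(requests)))
```

Usage: `DistributedModelHandler().handle([{"data": 1}, {"data": 2}], context=None)` returns `[2, 4]` with the stub model. The key point is that TorchServe, not torchrun, drives the process: the server invokes `handle` per batch, so the one-shot-function limitation of torchrun does not apply inside the handler.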