Implement PyTorch controller for multi node GPU scaling

pluralsh / plural-artifacts

Artifacts for applications deployable by plural

Apache License 2.0


Implement PyTorch controller for multi node GPU scaling #155

Open jaystary opened 2 years ago

jaystary commented 2 years ago

Use Case

We want to use multi-node GPU scaling for PyTorch, both for a benchmark and for potential larger-scale model training.

Ideas of Implementation

Implement the Kubeflow (KF) Training Operator, which as an added benefit should unlock the other relevant frameworks as well: https://github.com/kubeflow/training-operator

Message from the maintainers:

Excited about this feature? Give it a :thumbsup:. We factor engagement into prioritization.

davidspek commented 2 years ago

The PyTorch operator resource can already be used in the current Kubeflow deployment, so there's no need to pause development while we update the deployment to the training operator.
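
For anyone who wants to try this before the deployment is updated, here is a rough sketch of what a multi-node job could look like. It builds a `PyTorchJob` manifest as a plain Python dict and prints it as JSON. The job name, image, namespace, and replica counts are all hypothetical placeholders, and it assumes the cluster serves the `kubeflow.org/v1` API version and that the operator expects the training container to be named `pytorch`; please check both against the CRD installed in your cluster.

```python
import json


def build_pytorch_job(name, image, workers, gpus_per_node, namespace="kubeflow"):
    """Return a PyTorchJob manifest (dict) with one master pod and `workers` worker pods.

    All argument values are illustrative; nothing here is taken from the plural deployment.
    """

    def replica(count):
        # Each replica type gets the same pod template; one pod per node is assumed.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Assumption: the operator looks for a container named "pytorch".
                            "name": "pytorch",
                            "image": image,
                            "resources": {"limits": {"nvidia.com/gpu": gpus_per_node}},
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",  # assumption: verify against the installed CRD
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical image and sizes: 1 master + 3 workers, 1 GPU each.
    job = build_pytorch_job(
        "ddp-benchmark",
        "example.registry/ddp-benchmark:latest",
        workers=3,
        gpus_per_node=1,
    )
    print(json.dumps(job, indent=2))
```

Since `kubectl` accepts JSON as well as YAML, the printed manifest can be saved to a file and applied with `kubectl apply -f`. The total number of pods is the master plus the workers, so the example above would request four GPUs in total.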