kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.6k stars 697 forks

KEP-2170: Create PyTorch multi-node distributed training runtime #2211

Open andreyvelich opened 2 months ago

andreyvelich commented 2 months ago

Related: https://github.com/kubeflow/training-operator/issues/2170

We should create a ClusterTrainingRuntime for PyTorch multi-node distributed training.

/area runtime

yang20150702 commented 2 months ago

I'm learning training-operator v1 and would like to work on this issue. Please give me some suggestions.

deepanker13 commented 1 week ago

/assign