kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.61k stars, 700 forks

KEP-2170: Add APIs for TrainJob and TrainingRuntime #2206

Closed by andreyvelich 2 months ago

andreyvelich commented 3 months ago

Related: https://github.com/kubeflow/training-operator/issues/2170.

We should add APIs for TrainJob, TrainingRuntime, ClusterTrainingRuntime.

The directory structure that we want to follow:

/cmd/training-operator.v2alpha1/Dockerfile
/cmd/training-operator.v2alpha1/main.go

/pkg/apis/kubeflow.org/v2alpha1
/pkg/controller.v2/trainjob_controller.go

/pkg/webhooks/trainjob_webhook.go
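
As a rough illustration of how the three kinds could relate, here is a minimal sketch using only the Go standard library. It is not the API proposed in the KEP: the `RuntimeRef` type, its fields, and the `ValidateTrainJob` function are hypothetical names chosen for this example, and the assumption that a TrainJob points at either a TrainingRuntime or a ClusterTrainingRuntime is inferred from the kind names above. Only the group/version string and the kind names come from this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// GroupVersion mirrors the proposed path /pkg/apis/kubeflow.org/v2alpha1.
const GroupVersion = "kubeflow.org/v2alpha1"

// Kind names taken from this issue.
const (
	TrainJobKind               = "TrainJob"
	TrainingRuntimeKind        = "TrainingRuntime"
	ClusterTrainingRuntimeKind = "ClusterTrainingRuntime"
)

// RuntimeRef is a hypothetical reference from a TrainJob to a runtime.
// The field names are assumptions made for illustration only.
type RuntimeRef struct {
	Kind string
	Name string
}

// TrainJob is a heavily reduced, hypothetical stand-in for the real type.
type TrainJob struct {
	APIVersion string
	Kind       string
	Name       string
	RuntimeRef RuntimeRef
}

// ValidateTrainJob sketches the kind of check a validating webhook
// might perform; the rules here are assumptions, not the KEP's rules.
func ValidateTrainJob(job TrainJob) error {
	if job.APIVersion != GroupVersion || job.Kind != TrainJobKind {
		return fmt.Errorf("unexpected type %s/%s", job.APIVersion, job.Kind)
	}
	if job.RuntimeRef.Name == "" {
		return errors.New("runtime reference name must not be empty")
	}
	switch job.RuntimeRef.Kind {
	case TrainingRuntimeKind, ClusterTrainingRuntimeKind:
		return nil
	default:
		return fmt.Errorf("unsupported runtime kind %q", job.RuntimeRef.Kind)
	}
}
```

Under these assumptions, a TrainJob whose reference names a ClusterTrainingRuntime passes validation, while one that names any other kind or leaves the name empty is rejected. In the layout above, the types would belong under /pkg/apis/kubeflow.org/v2alpha1 and the check under /pkg/webhooks/trainjob_webhook.go.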

/assign @andreyvelich
/area api

tenzen-y commented 3 months ago

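Could you split this into separate PRs, one to create the skeleton v2 manager and one for the API changes? It is better to define the APIs in a dedicated PR for easy tracking.

The 1st issue and PR are responsible for setting up the skeleton v2 manager:

  • /cmd/training-operator.v2alpha1/Dockerfile
  • /cmd/training-operator.v2alpha1/main.go

The 2nd issue and PR are responsible for adding the APIs:

  • /pkg/apis/kubeflow.org/v2alpha1

The implementations below should be treated as separate issues:

  • /pkg/controller.v2/trainjob_controller.go
  • /pkg/webhooks/trainjob_webhook.go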

andreyvelich commented 3 months ago

Could you split this into separate PRs, one to create the skeleton v2 manager and one for the API changes? It is better to define the APIs in a dedicated PR for easy tracking.

The 1st issue and PR are responsible for setting up the skeleton v2 manager:

  • /cmd/training-operator.v2alpha1/Dockerfile
  • /cmd/training-operator.v2alpha1/main.go

The 2nd issue and PR are responsible for adding the APIs:

  • /pkg/apis/kubeflow.org/v2alpha1

The implementations below should be treated as separate issues:

  • /pkg/controller.v2/trainjob_controller.go
  • /pkg/webhooks/trainjob_webhook.go

Sure, that makes sense! Will do that.