Closed andreyvelich closed 2 months ago
@danielvegamyhre
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/apis/kubeflow.org/v2alpha1/trainingruntime_types.go | 0 | 3 | 0.0% |
| pkg/apis/kubeflow.org/v2alpha1/trainjob_types.go | 0 | 3 | 0.0% |
| pkg/apis/kubeflow.org/v2alpha1/zz_generated.deepcopy.go | 0 | 616 | 0.0% |
<!-- | Total: | 0 | 622 | 0.0% | -->

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller.v1/mpi/mpijob.go | 1 | 91.06% |
<!-- | Total: | 1 | -->

| Totals | |
|---|---|
| Change from base Build 10512072223: | -1.7% |
| Covered Lines: | 3950 |
| Relevant Lines: | 12421 |
@andreyvelich First of all, could you generate / create `register.go`, deepcopygen, and so on with controller-tools?
Without those functions, we cannot use the API in the controllers/webhooks.
I registered the APIs with the scheme and added deepcopygen via controller-gen. I think we can add the defaulters and other parameters required for clients, listers, and informers in follow-up PRs. Does it look good, @tenzen-y?
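For readers unfamiliar with what deepcopygen provides, here is a minimal, stdlib-only sketch of the kind of `DeepCopy` / `DeepCopyInto` methods that end up in `zz_generated.deepcopy.go`. The types below are simplified, hypothetical stand-ins (the `Labels` field in particular is an assumption for illustration), not the actual v2alpha1 API, and the real generated code may differ in detail.

```go
package main

import "fmt"

// MLPolicy is a simplified, hypothetical stand-in for the real API type.
type MLPolicy struct {
	NumNodes *int32
}

// TrainJobSpec is a simplified, hypothetical stand-in for the real API type.
// The Labels field is an assumption added to show map copying.
type TrainJobSpec struct {
	Labels   map[string]string
	MLPolicy *MLPolicy
}

// DeepCopyInto copies the receiver into out, allocating new memory for pointers.
func (in *MLPolicy) DeepCopyInto(out *MLPolicy) {
	*out = *in
	if in.NumNodes != nil {
		out.NumNodes = new(int32)
		*out.NumNodes = *in.NumNodes
	}
}

// DeepCopy returns a copy that shares no memory with the receiver.
func (in *MLPolicy) DeepCopy() *MLPolicy {
	if in == nil {
		return nil
	}
	out := new(MLPolicy)
	in.DeepCopyInto(out)
	return out
}

// DeepCopyInto copies the receiver into out, duplicating the map and nested pointer.
func (in *TrainJobSpec) DeepCopyInto(out *TrainJobSpec) {
	*out = *in
	if in.Labels != nil {
		out.Labels = make(map[string]string, len(in.Labels))
		for k, v := range in.Labels {
			out.Labels[k] = v
		}
	}
	if in.MLPolicy != nil {
		out.MLPolicy = in.MLPolicy.DeepCopy()
	}
}

// DeepCopy returns a copy that shares no memory with the receiver.
func (in *TrainJobSpec) DeepCopy() *TrainJobSpec {
	if in == nil {
		return nil
	}
	out := new(TrainJobSpec)
	in.DeepCopyInto(out)
	return out
}

func main() {
	nodes := int32(2)
	orig := &TrainJobSpec{
		Labels:   map[string]string{"team": "ml"},
		MLPolicy: &MLPolicy{NumNodes: &nodes},
	}
	cp := orig.DeepCopy()
	// Mutating the copy must not affect the original.
	*cp.MLPolicy.NumNodes = 8
	cp.Labels["team"] = "infra"
	fmt.Println(*orig.MLPolicy.NumNodes, orig.Labels["team"])
}
```

Controllers and webhooks typically work on copies of cached objects, which is why these methods need to exist before the API can be used there.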
That sounds good to me. We can create a separate issue "KEP-2170: Provide client-go library for TrainJob and TrainingRuntime".
Created: https://github.com/kubeflow/training-operator/issues/2224
Are there any other comments before we merge this PR and start working on the controller implementation?
/assign @shravan-achar @tenzen-y @kannon92 @kuizhiqing @terrytangyuan @johnugeorge
/hold cancel
@andreyvelich: GitHub didn't allow me to assign the following users: shravan-achar, kannon92.
Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
LGTM on my end.
We had some discussions with @tenzen-y about the APIs, and we proposed the following changes:
- Define `JobSetTemplateSpec` under the `TrainingRuntimeSpec` API. For some use cases, Platform Engineers might want to add custom labels and annotations to the JobSet for various features, such as `alpha.jobset.sigs.k8s.io/exclusive-topology` or `alpha.jobset.sigs.k8s.io/node-selector`: https://github.com/kubernetes/website/pull/47383. We don't want to define a custom propagation mechanism from TrainingRuntime metadata to the JobSet, since it is not straightforward.
- Move `numNodes` under `MLSpec`.

We are still debating between `MLSpec` and `MLPolicy` as the API name. Any thoughts @kannon92 @kubeflow/wg-training-leads @kuizhiqing @shravan-achar @vsoch ?
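To illustrate the first proposal, here is a rough, stdlib-only sketch of how a template with its own metadata removes the need for a custom propagation mechanism: labels and annotations set on the template are copied onto the JobSet directly. All type names other than `JobSetTemplateSpec` and `TrainingRuntimeSpec`, every field name, the `BuildJobSet` helper, and the annotation value are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// ObjectMeta is a trimmed-down, hypothetical stand-in for Kubernetes object metadata.
type ObjectMeta struct {
	Labels      map[string]string
	Annotations map[string]string
}

// JobSetTemplateSpec pairs metadata with the JobSet spec.
// The spec is reduced to a string here to keep the sketch small.
type JobSetTemplateSpec struct {
	Metadata ObjectMeta
	Spec     string
}

// TrainingRuntimeSpec embeds the template (field name is an assumption).
type TrainingRuntimeSpec struct {
	Template JobSetTemplateSpec
}

// JobSet is a hypothetical stand-in for the object a controller would create.
type JobSet struct {
	Name     string
	Metadata ObjectMeta
	Spec     string
}

// copyMap returns an independent copy of a string map (nil stays nil).
func copyMap(in map[string]string) map[string]string {
	if in == nil {
		return nil
	}
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// BuildJobSet copies the template metadata verbatim onto the JobSet,
// so nothing has to be inferred from the TrainingRuntime's own metadata.
func BuildJobSet(name string, rt TrainingRuntimeSpec) JobSet {
	return JobSet{
		Name: name,
		Metadata: ObjectMeta{
			Labels:      copyMap(rt.Template.Metadata.Labels),
			Annotations: copyMap(rt.Template.Metadata.Annotations),
		},
		Spec: rt.Template.Spec,
	}
}

func main() {
	rt := TrainingRuntimeSpec{Template: JobSetTemplateSpec{
		Metadata: ObjectMeta{Annotations: map[string]string{
			// The value is illustrative only.
			"alpha.jobset.sigs.k8s.io/exclusive-topology": "topology.kubernetes.io/zone",
		}},
		Spec: "replicatedJobs: ...",
	}}
	js := BuildJobSet("example", rt)
	fmt.Println(js.Metadata.Annotations["alpha.jobset.sigs.k8s.io/exclusive-topology"])
}
```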
/rerun-all
We made the final changes with @tenzen-y:
- Rename `MLSpec` to `MLPolicy`, since `spec` usually represents another Kubernetes resource that will be deployed.
- Rename `PodGroupSpec` to `PodGroupPolicy` for the same reason, and expose the supported schedulers (coscheduling, Volcano, or YuniKorn) as a one-of in the API.

If we don't have any follow-up suggestions, we can merge it.
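As a rough illustration of the scheduler one-of, the sketch below models `PodGroupPolicy` with one optional pointer per scheduler and a validation helper that rejects more than one being set. The per-scheduler type names, the field names, the `Validate` method, and the error value are all hypothetical; the actual API in this PR may be shaped differently, and real validation would more likely live in a webhook or CRD validation rules.

```go
package main

import (
	"errors"
	"fmt"
)

// Per-scheduler settings; fields are omitted because they are not described in the thread.
type CoschedulingPolicy struct{}
type VolcanoPolicy struct{}
type YuniKornPolicy struct{}

// PodGroupPolicy holds at most one scheduler-specific policy (hypothetical layout).
type PodGroupPolicy struct {
	Coscheduling *CoschedulingPolicy
	Volcano      *VolcanoPolicy
	YuniKorn     *YuniKornPolicy
}

// ErrMultipleSchedulers is returned when more than one scheduler is configured.
var ErrMultipleSchedulers = errors.New("at most one scheduler may be set in PodGroupPolicy")

// Validate enforces the one-of constraint; a nil or empty policy is allowed.
func (p *PodGroupPolicy) Validate() error {
	if p == nil {
		return nil
	}
	set := 0
	if p.Coscheduling != nil {
		set++
	}
	if p.Volcano != nil {
		set++
	}
	if p.YuniKorn != nil {
		set++
	}
	if set > 1 {
		return ErrMultipleSchedulers
	}
	return nil
}

func main() {
	ok := &PodGroupPolicy{Volcano: &VolcanoPolicy{}}
	bad := &PodGroupPolicy{Volcano: &VolcanoPolicy{}, YuniKorn: &YuniKornPolicy{}}
	fmt.Println("single scheduler:", ok.Validate())
	fmt.Println("two schedulers:", bad.Validate())
}
```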
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: tenzen-y
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/hold
/lgtm
Fixes: https://github.com/kubeflow/training-operator/issues/2206
I added APIs for TrainJob, TrainingRuntime, and ClusterTrainingRuntime resources.
/assign @kubeflow/wg-training-leads @kannon92 @mimowo @vsoch @ahg-g @kuizhiqing @alculquicondor @zw0610 @franciscojavierarceo @shravan-achar
/hold for review.