kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

KEP-2170: Replace UPSERT operation for the objects with SSA PATCH #2297

Open tenzen-y opened 1 week ago

tenzen-y commented 1 week ago

What you would like to be added?

Currently, the TrainJob reconciler does UPSERT operations to create or update objects.

https://github.com/kubeflow/training-operator/blob/9e04bdd74920cbe12ecc786010d80fb8986f27bd/pkg/controller.v2/trainjob_controller.go#L85-L105

Performing only a Server-Side Apply (SSA) PATCH operation for both CREATE and UPDATE would be great.

Why is this needed?

Eliminating UPSERT operations could reduce the complexity of the object-handling mechanism and improve performance by reducing API calls.
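To make the API-call saving concrete, here is a minimal, illustrative sketch (a toy in-memory "API server", not the actual reconciler code) contrasting the two flows. In real controller-runtime code the apply path is roughly `c.Patch(ctx, obj, client.Apply, client.FieldOwner("trainjob-controller"), client.ForceOwnership)`; the names `store`, `upsert`, and `apply` below are invented for illustration.

```go
package main

import "fmt"

// store is a toy stand-in for the Kubernetes API server, counting requests.
type store struct {
	objects map[string]string
	calls   int
}

// upsert mirrors the current reconciler flow: Get, then Create or Update.
func (s *store) upsert(key, desired string) {
	s.calls++ // GET to check existence
	if _, ok := s.objects[key]; !ok {
		s.calls++ // CREATE
	} else {
		s.calls++ // UPDATE
	}
	s.objects[key] = desired
}

// apply mirrors SSA: a single PATCH that creates or updates in one call.
func (s *store) apply(key, desired string) {
	s.calls++ // PATCH (server-side apply)
	s.objects[key] = desired
}

func main() {
	a := &store{objects: map[string]string{}}
	b := &store{objects: map[string]string{}}
	a.upsert("jobset", "v1")
	a.upsert("jobset", "v2")
	b.apply("jobset", "v1")
	b.apply("jobset", "v2")
	fmt.Println(a.calls, b.calls) // upsert: 4 calls, apply: 2 calls
}
```

The same create-or-update intent costs half the requests, and drops the read-before-write branch from the reconciler.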

Love this feature?

Give it a 👍 We prioritize the features with the most 👍

tenzen-y commented 1 week ago

I'm wondering if @varshaprasad96 has knowledge of this area and is interested in this enhancement.

tenzen-y commented 1 week ago

/remove-label lifecycle/needs-triage

varshaprasad96 commented 1 week ago

@tenzen-y Sure, I can take this up. SSA would help in managing conflicts better.

varshaprasad96 commented 1 week ago

/assign

tenzen-y commented 6 days ago

> @tenzen-y Sure, I can take this up. SSA would help in managing conflicts better.

Thank you for taking this issue.

One question: I'm curious whether we should take a similar approach to cluster-api, so that we can handle the client.Object directly and reduce unneeded reconciliation. For now, though, I'm not clear on which fields we should drop or add: https://github.com/kubernetes-sigs/cluster-api/blob/578b70f79659003a005f390cc022cf17f151cebc/internal/util/ssa/patch.go#L64

varshaprasad96 commented 6 days ago

> I'm curious if we should take a similar approach as cluster-api so that we can directly handle the client.Object and reduce unneeded reconciliation.

IIUC, CAPI uses dry run and a cache specifically for SSA to determine whether an update request needs to be sent, by calculating the diff between the server state and the expected state. This definitely has benefits, especially since it is used in places where multiple controllers act on the same object, so the number of reconcile calls is indeed a load on the API server (I'm not super familiar with the CAPI controllers' implementation, so feel free to correct me if I'm wrong).
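The caching idea described above can be sketched as follows. This is a deliberately simplified illustration, not CAPI's actual implementation (which caches the result of a dry-run Apply); the names `ssaCache`, `hashObject`, and `shouldApply` are invented here. The core point is that an Apply request whose desired state is identical to the last one sent can be skipped entirely.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// hashObject returns a stable fingerprint of an object's desired state.
// encoding/json sorts map keys, so equal intents produce equal hashes.
func hashObject(obj map[string]any) string {
	b, _ := json.Marshal(obj)
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// ssaCache maps an object key to the hash of the last applied intent.
type ssaCache map[string]string

// shouldApply reports whether an SSA Apply request needs to be sent,
// and records the intent when it does.
func (c ssaCache) shouldApply(key string, desired map[string]any) bool {
	h := hashObject(desired)
	if c[key] == h {
		return false // identical intent already applied; skip the API call
	}
	c[key] = h
	return true
}

func main() {
	cache := ssaCache{}
	desired := map[string]any{"spec": map[string]any{"replicas": 2}}
	fmt.Println(cache.shouldApply("default/trainjob-sample", desired)) // true
	fmt.Println(cache.shouldApply("default/trainjob-sample", desired)) // false
}
```

A hash-only cache like this cannot detect drift introduced by other field managers, which is why CAPI pairs the cache with a server dry run; that gap is exactly what makes the full mechanism more involved.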

I'm curious whether reconciliations on TrainJob are going to be frequent enough that we need to implement this in the first iteration (given they are mostly batch workloads). How about we just use SSA with ApplyConfigurations for now, and proceed to implementing a caching layer in follow-ups if frequent reconciliation turns out to be an issue? Meanwhile, I would also be curious in general how the caching metrics for the SSA cache turn out in CAPI (https://github.com/kubernetes-sigs/cluster-api/issues/10527).

tenzen-y commented 5 days ago

> IIUC, CAPI uses dry run and a cache specifically for SSA to determine if update request is to be sent by calculating the diff b/w server and expected state.

That matches my understanding of the CAPI SSA PATCH mechanism.

> How about we just use SSA with ApplyConfigurations for now, and proceed on to implementing a caching layer in follow ups if frequent reconciliation is turning out to be an issue.

Actually, the dry-run calculation mechanism could help us mitigate conflicts, since a set of resources (client.Object) such as JobSet is of interest to both the TrainJob controllers and the JobSet controllers. However, I agree that we should start with the simplified ApplyConfiguration mechanism and postpone the SSA cache mechanism to a future iteration.