andrewsykim opened 3 weeks ago
The immediate fix we could do is bump the kuberay dependency in Kueue to a version that includes all the new fields, but this doesn't seem like a scalable approach. The defaulting behavior seems a bit unusual to me; maybe it's something specific to controller-runtime webhooks?
In v0.8 we fixed the job reconcilers to use Patch instead of Update to avoid this problem https://github.com/kubernetes-sigs/kueue/pull/2501
But we didn't check the webhooks, so that's probably the problem.
No, it doesn't look like it's the defaulter, because it's also using Patch https://github.com/kubernetes-sigs/controller-runtime/blob/e6c3d139d2b6c286b1dbba6b6a95919159cfe655/pkg/webhook/admission/defaulter_custom.go#L88-L93
Looking at GKE audit logs, the only API updates come from the ray operator. Thus, the only possible culprit is the webhook.
Thinking about this further: yes, the webhook uses Patch, which is good. However, the patch is built from the difference between the raw object (which still has the field) and the marshalled object (which doesn't):
`PatchResponseFromRaw(req.Object.Raw, marshalled)`
So it's a bug in controller-runtime.
I'll try to fix it there, but while waiting for a release, our only option is to update the KubeRay API dependency in Kueue.
/assign @alculquicondor
As a short-term fix should we update kuberay version to v1.2? cc @astefanutti
We already did this in https://github.com/kubernetes-sigs/kueue/pull/2960.
Ah I missed that, thank you!
What happened:
When using Kueue with KubeRay, new fields in KubeRay that are not recognized by Kueue are dropped during defaulting. Here's an example using Kind and RayJob:
Create cluster:
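(The exact command wasn't captured in this copy of the issue; a minimal sketch using kind with default settings:)

```shell
kind create cluster
```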
Install Kueue:
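(Command not captured here; a sketch following Kueue's published release manifests, using v0.8.0 to match the environment section below:)

```shell
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.8.0/manifests.yaml
```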
Install latest release candidate of KubeRay:
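(Command not captured here; a sketch using the KubeRay Helm chart — the exact rc tag is illustrative:)

```shell
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.0-rc.0
```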
Deploy Kueue resources:
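(Manifest not captured here; a minimal single-queue setup sketch — all names and quotas are illustrative:)

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8
      - name: "memory"
        nominalQuota: 16Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: default
spec:
  clusterQueue: cluster-queue
```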
Create a RayJob with a reference to the local queue:
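(Manifest not captured here; an illustrative RayJob sketch — the `kueue.x-k8s.io/queue-name` label hands the job to Kueue, and the name, image, and entrypoint are placeholders:)

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  backoffLimit: 2   # new in KubeRay v1.2.0; dropped by Kueue's webhook
  entrypoint: python -c "print('hello')"
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
```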
Note the RayJob sets a new field introduced in v1.2.0, `spec.backoffLimit`. In this example it is set to 2. However, Kueue is defaulting the field to 0.

What you expected to happen:
I believe we saw similar behavior while transitioning from KubeRay v0.6.0 to v1.0.0, but I thought it was specific to the v1alpha1 -> v1 upgrade, when fields were deleted in v1alpha1.
I would not expect Kueue to drop new fields introduced in v1 APIs of KubeRay.
How to reproduce it (as minimally and precisely as possible):
See steps above.
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
- Kueue version (use `git describe --tags --dirty --always`): v0.8.0
- OS (e.g: `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):