kubernetes-sigs / jobset

JobSet: a k8s native API for distributed ML training and HPC workloads
https://jobset.sigs.k8s.io/
Apache License 2.0

Wait for the webhook service to be listening before advertising the Jobset replica as ready. #607

Closed: mbobrovskyi closed this issue 3 months ago

mbobrovskyi commented 3 months ago

What would you like to be added: As mentioned in the title, wait for the webhook service to be listening before advertising the JobSet replica as ready, similar to what Kueue does here.

Why is this needed: It causes flakes in the Kueue e2e tests (see https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/2408/pull-kueue-test-e2e-main-1-28/1802723422271180800). To fix it, we need to wait for the JobSet operator to be ready and be sure that its webhooks are serving successfully.

mbobrovskyi commented 3 months ago

cc @mimowo @alculquicondor

mimowo commented 3 months ago

+1. In Kueue, we use readiness probes to delay marking the deployment as available until the webhook service is ready.

This allows users to wait for the webhooks service by waiting for availability of the deployment.

We use this mechanism in the Kueue e2e tests here, but users can also check it conveniently with kubectl wait --for=condition=available.

kannon92 commented 3 months ago

Sounds very sane. @mbobrovskyi would you want to contribute a patch?

mbobrovskyi commented 3 months ago

Yes.

/assign

danielvegamyhre commented 3 months ago

Makes sense, thanks for working on this @mbobrovskyi