kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0
1.44k stars 261 forks source link

Surface errors when a PodTemplate or a ProvReq is invalid #3025

Closed alculquicondor closed 1 month ago

alculquicondor commented 2 months ago

What would you like to be added:

When a ProvReq or PodTemplate is rejected by a webhook, Kueue just logs the error and continues retrying. These errors will not be visible to end-users and they might just interpret them as "kueue is stuck". We should communicate these errors in the Workload object, maybe even produce an event?

Why is this needed:

A cloud provider could have a webhook to validate PodTemplates created for ProvisioningRequests.

These errors need to be surfaced to users so they can fix any problem about them.

Completion requirements:

This enhancement requires the following artifacts:

The artifacts should be linked in subsequent comments.

kannon92 commented 2 months ago

In jobset, we did look at https://github.com/kubernetes-sigs/kubectl-validate to help with validation of these fields.

Not sure if this would be helpful here as you could try creating these objects and if they fail they you bubble of the error as a condition or event.

alculquicondor commented 2 months ago

The error could come from a webhook, for which kubectl-validate wouldn't help.

mimowo commented 2 months ago

+1 for the feature

The starting point could be to record any ProvReq creation errors here, or a level up errors in the check's message.

IrvingMg commented 2 months ago

/assign

mimowo commented 2 months ago

I see the event PR, but wondering if this is enough.

In particular, events are temporary objects, so it is not clear if an admin would notice them. OTOH the ProvReq creation is most likely re-attempted, so the event will be generated continuously, so should be easy to notice.

We could also explore the option of exposing the error as a status in the Message field in the AdmissionCheckState: https://github.com/kubernetes-sigs/kueue/blob/e971646642858835bf7050a514205155b7929a82/apis/kueue/v1beta1/workload_types.go#L249.

Any opinions on that?

alculquicondor commented 2 months ago

Yes, a status message is more important. An event is nice-to-have.