Closed alculquicondor closed 1 month ago
In jobset, we did look at https://github.com/kubernetes-sigs/kubectl-validate to help with validation of these fields.
Not sure if this would be helpful here as you could try creating these objects and if they fail they you bubble of the error as a condition or event.
The error could come from a webhook, for which kubectl-validate wouldn't help.
+1 for the feature
The starting point could be to record any ProvReq creation errors here, or a level up errors in the check's message.
/assign
I see the event PR, but wondering if this is enough.
In particular, events are temporary objects, so it is not clear if an admin would notice them. OTOH the ProvReq creation is most likely re-attempted, so the event will be generated continuously, so should be easy to notice.
We could also explore the option of exposing the error as a status in the Message
field in the AdmissionCheckState: https://github.com/kubernetes-sigs/kueue/blob/e971646642858835bf7050a514205155b7929a82/apis/kueue/v1beta1/workload_types.go#L249.
Any opinions on that?
Yes, a status message is more important. An event is nice-to-have.
What would you like to be added:
When a ProvReq or PodTemplate is rejected by a webhook, Kueue just logs the error and continues retrying. These errors will not be visible to end-users and they might just interpret them as "kueue is stuck". We should communicate these errors in the Workload object, maybe even produce an event?
Why is this needed:
A cloud provider could have a webhook to validate PodTemplates created for ProvisioningRequests.
These errors need to be surfaced to users so they can fix any problem about them.
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.