redhat-na-ssa / demo-ai-gitops-catalog

A catalog for all things GitOps for AI on OpenShift
MIT License

GPU MachineSet job fails silently when aws-creds secret does not exist #75

Closed strangiato closed 4 months ago

strangiato commented 5 months ago

The GPU MachineSet job checks whether the secret aws-creds exists on this line:

https://github.com/redhat-na-ssa/demo-ai-gitops-catalog/blob/96fda70cf3628f5a7fac6031ca977d5a41dd1176/components/operators/gpu-operator-certified/instance/base/aws/setup-machineset.yaml#L77

If it does not exist, the job immediately finishes with a "success" status with no logs or error messages.

If the secret doesn't exist, it would be nice if the job failed and logged an error message saying so.

@codekow your bash is a bit too clever for me to wrap my head around at the end of the day and propose a solution.

codekow commented 5 months ago

Yeah, the bash in that pod is copied from the library of functions.

The job exits zero because it should still complete on a bare metal cluster if it accidentally gets applied there. This can definitely be improved.

So you want it to log that it didn't set anything up (because the secret was missing) but still exit 0?
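
For illustration, the behavior being discussed (log a clear message when the secret is missing, and optionally fail) could look roughly like the sketch below. This is not the code from the repository: the function names `secret_exists` and `run_setup`, the `STRICT` switch, and the `kube-system` namespace are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the job script from the repo.
# Assumed names: SECRET_NAME, SECRET_NAMESPACE, STRICT, secret_exists, run_setup.

SECRET_NAME="${SECRET_NAME:-aws-creds}"
SECRET_NAMESPACE="${SECRET_NAMESPACE:-kube-system}"  # assumption: where the secret lives

# Returns 0 if the secret exists, non-zero otherwise (including when oc fails).
secret_exists() {
  oc get secret "${SECRET_NAME}" -n "${SECRET_NAMESPACE}" >/dev/null 2>&1
}

# Logs what happened instead of finishing silently.
# STRICT=1 makes a missing secret a failure; the default (0) keeps the
# job succeeding, e.g. when it is applied to a bare metal cluster.
run_setup() {
  if ! secret_exists; then
    echo "WARN: secret ${SECRET_NAMESPACE}/${SECRET_NAME} not found; skipping GPU MachineSet setup" >&2
    if [ "${STRICT:-0}" = "1" ]; then
      return 1
    fi
    return 0
  fi

  echo "INFO: secret ${SECRET_NAMESPACE}/${SECRET_NAME} exists; continuing with GPU MachineSet setup"
  # The actual MachineSet creation would go here.
  return 0
}
```

A job script would call `run_setup` as its last command so the function's return code becomes the job's exit status. With `STRICT` left unset, a missing secret still produces a warning in the pod logs but the job completes successfully.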

codekow commented 4 months ago

I believe this has been addressed in the updates done previously. Going to close.