ocp 3 console pods do not start

wrenkredhat2 commented 5 months ago

Hello all,

Due to the installation of a Volcano (volcano.sh) instance, ocp3 became inoperative.

The obvious symptom is that the login page is no longer available.

The issue arose when the Volcano instance was uninstalled: the uninstall did not remove the validating and mutating webhooks, while the services associated with the processing were removed.

I deleted those manually, however they still seem to reside in memory.

Therefore I believe the master nodes have to be gracefully restarted, or at least some pods:

The current error message is:

m18s Warning FailedCreate replicaset/oauth-openshift-7b67db7d95 Error creating: Internal error occurred: failed calling webhook "validatepod.volcano.sh": failed to call webhook: Post "https://volcano-admission-service.wrenk-volcano-system.svc:443/pods/validate?timeout=10s": service "volcano-admission-service" not found

I deleted those webhooks but the cluster still asks for the services to exist.

I believe this applies to all pods for now.
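To check whether any webhook configuration still points at a removed service, something like the following sketch could help. It is not taken from this thread's actual fix. It assumes the JSON from `oc get validatingwebhookconfigurations -o json` (or the mutating equivalent) has already been parsed into a dict, and that the set of existing (namespace, service) pairs was collected separately. The configuration name `volcano-admission-service-pods-validate` and the `example-*` entries are hypothetical sample data; only the webhook name, service name, and namespace come from the error message above.

```python
def find_dangling_webhooks(config_list, existing_services):
    """Return (config name, webhook name, "namespace/service") for every
    webhook whose backing Service is not in existing_services."""
    dangling = []
    for config in config_list.get("items", []):
        config_name = config["metadata"]["name"]
        for hook in config.get("webhooks", []):
            svc = hook.get("clientConfig", {}).get("service")
            if svc is None:
                continue  # URL-based webhook, no in-cluster Service to check
            key = (svc["namespace"], svc["name"])
            if key not in existing_services:
                dangling.append((config_name, hook["name"], f"{key[0]}/{key[1]}"))
    return dangling


# Hypothetical sample data shaped like the output of
# `oc get validatingwebhookconfigurations -o json`.
sample_configs = {
    "items": [
        {
            "metadata": {"name": "volcano-admission-service-pods-validate"},  # hypothetical name
            "webhooks": [
                {
                    "name": "validatepod.volcano.sh",
                    "clientConfig": {
                        "service": {
                            "namespace": "wrenk-volcano-system",
                            "name": "volcano-admission-service",
                        }
                    },
                }
            ],
        },
        {
            "metadata": {"name": "example-healthy-webhook"},  # hypothetical name
            "webhooks": [
                {
                    "name": "check.example.com",
                    "clientConfig": {
                        "service": {"namespace": "example-ns", "name": "example-svc"}
                    },
                }
            ],
        },
    ]
}

# Services that still exist in the cluster (hypothetical).
existing_services = {("example-ns", "example-svc")}

dangling = find_dangling_webhooks(sample_configs, existing_services)
for config_name, hook_name, service in dangling:
    print(f"{config_name}: webhook {hook_name} -> missing service {service}")
```

Any entry reported this way is a candidate for the "failed calling webhook" error, because the API server still tries to call the webhook on every matching request.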

The whole story can be found at: https://redhat-internal.slack.com/archives/C04J8QF8Y83/p1706100817959359

I want to apologize for this situation. I should have tested this in a sandbox environment beforehand.

Please reconcile this, as I do not know exactly how to do it and I do not want to cause more harm.

Thank you!

Wolfgang

DanielFroehlich commented 5 months ago

I rebooted the master nodes with no luck. It seems that no pods can be started because the admission hook is not responding. @rbo, we once again need your help. The kubeconfig for client-auth access is on stormshiftdeploy in the usual dir.

github-actions[bot] commented 5 months ago

Heads up @cluster/ocp3-admin - the "cluster/ocp3" label was applied to this issue.

wrenkredhat2 commented 5 months ago

I deleted the very last volcano-admission-service and am now waiting for the scheduler to reconcile automatically. If not, I'll restart the masters gracefully. Is this OK?

wrenkredhat2 commented 5 months ago

Now the console seems to be starting:

4s Normal SuccessfulCreate replicaset/oauth-openshift-5bb5f4f579 Created pod: oauth-openshift-5bb5f4f579-w6mp7
4s Normal SuccessfulCreate replicaset/oauth-openshift-5bb5f4f579 Created pod: oauth-openshift-5bb5f4f579-x8hj8

DanielFroehlich commented 5 months ago

Sure, the cluster is broken, feel free to restart the masters as you like.

wrenkredhat2 commented 5 months ago

The problem is fixed now. How? I deleted all validatingwebhookconfigurations and admissionwebhookconfigs starting with "volcano-....". As this did not automatically reconcile the cluster, I went on to restart the masters, carefully, one by one, following https://access.redhat.com/solutions/6089061. After a few minutes the console login worked, the admin login worked, and the event log seems to be normal.
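For anyone repeating the cleanup, the prefix-based deletion described above could be prepared with a sketch like this one. It is only an illustration, not the exact procedure used here. It assumes the configuration names were already listed (for example with `oc get validatingwebhookconfigurations -o name`), and it only prints `oc delete` commands for review instead of running them. All names in the sample data are hypothetical.

```python
def build_delete_commands(names_by_kind, prefix="volcano-"):
    """Build `oc delete` commands for every configuration whose name
    starts with the given prefix. Nothing is executed here."""
    commands = []
    for kind, names in names_by_kind.items():
        for name in sorted(names):
            if name.startswith(prefix):
                commands.append(f"oc delete {kind} {name}")
    return commands


# Hypothetical names; a real cluster will have different ones.
names_by_kind = {
    "validatingwebhookconfigurations": [
        "volcano-example-validate-b",
        "example-other-webhook",
        "volcano-example-validate-a",
    ],
    "mutatingwebhookconfigurations": [
        "volcano-example-mutate",
    ],
}

commands = build_delete_commands(names_by_kind)
for command in commands:
    print(command)
```

Printing the commands first keeps a human in the loop, which seems sensible on a cluster that is already broken; configurations that do not match the prefix are left untouched.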

stormshift / support

ocp 3 console pods do not start #153