Closed tumido closed 3 years ago
It would be neat if there were a way to know in advance whether a change will cause such an action. In this case it was triggered by a simple secret update. Cause: https://github.com/operate-first/apps/pull/753
KB article: https://access.redhat.com/solutions/4902871
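One way to approximate the "know in advance" idea is a small pre-merge check that flags manifests whose change is likely to make nodes drain and reboot. The sketch below is only an illustration, not something that exists in the apps repo: the function names are hypothetical, and the list of disruptive objects is an assumption (MachineConfig-style resources plus the global pull secret `openshift-config/pull-secret`), not an authoritative list.

```python
# Hypothetical pre-merge check: flag manifests whose change may trigger a
# node drain/reboot. The entries below are assumptions for illustration,
# not an exhaustive or authoritative list.

# Kinds handled by the Machine Config Operator (assumed disruptive).
DISRUPTIVE_KINDS = {"MachineConfig", "KubeletConfig", "ContainerRuntimeConfig"}

# Specific objects, as (kind, namespace, name). Assumes the secret update in
# question targets the cluster-wide pull secret.
DISRUPTIVE_OBJECTS = {("Secret", "openshift-config", "pull-secret")}


def is_disruptive(manifest: dict) -> bool:
    """Return True if changing this manifest may cause nodes to reboot."""
    kind = manifest.get("kind", "")
    meta = manifest.get("metadata") or {}
    key = (kind, meta.get("namespace", ""), meta.get("name", ""))
    return kind in DISRUPTIVE_KINDS or key in DISRUPTIVE_OBJECTS


def disruptive_changes(manifests: list) -> list:
    """Return 'Kind/name' labels for every flagged manifest."""
    flagged = []
    for manifest in manifests:
        if is_disruptive(manifest):
            name = (manifest.get("metadata") or {}).get("name")
            flagged.append(f"{manifest.get('kind')}/{name}")
    return flagged
```

A CI job could parse the manifests touched by a PR into dicts, pass them to `disruptive_changes`, and post a warning comment when the result is non-empty.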
MOC Infra: 2/3 nodes restarted. ACM is running again.
MOC Infra: fully restarted and ready. ACM started an auto update from 2.2.1 to 2.2.3, which delayed config propagation to the other clusters. Waiting for it.
Pods seem to be stuck in Pending on the Infra cluster:
Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
After the ACM upgrade, the management-ingress- pods are complaining:
2021/06/15 15:02:47 reverseproxy.go:437: http: proxy error: x509: certificate is valid for multicloud-console.apps.moc-infra.massopen.cloud, not localhost
This makes https://multicloud-console.apps.moc-infra.massopen.cloud/ unresponsive.
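To make the proxy error concrete: the message says the hostname being verified was `localhost`, while the certificate only lists the console route hostname. Below is a minimal sketch of that name check. It is a simplified illustration, not the actual Go code behind `reverseproxy.go`, and the helper name `hostname_matches` is hypothetical.

```python
from typing import Iterable


def hostname_matches(hostname: str, cert_names: Iterable[str]) -> bool:
    """Simplified TLS hostname check against a certificate's DNS names.

    Supports exact matches and a wildcard in the left-most label only.
    """
    host = hostname.lower().rstrip(".")
    for name in cert_names:
        pattern = name.lower().rstrip(".")
        if pattern == host:
            return True
        if pattern.startswith("*."):
            suffix = pattern[1:]  # e.g. ".apps.example.com"
            if host.endswith(suffix):
                prefix = host[: -len(suffix)]
                # The wildcard must cover exactly one non-empty label.
                if prefix and "." not in prefix:
                    return True
    return False


# Name taken from the error message in the pod log.
CERT_NAMES = ["multicloud-console.apps.moc-infra.massopen.cloud"]

print(hostname_matches("localhost", CERT_NAMES))
print(hostname_matches("multicloud-console.apps.moc-infra.massopen.cloud", CERT_NAMES))
```

The first call fails the check and the second passes, which mirrors the log: the certificate itself is fine for the public route, but not for a connection addressed to `localhost`.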
cc @cdoan1 any idea where that comes from?
The management-ingress deployment is generated by the operator, based on the presence of byo-ingress-tls-secret.
It seems to be related to this section of docs:
https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.0/html/security/security#replacing-the-management-ingress-certificates
@cdoan1 do you remember if you set this during the initial install of the ACM here? It seems so. What can we do to fix this?
@tumido that was probably Ilana and me rather than Chris. Let's see if we can just delete the secret.
@tumido acm is available again, although you may have to bypass an HSTS error in your browser.
Yeah, I've managed to get to the same state before as well, but considered it not a full fix, so I've reverted it before :disappointed: :smirk:
We need to get this fixed properly to make ACM fully usable again.
@tumido I think this is a full fix for this incident.
Getting an appropriate SSL certificate configured should probably be a separate issue.
Anyway, pull secrets are not propagating to the other clusters via ACM. Asking ACM more questions:
Are they even supposed to propagate to managed clusters? Or are these just used by the initial install?
we'll see :slightly_smiling_face: if not, it's an easy change on our end. I can prepare a PR "just in case".
In case ACM doesn't feel like syncing pull secrets we can just use https://github.com/operate-first/apps/pull/755
Pull secret change is propagated now. No node drain observed in 4.7 clusters, so it seems like it was only an OCP 4.6 thing.
We've recently had to apply a node config change to push a new pull secret. This causes nodes to reboot to apply the change. Change is rolling out now.
A side effect of the old pull request was that operator hub and ACM were down on the infra cluster.