operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.
GNU General Public License v3.0

[Incident][MOC][Infra/Zero/Curator] Nodes rebooting due to config change #752

Closed tumido closed 3 years ago

tumido commented 3 years ago

We've recently had to apply a node config change to push a new pull secret. This causes the nodes to reboot to apply the change. The change is rolling out now.

A side effect of the old pull request was that OperatorHub and ACM were down on the infra cluster.

tumido commented 3 years ago

It would be neat if there were a way to know in advance whether a change will cause such an action. In this case it was triggered by a simple secret update. Cause: https://github.com/operate-first/apps/pull/753

KB article: https://access.redhat.com/solutions/4902871
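For the record, the reboot shows up in the Machine Config Operator state. A sketch for watching such a rollout (standard `oc` commands; pool names vary by cluster):

```shell
# A MachineConfig/pull-secret change flips the affected MachineConfigPool to UPDATING=True;
# nodes are then cordoned, drained and rebooted one at a time.
oc get machineconfigpool

# The node currently being drained shows SchedulingDisabled in its status
oc get nodes

# Follow the rollout until all pools report UPDATED=True again
watch -n 10 'oc get machineconfigpool'
```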

tumido commented 3 years ago

MOC Infra: 2/3 nodes restarted. ACM is running again.

tumido commented 3 years ago

MOC Infra is fully restarted and ready. ACM started an auto-update from 2.2.1 to 2.2.3, which delayed config propagation to the other clusters; waiting for that to finish.
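A sketch for checking the ACM upgrade status while it runs (the `acm` namespace is taken from this cluster's install; field names assume the ACM 2.x CRDs):

```shell
# The MultiClusterHub resource reports the running ACM version and install phase
oc get multiclusterhub -n acm \
  -o jsonpath='{.items[0].status.currentVersion} {.items[0].status.phase}{"\n"}'

# The operator's CSV shows Installing/Succeeded during the 2.2.1 -> 2.2.3 bump
oc get csv -n acm | grep advanced-cluster-management
```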

tumido commented 3 years ago

Pods seem to be stuck in Pending on the Infra cluster:

```
Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
```

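That `DeadlineExceeded` sandbox error is fairly generic; a few commands for narrowing it down (a sketch; pod and namespace names are placeholders):

```shell
# List pods stuck in Pending across the cluster
oc get pods --all-namespaces --field-selector=status.phase=Pending

# The events on an affected pod usually carry the full sandbox error
oc describe pod <pod-name> -n <namespace>

# Sandbox creation timeouts often trace back to an unhealthy node or SDN pod
oc get nodes
oc get pods -n openshift-sdn
```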
tumido commented 3 years ago

After the ACM upgrade, the `management-ingress-` pods are complaining:

```
2021/06/15 15:02:47 reverseproxy.go:437: http: proxy error: x509: certificate is valid for multicloud-console.apps.moc-infra.massopen.cloud, not localhost
```

This makes https://multicloud-console.apps.moc-infra.massopen.cloud/ unresponsive.
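One way to confirm which hostnames the served certificate actually covers, using standard `openssl` tooling (a sketch; `-ext` needs OpenSSL 1.1.1+):

```shell
# Show the subject and SANs of the certificate presented by the console route
echo | openssl s_client \
    -connect multicloud-console.apps.moc-infra.massopen.cloud:443 \
    -servername multicloud-console.apps.moc-infra.massopen.cloud 2>/dev/null \
  | openssl x509 -noout -subject -ext subjectAltName
```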

tumido commented 3 years ago

cc @cdoan1 any idea where that comes from?

This is the `management-ingress` deployment generated from the operator:

```yaml
kind: Deployment apiVersion: apps/v1 metadata: annotations: deployment.kubernetes.io/revision: '1' meta.helm.sh/release-name: management-ingress-c0398 meta.helm.sh/release-namespace: acm selfLink: /apis/apps/v1/namespaces/acm/deployments/management-ingress-c0398 resourceVersion: '217548879' name: management-ingress-c0398 uid: 395e50c1-f386-4cdb-a952-abdbda3f4d28 creationTimestamp: '2021-02-20T15:22:01Z' generation: 9 managedFields: - manager: multiclusterhub-operator operation: Update apiVersion: apps/v1 time: '2021-02-20T15:22:12Z' fieldsType: FieldsV1 fieldsV1: 'f:metadata': 'f:labels': 'f:installer.name': {} 'f:installer.namespace': {} - manager: Go-http-client operation: Update apiVersion: apps/v1 time: '2021-02-23T11:44:52Z' fieldsType: FieldsV1 fieldsV1: 'f:metadata': 'f:annotations': .: {} 'f:meta.helm.sh/release-name': {} 'f:meta.helm.sh/release-namespace': {} 'f:labels': 'f:helm.sh/chart': {} 'f:app.kubernetes.io/managed-by': {} 'f:component': {} 'f:app': {} 'f:app.kubernetes.io/name': {} .: {} 'f:release': {} 'f:heritage': {} 'f:app.kubernetes.io/instance': {} 'f:ownerReferences': .: {} 'k:{"uid":"760d83be-a726-4e37-b73a-f6af6cb83704"}': .: {} 'f:apiVersion': {} 'f:blockOwnerDeletion': {} 'f:controller': {} 'f:kind': {} 'f:name': {} 'f:uid': {} 'f:spec': 'f:progressDeadlineSeconds': {} 'f:replicas': {} 'f:revisionHistoryLimit': {} 'f:selector': 'f:matchLabels': .: {} 'f:app': {} 'f:chart': {} 'f:component': {} 'f:heritage': {} 'f:k8s-app': {} 'f:release': {} 'f:strategy': 'f:rollingUpdate': .: {} 'f:maxSurge': {} 'f:maxUnavailable': {} 'f:type': {} 'f:template': 'f:metadata': 'f:annotations': .: {} 'f:productName': {} 'f:labels': 'f:helm.sh/chart': {} 'f:k8s-app': {} 'f:app.kubernetes.io/managed-by': {} 'f:component': {} 'f:chart': {} 'f:app': {} 'f:app.kubernetes.io/name': {} .: {} 'f:release': {} 'f:heritage': {} 'f:app.kubernetes.io/instance': {} 'f:spec': 'f:volumes': .: {}
'k:{"name":"ca-tls-secret"}': .: {} 'f:name': {} 'f:secret': .: {} 'f:defaultMode': {} 'f:secretName': {} 'k:{"name":"tls-secret"}': .: {} 'f:name': {} 'f:secret': .: {} 'f:defaultMode': {} 'f:containers': 'k:{"name":"management-ingress-c0398"}': 'f:volumeMounts': .: {} 'k:{"mountPath":"/var/run/secrets/tls"}': .: {} 'f:mountPath': {} 'f:name': {} 'f:terminationMessagePolicy': {} .: {} 'f:resources': .: {} 'f:requests': .: {} 'f:cpu': {} 'f:memory': {} 'f:livenessProbe': .: {} 'f:failureThreshold': {} 'f:httpGet': .: {} 'f:path': {} 'f:port': {} 'f:scheme': {} 'f:initialDelaySeconds': {} 'f:periodSeconds': {} 'f:successThreshold': {} 'f:timeoutSeconds': {} 'f:env': 'k:{"name":"ENABLE_IMPERSONATION"}': .: {} 'f:name': {} 'f:value': {} 'k:{"name":"POD_NAME"}': .: {} 'f:name': {} 'f:valueFrom': .: {} 'f:fieldRef': .: {} 'f:apiVersion': {} 'f:fieldPath': {} 'k:{"name":"ALLOWED_HOST_HEADERS"}': .: {} 'f:name': {} 'f:value': {} 'k:{"name":"FIPS_ENABLED"}': .: {} 'f:name': {} 'f:value': {} 'k:{"name":"CLUSTER_DOMAIN"}': .: {} 'f:name': {} 'f:value': {} .: {} 'k:{"name":"POD_NAMESPACE"}': .: {} 'f:name': {} 'f:valueFrom': .: {} 'f:fieldRef': .: {} 'f:apiVersion': {} 'f:fieldPath': {} 'k:{"name":"APISERVER_SECURE_PORT"}': .: {} 'f:name': {} 'f:value': {} 'k:{"name":"HOST_HEADERS_CHECK_ENABLED"}': .: {} 'f:name': {} 'f:value': {} 'f:readinessProbe': .: {} 'f:failureThreshold': {} 'f:httpGet': .: {} 'f:path': {} 'f:port': {} 'f:scheme': {} 'f:initialDelaySeconds': {} 'f:periodSeconds': {} 'f:successThreshold': {} 'f:timeoutSeconds': {} 'f:securityContext': .: {} 'f:allowPrivilegeEscalation': {} 'f:runAsNonRoot': {} 'f:terminationMessagePath': {} 'f:imagePullPolicy': {} 'f:ports': .: {} 'k:{"containerPort":8080,"protocol":"TCP"}': .: {} 'f:containerPort': {} 'f:protocol': {} 'k:{"containerPort":8443,"protocol":"TCP"}': .: {} 'f:containerPort': {} 'f:protocol': {} 'f:name': {} 'k:{"name":"oauth-proxy"}': 'f:image': {} 'f:volumeMounts': .: {} 'k:{"mountPath":"/etc/tls/ca"}': .:
{} 'f:mountPath': {} 'f:name': {} 'k:{"mountPath":"/etc/tls/private"}': .: {} 'f:mountPath': {} 'f:name': {} 'f:terminationMessagePolicy': {} .: {} 'f:resources': {} 'f:args': {} 'f:readinessProbe': .: {} 'f:failureThreshold': {} 'f:httpGet': .: {} 'f:path': {} 'f:port': {} 'f:scheme': {} 'f:periodSeconds': {} 'f:successThreshold': {} 'f:timeoutSeconds': {} 'f:securityContext': .: {} 'f:allowPrivilegeEscalation': {} 'f:terminationMessagePath': {} 'f:imagePullPolicy': {} 'f:ports': .: {} 'k:{"containerPort":9443,"protocol":"TCP"}': .: {} 'f:containerPort': {} 'f:name': {} 'f:protocol': {} 'f:name': {} 'f:dnsPolicy': {} 'f:serviceAccount': {} 'f:restartPolicy': {} 'f:schedulerName': {} 'f:terminationGracePeriodSeconds': {} 'f:serviceAccountName': {} 'f:securityContext': {} 'f:affinity': .: {} 'f:podAntiAffinity': {} - manager: jetstack-cert-manager operation: Update apiVersion: apps/v1 time: '2021-05-20T15:22:28Z' fieldsType: FieldsV1 fieldsV1: 'f:metadata': 'f:labels': 'f:certmanager.k8s.io/time-restarted': {} 'f:spec': 'f:template': 'f:metadata': 'f:labels': 'f:certmanager.k8s.io/time-restarted': {} - manager: multicluster-operators-subscription operation: Update apiVersion: apps/v1 time: '2021-06-15T13:54:21Z' fieldsType: FieldsV1 fieldsV1: 'f:metadata': 'f:labels': 'f:chart': {} 'f:spec': 'f:template': 'f:metadata': 'f:annotations': 'f:productID': {} 'f:productVersion': {} 'f:labels': 'f:ocm-antiaffinity-selector': {} 'f:spec': 'f:affinity': 'f:podAntiAffinity': 'f:preferredDuringSchedulingIgnoredDuringExecution': {} 'f:containers': 'k:{"name":"management-ingress-c0398"}': 'f:command': {} 'f:image': {} 'f:tolerations': {} 'f:volumes': 'k:{"name":"tls-secret"}': 'f:secret': 'f:secretName': {} - manager: kube-controller-manager operation: Update apiVersion: apps/v1 time: '2021-06-15T15:12:02Z' fieldsType: FieldsV1 fieldsV1: 'f:metadata': 'f:annotations': 'f:deployment.kubernetes.io/revision': {} 'f:status': 'f:availableReplicas': {} 'f:conditions': .: {}
'k:{"type":"Available"}': .: {} 'f:lastTransitionTime': {} 'f:lastUpdateTime': {} 'f:message': {} 'f:reason': {} 'f:status': {} 'f:type': {} 'k:{"type":"Progressing"}': .: {} 'f:lastTransitionTime': {} 'f:lastUpdateTime': {} 'f:message': {} 'f:reason': {} 'f:status': {} 'f:type': {} 'f:observedGeneration': {} 'f:readyReplicas': {} 'f:replicas': {} 'f:updatedReplicas': {} namespace: acm ownerReferences: - apiVersion: apps.open-cluster-management.io/v1 kind: HelmRelease name: management-ingress-c0398 uid: 760d83be-a726-4e37-b73a-f6af6cb83704 controller: true blockOwnerDeletion: true labels: app: management-ingress-c0398 app.kubernetes.io/instance: management-ingress-c0398 release: management-ingress-c0398 certmanager.k8s.io/time-restarted: 2021-5-20.1522 installer.name: multiclusterhub app.kubernetes.io/managed-by: Helm helm.sh/chart: management-ingress installer.namespace: acm app.kubernetes.io/name: management-ingress-c0398 component: management-ingress-c0398 chart: management-ingress-2.2.3 heritage: Helm spec: replicas: 2 selector: matchLabels: app: management-ingress-c0398 chart: management-ingress component: management-ingress-c0398 heritage: Helm k8s-app: management-ingress-c0398 release: management-ingress-c0398 template: metadata: creationTimestamp: null labels: app: management-ingress-c0398 app.kubernetes.io/instance: management-ingress-c0398 release: management-ingress-c0398 certmanager.k8s.io/time-restarted: 2021-5-20.1522 ocm-antiaffinity-selector: managementingress app.kubernetes.io/managed-by: Helm helm.sh/chart: management-ingress app.kubernetes.io/name: management-ingress-c0398 k8s-app: management-ingress-c0398 component: management-ingress-c0398 chart: management-ingress heritage: Helm annotations: productID: management-ingress_2.2.3_00000 productName: management-ingress productVersion: 2.2.3 spec: restartPolicy: Always serviceAccountName: management-ingress-c0398-sa schedulerName: default-scheduler affinity: podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution: - weight: 70 podAffinityTerm: labelSelector: matchExpressions: - key: ocm-antiaffinity-selector operator: In values: - managementingress topologyKey: topology.kubernetes.io/zone - weight: 35 podAffinityTerm: labelSelector: matchExpressions: - key: ocm-antiaffinity-selector operator: In values: - managementingress topologyKey: kubernetes.io/hostname terminationGracePeriodSeconds: 30 securityContext: {} containers: - resources: {} readinessProbe: httpGet: path: /oauth/healthz port: 9443 scheme: HTTPS timeoutSeconds: 1 periodSeconds: 10 successThreshold: 1 failureThreshold: 3 terminationMessagePath: /dev/termination-log name: oauth-proxy securityContext: allowPrivilegeEscalation: true ports: - name: public containerPort: 9443 protocol: TCP imagePullPolicy: Always volumeMounts: - name: tls-secret mountPath: /etc/tls/private - name: ca-tls-secret mountPath: /etc/tls/ca terminationMessagePolicy: File image: >- registry.redhat.io/openshift4/ose-oauth-proxy@sha256:3948de88df41ba184c0541146997dbfbc705e2a9489f6433fb8da2858eecd041 args: - '--provider=openshift' - '--upstream=https://localhost:8443' - '--upstream-ca=/etc/tls/ca/tls.crt' - '--https-address=:9443' - '--client-id=multicloudingress' - '--client-secret=multicloudingresssecret' - '--pass-user-bearer-token=true' - '--pass-access-token=true' - '--scope=user:full' - >- -openshift-delegate-urls={"/": {"resource": "projects", "verb": "list"}} - '--skip-provider-button=true' - '--cookie-secure=true' - '--cookie-expire=12h0m0s' - '--cookie-refresh=8h0m0s' - '--tls-cert=/etc/tls/private/tls.crt' - '--tls-key=/etc/tls/private/tls.key' - '--cookie-secret=AAECAwQFBgcICQoLDA0OFw==' - '--openshift-ca=/etc/pki/tls/cert.pem' - >- --openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt - resources: requests: cpu: 200m memory: 256Mi readinessProbe: httpGet: path: /healthz port: 8080 scheme: HTTP initialDelaySeconds: 10 timeoutSeconds: 1 periodSeconds: 10 successThreshold:
1 failureThreshold: 3 terminationMessagePath: /dev/termination-log name: management-ingress-c0398 command: - /management-ingress - '--default-ssl-certificate=$(POD_NAMESPACE)/byo-ingress-tls-secret' - '--configmap=$(POD_NAMESPACE)/management-ingress-c0398' - '--http-port=8080' - '--https-port=8443' livenessProbe: httpGet: path: /healthz port: 8080 scheme: HTTP initialDelaySeconds: 10 timeoutSeconds: 1 periodSeconds: 10 successThreshold: 1 failureThreshold: 3 env: - name: ENABLE_IMPERSONATION value: 'false' - name: APISERVER_SECURE_PORT value: '8001' - name: CLUSTER_DOMAIN value: cluster.local - name: HOST_HEADERS_CHECK_ENABLED value: 'false' - name: ALLOWED_HOST_HEADERS value: >- 127.0.0.1 localhost management-ingress-c0398 management-ingress multicloud-console.apps.moc-infra.massopen.cloud - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name - name: POD_NAMESPACE valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.namespace - name: FIPS_ENABLED value: 'false' securityContext: runAsNonRoot: true allowPrivilegeEscalation: true ports: - containerPort: 8080 protocol: TCP - containerPort: 8443 protocol: TCP imagePullPolicy: Always volumeMounts: - name: tls-secret mountPath: /var/run/secrets/tls terminationMessagePolicy: File image: >- registry.redhat.io/rhacm2/management-ingress-rhel7@sha256:9d3e5d82199d53c5270aa6760003d69f37b4b2f4acd2a67e213cd184fc3eac06 serviceAccount: management-ingress-c0398-sa volumes: - name: tls-secret secret: secretName: byo-ingress-tls-secret defaultMode: 420 - name: ca-tls-secret secret: secretName: multicloud-ca-cert defaultMode: 420 dnsPolicy: ClusterFirst tolerations: - key: dedicated operator: Exists effect: NoSchedule - key: node-role.kubernetes.io/infra operator: Exists effect: NoSchedule strategy: type: RollingUpdate rollingUpdate: maxUnavailable: 1 maxSurge: 25% revisionHistoryLimit: 10 progressDeadlineSeconds: 600 status: observedGeneration: 9 replicas: 2 updatedReplicas: 2 readyReplicas: 2
availableReplicas: 2 conditions: - type: Available status: 'True' lastUpdateTime: '2021-06-15T15:12:00Z' lastTransitionTime: '2021-06-15T15:12:00Z' reason: MinimumReplicasAvailable message: Deployment has minimum availability. - type: Progressing status: 'True' lastUpdateTime: '2021-06-15T15:12:02Z' lastTransitionTime: '2021-06-15T14:18:09Z' reason: NewReplicaSetAvailable message: >- ReplicaSet "management-ingress-c0398-7cc55b7459" has successfully progressed.
```

https://console-openshift-console.apps.moc-infra.massopen.cloud/k8s/ns/acm/deployments/management-ingress-c0398/

tumido commented 3 years ago

Based on the presence of `byo-ingress-tls-secret`, it seems to be related to this section of the docs: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.0/html/security/security#replacing-the-management-ingress-certificates

@cdoan1 do you remember whether you set this during the initial install of ACM here? It seems so. What can we do to fix this?

larsks commented 3 years ago

@tumido that was probably Ilana and me rather than Chris. Let's see if we can just delete the secret.
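A sketch of that approach, using the secret and deployment names from this thread (the KB article linked above describes the supported procedure; whether ACM regenerates a default certificate afterwards is an assumption here):

```shell
# Remove the BYO ingress certificate secret so the operator falls back to its defaults
oc delete secret byo-ingress-tls-secret -n acm

# Restart the management-ingress pods so they pick up the regenerated certificate
oc rollout restart deployment/management-ingress-c0398 -n acm
```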

larsks commented 3 years ago

@tumido acm is available again, although you may have to bypass an HSTS error in your browser.

tumido commented 3 years ago

Yeah, I had managed to get to the same state before as well, but I didn't consider it a full fix, so I reverted it :disappointed: :smirk:

We need to get this fixed properly to make ACM fully usable again.

larsks commented 3 years ago

@tumido I think this is a full fix for this incident.

Getting an appropriate SSL certificate configured should probably be a separate issue.

tumido commented 3 years ago

Anyway, pull secrets are not propagating to the other clusters via ACM. Asking ACM more questions:

https://chat.google.com/room/AAAAWskU424/NC5pE6hoHOU

larsks commented 3 years ago

Are they even supposed to propagate to managed clusters? Or are these just used by the initial install?

tumido commented 3 years ago

We'll see :slightly_smiling_face: If not, it's an easy change on our end. I can prepare a PR "just in case".

tumido commented 3 years ago

In case ACM doesn't feel like syncing pull secrets we can just use https://github.com/operate-first/apps/pull/755

tumido commented 3 years ago

The pull secret change is propagated now. No node drain was observed on the 4.7 clusters; seems like it was only an OCP 4.6 thing.
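For the record, a sketch of how the propagation can be verified on a managed cluster (standard locations for the cluster-wide pull secret):

```shell
# The cluster-wide pull secret lives in openshift-config; check it carries the new entry
oc get secret pull-secret -n openshift-config \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | python3 -m json.tool

# On 4.7 the update should apply in place, with no pools updating and no nodes cordoned
oc get machineconfigpool
oc get nodes
```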