nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

bug: obs.nerc Grafana Dashboards showing x509: certificate has expired #800

Closed schwesig closed 2 weeks ago

schwesig commented 2 weeks ago

follow up:

Motivation

Opening a dashboard in Grafana on obs.nerc, e.g. https://grafana.apps.obs.nerc.mghpcc.org/d/20241028a/ai4dd-v5?orgId=1, fails with a certificate error (x509: certificate has expired):

Completion Criteria

Dashboards in Grafana on obs.nerc open and show their data without a certificate error.

Description

Completion dates

Desired - 2024-11-06
Required - 2024-11-08

Image

/CC @schwesig @computate @RH-csaggin @jtriley @larsks

computate commented 2 weeks ago

Also, the ACM Observability metrics endpoint on the infra cluster shows a different certificate error, even though its validity dates are OK:

Image

Image

larsks commented 2 weeks ago

Starting with the second issue:

The certificate presented by https://observatorium-api-open-cluster-management-observability.apps.nerc-ocp-infra.rc.fas.harvard.edu/api/metrics/v1/default is signed by the observability-server-ca-certificate:

$ urlcert https://observatorium-api-open-cluster-management-observability.apps.nerc-ocp-infra.rc.fas.harvard.edu/api/metrics/v1/default | showcert
sha256 Fingerprint=4A:6C:A5:5C:69:D3:7D:3E:B8:EA:12:D1:5C:3B:D3:A2:AF:15:38:1C:43:5A:1C:23:BF:9E:76:86:9A:08:7A:03
subject=C=US, O=Red Hat, Inc., CN=observability-server-certificate
issuer=C=US, O=Red Hat, Inc., CN=observability-server-ca-certificate
notBefore=Aug 20 14:16:50 2024 GMT
notAfter=Aug 20 14:16:50 2025 GMT
X509v3 Subject Alternative Name:
    DNS:observability-server-certificate, DNS:observability-observatorium-api.open-cluster-management-observability.svc.cluster.local, DNS:observatorium-api-open-cluster-management-observability.apps.nerc-ocp-infra.rc.fas.harvard.edu

That CA isn't going to be trusted by anybody, hence the "certificate issuer is unknown" error. The correct fix is probably to change the corresponding route from passthrough to reencrypt so that the default ingress certificate is exposed to outside clients.
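
The `urlcert` and `showcert` commands above are not standard tools and appear to be local helpers. A rough equivalent, assuming only stock `openssl` (1.1.1 or newer for `-ext`), might look like the sketch below. The function bodies and the `url_host` helper are guesses for illustration, not the helpers actually used here:

```shell
# Hypothetical stand-ins for the urlcert/showcert helpers (illustration only).

# Extract the host (and optional port) from a URL.
url_host() {
  host=${1#*://}                 # drop the scheme, if any
  printf '%s\n' "${host%%/*}"    # drop the path
}

# Print the leaf certificate presented by a TLS endpoint, in PEM form.
urlcert() {
  host=$(url_host "$1")
  case $host in
    *:*) target=$host ;;         # port given explicitly
    *)   target=$host:443 ;;     # default HTTPS port
  esac
  openssl s_client -connect "$target" -servername "${host%%:*}" </dev/null 2>/dev/null |
    openssl x509
}

# Summarize a PEM certificate read from a file argument or from stdin.
showcert() {
  openssl x509 -noout -fingerprint -sha256 -subject -issuer -dates \
    -ext subjectAltName ${1:+-in "$1"}
}
```

With these defined, `urlcert <url> | showcert` should print the same kind of summary as shown above, although the exact layout may differ.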

larsks commented 2 weeks ago

Regarding the first problem, which certificate is resulting in the "certificate is expired or not yet valid" error?

computate commented 2 weeks ago

The second problem sounds like ACM Observability suddenly broke with its passthrough Route TLS handling.

kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: observatorium-api
  namespace: open-cluster-management-observability
  uid: a7f4bf8b-eba5-456b-b9b8-71e2e1dc4802
  resourceVersion: '1261594129'
  creationTimestamp: '2023-11-02T13:48:51Z'
  annotations:
    openshift.io/host.generated: 'true'
  ownerReferences:
    - apiVersion: observability.open-cluster-management.io/v1beta2
      kind: MultiClusterObservability
      name: observability
      uid: bcc31c98-3269-4ffc-bcfd-76257a9600d0
      controller: true
      blockOwnerDeletion: true
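
The manifest above is cut off before `spec`, so the termination type is not visible in it. One way to read it directly is sketched here with a hypothetical `route_tls_termination` wrapper around `oc get`; the namespace and route name in the example call are taken from the manifest:

```shell
# Hypothetical helper: print the TLS termination type (edge, passthrough or
# reencrypt) of a Route. Arguments: namespace, route name.
route_tls_termination() {
  oc -n "$1" get route "$2" -o jsonpath='{.spec.tls.termination}'
}

# Example call (needs a logged-in oc session):
# route_tls_termination open-cluster-management-observability observatorium-api
```

Since the ownerReferences entry marks the MultiClusterObservability resource as controller, a manual change to this route may well be reverted by the operator. That is an assumption worth checking before patching anything.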

larsks commented 2 weeks ago

Another possible solution would be to configure Grafana to trust the observability CA certificate.

computate commented 2 weeks ago

The first one relates to dex and the OAuth configuration for Grafana, stored in vault at nerc-ocp-infra/dex/grafanas under GF_TLSCLIENTCERT:

Validity
       Not Before: 2023-11-02 13:48:52 +0000 UTC
       Not After : 2024-11-01 13:48:52 +0000 UTC

larsks commented 2 weeks ago

The expired certificate in the oauth-client-secret secret (in the grafana namespace) looks like it was generated by the observability tools:

$ k extract secret/oauth-client-secret
GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET
GF_AUTH_GENERIC_TLSCACERT
GF_AUTH_GENERIC_TLSCLIENTCERT
GF_AUTH_GENERIC_TLSCLIENTKEY
$ showcert !$
showcert GF_AUTH_GENERIC_TLSCLIENTCERT
sha256 Fingerprint=51:E6:4F:CC:F6:D7:07:17:75:4B:00:F4:37:A3:74:EE:0D:31:EB:97:57:B8:25:DD:9A:A2:49:4E:AD:70:B8:0B
subject=C=US, O=Red Hat, Inc., CN=grafana
issuer=C=US, O=Red Hat, Inc., CN=observability-client-ca-certificate
notBefore=Nov  2 13:48:52 2023 GMT
notAfter=Nov  1 13:48:52 2024 GMT
X509v3 Subject Alternative Name:
    DNS:grafana

Note the issuer entry (observability-client-ca-certificate). Since the observability stack issued this certificate, there must be some mechanism to regenerate it.

schwesig commented 2 weeks ago

fyi: https://massopencloud.slack.com/archives/C027TDE52TZ/p1730820535559749

computate commented 2 weeks ago

@larsks @schwesig I updated the certs and keys described in this issue (taken from the observability-grafana-certs and observability-server-ca-certs secrets) in the nerc-ocp-obs/dex/grafanas vault path (GF_TLSCLIENTCERT, GF_TLSCLIENTKEY, GF_TLSCACERT) and restarted the grafana pods to get Grafana working again! The values came from these commands:

oc --as system:admin -n open-cluster-management-observability get secret/observability-grafana-certs -o jsonpath='{.data.tls\.crt}' | base64 -d
oc --as system:admin -n open-cluster-management-observability get secret/observability-grafana-certs -o jsonpath='{.data.tls\.key}' | base64 -d
oc --as system:admin -n open-cluster-management-observability get secret/observability-server-ca-certs -o jsonpath='{.data.ca\.crt}' | base64 -d

It's still only a temporary solution, until the new certificate expires:

        Validity
            Not Before: Aug 20 14:16:50 2024 GMT
            Not After : Aug 20 14:16:50 2025 GMT
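
To keep an eye on that date, the remaining lifetime can be computed from the Not After value. A minimal sketch, assuming GNU `date` (the `-d` option used here is not available in BSD `date`); `days_until` is a hypothetical helper name:

```shell
# Hypothetical helper: whole days from a reference time (default: now) until
# a certificate's notAfter timestamp. Relies on GNU date's -d parsing.
days_until() {
  not_after=$(date -u -d "$1" +%s) || return 1
  reference=$(date -u -d "${2:-now}" +%s) || return 1
  echo $(( (not_after - reference) / 86400 ))
}

# Example with the validity shown above:
# days_until "Aug 20 14:16:50 2025 GMT"
```

A check like this could run from a scheduled job and warn once the result drops below a chosen threshold, so the renewal does not depend on someone remembering the date.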

larsks commented 2 weeks ago

@computate @schwesig A neat command for dealing with files embedded in secrets (and configmaps) is the oc extract command; this will extract each key to a file in your local directory:

$ oc -n open-cluster-management-observability extract secret/observability-grafana-certs
ca.crt
tls.crt
tls.key
$ ls
ca.crt  tls.crt  tls.key

Saves you from the whole jsonpath/base64 dance.
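
`oc extract` can also write to stdout with `--to=-` and pick a single key with `--keys`, which makes it easy to pipe a certificate straight into `openssl` without leaving files behind. A small sketch; `secret_cert_dates` is a hypothetical helper name, and the `tls.crt` key name is assumed from the listing above:

```shell
# Hypothetical helper: print subject, issuer and validity of the tls.crt key
# in a secret without writing any files. Arguments: namespace, secret name.
secret_cert_dates() {
  oc -n "$1" extract "secret/$2" --keys=tls.crt --to=- |
    openssl x509 -noout -subject -issuer -dates
}

# Example call (needs a logged-in oc session):
# secret_cert_dates open-cluster-management-observability observability-grafana-certs
```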

schwesig commented 2 weeks ago

FYI: thanks to @RH-csaggin for recommending this great tool, and a shout-out to @dcommisso (https://github.com/dcommisso) for writing it: https://github.com/dcommisso/certexplorer

schwesig commented 2 weeks ago

Can we call this issue closed now? I created a follow-up for next year. Do we need an issue for finding a different solution?

computate commented 2 weeks ago

You can close this issue, @schwesig.

schwesig commented 2 weeks ago

follow up: