nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

obs cluster API has expired certificate #750

Open computate opened 15 hours ago

computate commented 15 hours ago

As of today, the obs cluster API certificate has expired.

computate commented 14 hours ago

@larsks or @jtriley, any idea why we are seeing lots of these errors in this pod? https://console-openshift-console.apps.obs.nerc.mghpcc.org/k8s/ns/openshift-operators/pods/cert-manager-7b86568cb8-hdl69/logs

E0930 16:55:51.155876 1 sync.go:190] "propagation check failed" err="DNS record for \"api.obs.nerc.mghpcc.org\" not yet propagated" logger="cert-manager.controller" resource_name="default-api-certificate-5-2133370942-1414786744" resource_namespace="openshift-config" resource_kind="Challenge" resource_version="v1" dnsName="api.obs.nerc.mghpcc.org" type="DNS-01"
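To see whether "lots of these errors" all concern the same name, one option is to tally the failed propagation checks per `dnsName`. The sketch below assumes every line uses the same `key="value"` layout as the log line above. The function names are illustrative only and are not part of cert-manager or `oc`.

```python
import re
from collections import Counter

# key="value" pairs in klog structured output; escaped quotes (\") may appear inside a value.
PAIR_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_fields(line: str) -> dict:
    """Return the key="value" fields of one log line (escape sequences left as-is)."""
    return dict(PAIR_RE.findall(line))


def count_failed_checks(lines) -> Counter:
    """Count "propagation check failed" lines per (dnsName, challenge type)."""
    counts = Counter()
    for line in lines:
        if '"propagation check failed"' not in line:
            continue  # some other message; skip it
        fields = parse_fields(line)
        counts[(fields.get("dnsName", "?"), fields.get("type", "?"))] += 1
    return counts
```

For example, you could save the output of `oc logs -n openshift-operators cert-manager-7b86568cb8-hdl69` to a file and pass `open(path)` to `count_failed_checks`. Then call `.most_common()` on the result to list the names that fail most often.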

larsks commented 13 hours ago

@computate No idea, but I'll see if I can figure it out. It looks as if cert-manager is creating a DNS record to answer the DNS-01 challenge, but that record never becomes resolvable.
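For reference, an ACME DNS-01 challenge (RFC 8555, section 8.4) expects a TXT record at `_acme-challenge.<domain>`. Its value is the unpadded base64url SHA-256 digest of the key authorization, `<token>.<account key thumbprint>`. The sketch below computes that name and value. The token and thumbprint in the usage line are made-up placeholders, not values from this cluster.

```python
import base64
import hashlib
from typing import Tuple


def b64url(data: bytes) -> str:
    """Base64url encoding without '=' padding, as ACME requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def dns01_record(domain: str, token: str, jwk_thumbprint: str) -> Tuple[str, str]:
    """Return the (TXT record name, TXT value) a DNS-01 challenge expects.

    jwk_thumbprint is the already base64url-encoded account key thumbprint.
    """
    key_authorization = f"{token}.{jwk_thumbprint}"
    name = f"_acme-challenge.{domain.rstrip('.')}."
    value = b64url(hashlib.sha256(key_authorization.encode("ascii")).digest())
    return name, value


# Hypothetical token/thumbprint, just to show the shape of the result.
print(dns01_record("api.obs.nerc.mghpcc.org", "example-token", "example-thumbprint"))
```

A quick manual check is `dig +short TXT _acme-challenge.api.obs.nerc.mghpcc.org`. If nothing comes back while the challenge is pending, the record is probably being written somewhere public DNS does not look.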

jtriley commented 13 hours ago

@computate I just checked the cert via Firefox and it looks like it was updated:

[Screenshot: 2024-09-30 at 2:27:21 PM]

jtriley commented 13 hours ago

Oh, sorry, that's the ingress controller certificate. The API certificate is indeed expired.

jtriley commented 12 hours ago

I double-checked the IAM policy for OBS and it looks OK to me. The OBS cluster was configured with two hosted zones (obs.nerc and apps.obs.nerc) instead of a single zone for obs.nerc.mghpcc.org. I wonder if it's selecting the wrong zone for some reason?

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "route53:GetChange",
            "Resource": "arn:aws:route53:::change/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "route53:ChangeResourceRecordSets",
                "route53:ListResourceRecordSets"
            ],
            "Resource": [
                "arn:aws:route53:::hostedzone/Z01584463FARBPKKZ6GLA",
                "arn:aws:route53:::hostedzone/Z01587362S8JQQ08E8LB2"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "route53:ListHostedZonesByName",
            "Resource": "*"
        }
    ]
}
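As a rough sanity check, the policy can be tested against the calls the DNS-01 solver needs. The sketch below is a simplified matcher, not AWS's evaluation logic. It ignores `Deny`, `Condition`, `NotAction`/`NotResource` and policy variables, and it uses `fnmatch` wildcards as a stand-in for IAM's `*` and `?`. For an authoritative answer, `aws iam simulate-principal-policy` is the better tool.

```python
import json
from fnmatch import fnmatchcase

# The policy quoted above, verbatim apart from whitespace.
POLICY = json.loads("""
{"Version": "2012-10-17", "Statement": [
  {"Effect": "Allow", "Action": "route53:GetChange", "Resource": "arn:aws:route53:::change/*"},
  {"Effect": "Allow",
   "Action": ["route53:ChangeResourceRecordSets", "route53:ListResourceRecordSets"],
   "Resource": ["arn:aws:route53:::hostedzone/Z01584463FARBPKKZ6GLA",
                "arn:aws:route53:::hostedzone/Z01587362S8JQQ08E8LB2"]},
  {"Effect": "Allow", "Action": "route53:ListHostedZonesByName", "Resource": "*"}
]}
""")


def _as_list(value):
    """IAM allows a single string or a list for Action/Resource."""
    return [value] if isinstance(value, str) else list(value)


def is_allowed(policy: dict, action: str, resource: str) -> bool:
    """Simplified check: does any Allow statement cover this action on this resource?"""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # a policy may contain a single statement object
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        # Action names are case-insensitive in IAM; ARNs are compared case-sensitively here.
        action_ok = any(fnmatchcase(action.lower(), a.lower())
                        for a in _as_list(stmt.get("Action", [])))
        resource_ok = any(fnmatchcase(resource, r)
                          for r in _as_list(stmt.get("Resource", [])))
        if action_ok and resource_ok:
            return True
    return False


ZONE_ARN = "arn:aws:route53:::hostedzone/{}"
for zone_id in ("Z01584463FARBPKKZ6GLA", "Z01587362S8JQQ08E8LB2"):
    for action in ("route53:ChangeResourceRecordSets", "route53:ListResourceRecordSets"):
        print(zone_id, action, is_allowed(POLICY, action, ZONE_ARN.format(zone_id)))
```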

Those two zones are:

Z01587362S8JQQ08E8LB2 apps.obs.nerc.mghpcc.org.        3
Z01584463FARBPKKZ6GLA obs.nerc.mghpcc.org.             5
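To reason about "selecting the wrong zone", here is a sketch that assumes the challenge record should go into the most specific hosted zone whose name is a whole-label suffix of the record name. This is not cert-manager's code; its own zone discovery (for example via SOA lookups) may behave differently. Under this assumption, `_acme-challenge.api.obs.nerc.mghpcc.org` belongs in `Z01584463FARBPKKZ6GLA`, and anything under `apps.obs` belongs in `Z01587362S8JQQ08E8LB2`.

```python
from typing import Dict, Optional

# Hosted zones listed above (ID -> name).
ZONES = {
    "Z01587362S8JQQ08E8LB2": "apps.obs.nerc.mghpcc.org.",
    "Z01584463FARBPKKZ6GLA": "obs.nerc.mghpcc.org.",
}


def most_specific_zone(fqdn: str, zones: Dict[str, str]) -> Optional[str]:
    """Return the ID of the zone whose name is the longest whole-label suffix of fqdn."""
    name = fqdn.rstrip(".").lower() + "."
    best_id, best_len = None, -1
    for zone_id, zone_name in zones.items():
        zone = zone_name.rstrip(".").lower() + "."
        # Compare whole labels, so "xobs.nerc.mghpcc.org." does not match "obs.nerc.mghpcc.org.".
        if (name == zone or name.endswith("." + zone)) and len(zone) > best_len:
            best_id, best_len = zone_id, len(zone)
    return best_id


for fqdn in ("_acme-challenge.api.obs.nerc.mghpcc.org",
             "_acme-challenge.apps.obs.nerc.mghpcc.org"):
    print(fqdn, "->", most_specific_zone(fqdn, ZONES))
```

If the record does land in the obs.nerc zone, which the policy above covers, one thing worth checking is whether public DNS still delegates obs.nerc.mghpcc.org to that zone's nameservers. A record written to a zone nobody queries would never appear to propagate. This is a guess, not a confirmed cause.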

The two-zone setup was configured before we delegated the DNS to the Harvard-URC Route 53 instance. Moving forward, I'm configuring a single zone per cluster.