nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

obs cluster API has expired certificate #750

Closed computate closed 1 month ago

computate commented 1 month ago

Today the obs cluster API certificate is expired.

computate commented 1 month ago

@larsks or @jtriley any idea why we are seeing lots of these errors in this pod: https://console-openshift-console.apps.obs.nerc.mghpcc.org/k8s/ns/openshift-operators/pods/cert-manager-7b86568cb8-hdl69/logs

E0930 16:55:51.155876 1 sync.go:190] "propagation check failed" err="DNS record for \"api.obs.nerc.mghpcc.org\" not yet propagated" logger="cert-manager.controller" resource_name="default-api-certificate-5-2133370942-1414786744" resource_namespace="openshift-config" resource_kind="Challenge" resource_version="v1" dnsName="api.obs.nerc.mghpcc.org" type="DNS-01"

larsks commented 1 month ago

@computate no idea, but I'll see if I can figure it out. It looks as if cert-manager is creating a DNS record to answer the DNS-01 challenge, but that record never becomes resolvable.
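For context, here is a minimal sketch of what a DNS-01 challenge expects to find in DNS. It follows the ACME specification (RFC 8555) rather than cert-manager's source, and the function names and the sample key authorization are made up for illustration. The "propagation check" in the log above is waiting for a TXT record with this name to become visible.

```python
import base64
import hashlib


def dns01_record_name(domain: str) -> str:
    """Return the TXT record name a DNS-01 challenge is validated against."""
    # A wildcard identifier such as "*.apps.example.org" is validated at the
    # base name, so drop a leading "*." before adding the prefix.
    base = domain[2:] if domain.startswith("*.") else domain
    return "_acme-challenge." + base.rstrip(".")


def dns01_record_value(key_authorization: str) -> str:
    """Return the TXT value: unpadded base64url of SHA-256(key authorization)."""
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# The name cert-manager would need to publish for the API certificate.
print(dns01_record_name("api.obs.nerc.mghpcc.org"))
# "token.thumbprint" is a placeholder, not a real key authorization.
print(dns01_record_value("token.thumbprint"))
```

If that TXT record is written to a hosted zone that resolvers never reach, the check will keep failing no matter how long it waits.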

jtriley commented 1 month ago

@computate Just checked the cert via Firefox and it looks like it updated:

[Screenshot 2024-09-30 at 2:27:21 PM]

jtriley commented 1 month ago

Oh, sorry, that's the ingress controller certificate. The API certificate is indeed expired.

jtriley commented 1 month ago

I double-checked the IAM policy for OBS and it looks OK to me. The OBS cluster was configured with two hosted zones (obs.nerc.mghpcc.org and apps.obs.nerc.mghpcc.org) instead of a single obs.nerc.mghpcc.org zone. I wonder if it's selecting the wrong zone for some reason?

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "route53:GetChange",
            "Resource": "arn:aws:route53:::change/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "route53:ChangeResourceRecordSets",
                "route53:ListResourceRecordSets"
            ],
            "Resource": [
                "arn:aws:route53:::hostedzone/Z01584463FARBPKKZ6GLA",
                "arn:aws:route53:::hostedzone/Z01587362S8JQQ08E8LB2"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "route53:ListHostedZonesByName",
            "Resource": "*"
        }
    ]
}

Those two zones are:

Z01587362S8JQQ08E8LB2 apps.obs.nerc.mghpcc.org.        3
Z01584463FARBPKKZ6GLA obs.nerc.mghpcc.org.             5

The two-zone setup was configured before we delegated DNS to the Harvard URC route53 instance. I now configure a single zone per cluster.
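To make the "wrong zone" question concrete, here is a small sketch of most-specific-suffix matching, which is how DNS delegation decides which zone owns a name. This illustrates the idea only and is not taken from cert-manager's code; `pick_zone` is a hypothetical helper.

```python
from typing import List, Optional


def pick_zone(record_name: str, zones: List[str]) -> Optional[str]:
    """Return the longest zone name that is a suffix of record_name, if any."""
    name = record_name.rstrip(".").lower()
    best: Optional[str] = None
    for zone in zones:
        candidate = zone.rstrip(".").lower()
        # Match whole labels only: "obs.example" must not match "xobs.example".
        if name == candidate or name.endswith("." + candidate):
            if best is None or len(candidate) > len(best):
                best = candidate
    return best


zones = ["apps.obs.nerc.mghpcc.org.", "obs.nerc.mghpcc.org."]
print(pick_zone("_acme-challenge.api.obs.nerc.mghpcc.org", zones))
print(pick_zone("foo.apps.obs.nerc.mghpcc.org", zones))
```

Under this rule the challenge record for the API name belongs in the obs.nerc.mghpcc.org zone, while anything under apps belongs in the apps zone.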

larsks commented 1 month ago

@jtriley it looks like there may be a domain configuration issue; according to AWS, these are the nameservers for apps.obs.nerc.mghpcc.org:

        {
            "Name": "apps.obs.nerc.mghpcc.org.",
            "Type": "NS",
            "TTL": 172800,
            "ResourceRecords": [
                {
                    "Value": "ns-329.awsdns-41.com."
                },
                {
                    "Value": "ns-1207.awsdns-22.org."
                },
                {
                    "Value": "ns-1949.awsdns-51.co.uk."
                },
                {
                    "Value": "ns-966.awsdns-56.net."
                }
            ]
        },

But DNS tells us something different:

$ host -t ns apps.obs.nerc.mghpcc.org
apps.obs.nerc.mghpcc.org name server ns-1220.awsdns-24.org.
apps.obs.nerc.mghpcc.org name server ns-44.awsdns-05.com.
apps.obs.nerc.mghpcc.org name server ns-1712.awsdns-22.co.uk.
apps.obs.nerc.mghpcc.org name server ns-1003.awsdns-61.net.

And indeed, if I create a TXT record for foo.apps.obs.nerc.mghpcc.org, querying the nameservers configured in DNS shows:

$ host -t txt foo.apps.obs.nerc.mghpcc.org ns-1220.awsdns-24.org.
Using domain server:
Name: ns-1220.awsdns-24.org.
Address: 205.251.196.196#53
Aliases:

foo.apps.obs.nerc.mghpcc.org has no TXT record

But querying one of the servers that AWS reports returns the record successfully:

$ host -t txt foo.apps.obs.nerc.mghpcc.org ns-329.awsdns-41.com.
Using domain server:
Name: ns-329.awsdns-41.com.
Address: 205.251.193.73#53
Aliases:

foo.apps.obs.nerc.mghpcc.org descriptive text "This is a test"

larsks commented 1 month ago

What's puzzling is that both sets of servers respond for the apps.obs.nerc.mghpcc.org domain (but only the ones reported by the AWS API seem to get updates).
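The comparison above can be sketched as a set difference between the nameservers the API reports for the hosted zone and the ones DNS actually delegates to. The helper name is hypothetical; the NS values are copied from the two outputs in this thread.

```python
from typing import Dict, Iterable, List, Set


def compare_nameservers(api_ns: Iterable[str], dns_ns: Iterable[str]) -> Dict[str, List[str]]:
    """Report which nameservers appear in only one of two NS sets."""

    def normalize(names: Iterable[str]) -> Set[str]:
        # DNS names are case-insensitive and may carry a trailing root dot.
        return {n.rstrip(".").lower() for n in names}

    api, dns = normalize(api_ns), normalize(dns_ns)
    return {
        "only_in_api": sorted(api - dns),
        "only_in_dns": sorted(dns - api),
        "shared": sorted(api & dns),
    }


# Reported by the Route 53 API for the apps.obs.nerc.mghpcc.org zone.
api_ns = [
    "ns-329.awsdns-41.com.",
    "ns-1207.awsdns-22.org.",
    "ns-1949.awsdns-51.co.uk.",
    "ns-966.awsdns-56.net.",
]
# Returned by `host -t ns apps.obs.nerc.mghpcc.org`.
dns_ns = [
    "ns-1220.awsdns-24.org.",
    "ns-44.awsdns-05.com.",
    "ns-1712.awsdns-22.co.uk.",
    "ns-1003.awsdns-61.net.",
]
result = compare_nameservers(api_ns, dns_ns)
print(result["shared"])
```

An empty "shared" list suggests the zone being edited through the API is not the zone that the parent domain delegates to.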

hakasapl commented 1 month ago

Do we have any route53 instance that has those nameservers? It almost sounds like that domain is attached to the wrong route53 instance. (If so, I guess that's a DNS configuration issue, but I'm not sure how that is currently managed.)

jtriley commented 1 month ago

The zone delegation looks OK to me according to cli53:

$ cli53 export nerc.mghpcc.org | grep -i ^obs
obs     300     IN      NS      ns-1347.awsdns-40.org.
obs     300     IN      NS      ns-160.awsdns-20.com.
obs     300     IN      NS      ns-777.awsdns-33.net.
obs     300     IN      NS      ns-1729.awsdns-24.co.uk.

$ cli53 export obs.nerc.mghpcc.org
$ORIGIN obs.nerc.mghpcc.org.
@       900     IN      SOA     ns-1347.awsdns-40.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
@       172800  IN      NS      ns-1347.awsdns-40.org.
@       172800  IN      NS      ns-160.awsdns-20.com.
@       172800  IN      NS      ns-777.awsdns-33.net.
@       172800  IN      NS      ns-1729.awsdns-24.co.uk.
api-int 300     IN      A       10.30.9.15
api     300     IN      A       199.94.63.7
apps    300     IN      NS      ns-1220.awsdns-24.org.
apps    300     IN      NS      ns-44.awsdns-05.com.
apps    300     IN      NS      ns-1712.awsdns-22.co.uk.
apps    300     IN      NS      ns-1003.awsdns-61.net.

$ cli53 export apps.obs.nerc.mghpcc.org
$ORIGIN apps.obs.nerc.mghpcc.org.
@       900     IN      SOA     ns-1220.awsdns-24.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
@       172800  IN      NS      ns-1220.awsdns-24.org.
@       172800  IN      NS      ns-44.awsdns-05.com.
@       172800  IN      NS      ns-1712.awsdns-22.co.uk.
@       172800  IN      NS      ns-1003.awsdns-61.net.
*       300     IN      A       199.94.63.8

Also, querying the SOA and NS records with dig against Google DNS:

$ dig +short obs.nerc.mghpcc.org soa @8.8.8.8
ns-1347.awsdns-40.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400

$ dig +short obs.nerc.mghpcc.org ns @8.8.8.8
ns-1729.awsdns-24.co.uk.
ns-1347.awsdns-40.org.
ns-160.awsdns-20.com.
ns-777.awsdns-33.net.

$ dig +short apps.obs.nerc.mghpcc.org soa @8.8.8.8
ns-1220.awsdns-24.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400

$ dig +short apps.obs.nerc.mghpcc.org ns @8.8.8.8
ns-1220.awsdns-24.org.
ns-1003.awsdns-61.net.
ns-44.awsdns-05.com.
ns-1712.awsdns-22.co.uk.

hakasapl commented 1 month ago

Check out this AWS page

Specifically, the last section on that page, about having two hosted zones with the same name, describes behavior that matches our issue.

larsks commented 1 month ago

@hakasapl I had that thought too, but it doesn't look like that's the case:

$ aws route53 list-hosted-zones-by-name | jq '.HostedZones[]|.Name' -r | grep apps
apps.obs.nerc.mghpcc.org.
apps.shift.nerc.mghpcc.org.

hakasapl commented 1 month ago

@larsks ah, okay. There goes that idea
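For reference, the duplicate-zone check can be sketched against the JSON that `aws route53 list-hosted-zones-by-name` prints. The helper name and the zone IDs in the sample are made up for illustration; only the `HostedZones`, `Name` and `Id` keys are assumed from the CLI output.

```python
from collections import defaultdict
from typing import Dict, List


def duplicate_zone_names(listing: dict) -> Dict[str, List[str]]:
    """Map each hosted zone name that occurs more than once to its zone IDs."""
    ids_by_name: Dict[str, List[str]] = defaultdict(list)
    for zone in listing["HostedZones"]:
        ids_by_name[zone["Name"].lower()].append(zone["Id"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical listing: the IDs below are placeholders, not real zones.
sample = {
    "HostedZones": [
        {"Id": "/hostedzone/ZEXAMPLE1", "Name": "apps.obs.nerc.mghpcc.org."},
        {"Id": "/hostedzone/ZEXAMPLE2", "Name": "apps.shift.nerc.mghpcc.org."},
    ]
}
print(duplicate_zone_names(sample))
```

A listing only covers the account behind the credentials in use, so a same-named zone in a different AWS account would not show up in a check like this.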

larsks commented 1 month ago

@jtriley the output of cli53 export is less interesting because that's just querying the API, and should thus return the same results I saw using the AWS CLI. E.g., for the apps domain, we see the nameserver records I reported in my earlier comment:

$ aws route53 list-resource-record-sets --hosted-zone-id Z064955314WLRARFU1D54 | jq -r '.ResourceRecordSets[] | select(.Type == "NS")|.ResourceRecords[].Value'
ns-329.awsdns-41.com.
ns-1207.awsdns-22.org.
ns-1949.awsdns-51.co.uk.
ns-966.awsdns-56.net.

The problem is that we see something different when asking DNS. Taking your dig command as an example, but for the apps domain:

$ dig +short apps.obs.nerc.mghpcc.org ns @8.8.8.8
ns-1003.awsdns-61.net.
ns-1712.awsdns-22.co.uk.
ns-44.awsdns-05.com.
ns-1220.awsdns-24.org.

It looks like we see the same discrepancy for the obs domain. Querying the API, we see:

$ aws route53 list-resource-record-sets --hosted-zone-id Z06267822LLEMV27APZA0 | jq -r '.ResourceRecordSets[] | select(.Type == "NS")|select(.Name == "obs.nerc.mghpcc.org.")|.ResourceRecords[].Value'
ns-1296.awsdns-34.org.
ns-1963.awsdns-53.co.uk.
ns-460.awsdns-57.com.
ns-760.awsdns-31.net.

But dig tells us:

$ dig +short apps.obs.nerc.mghpcc.org ns @8.8.8.8
ns-44.awsdns-05.com.
ns-1712.awsdns-22.co.uk.
ns-1220.awsdns-24.org.
ns-1003.awsdns-61.net.

I notice that we're getting identical results here for apps.obs.nerc.mghpcc.org and obs.nerc.mghpcc.org.

If I create a new record for bar.obs.nerc.mghpcc.org in the obs.nerc.mghpcc.org domain:

$ aws route53 list-resource-record-sets --hosted-zone-id Z06267822LLEMV27APZA0 | jq '.ResourceRecordSets[]|select(.Name=="bar.obs.nerc.mghpcc.org.")'
{
  "Name": "bar.obs.nerc.mghpcc.org.",
  "Type": "A",
  "TTL": 300,
  "ResourceRecords": [
    {
      "Value": "1.2.3.4"
    }
  ]
}

We see that even after several minutes it hasn't shown up on those servers:

$ dig +short bar.obs.nerc.mghpcc.org a @ns-44.awsdns-05.com.
$

But it does show up on the servers listed by the API:

$ dig +short bar.obs.nerc.mghpcc.org a @ns-1296.awsdns-34.org.
1.2.3.4

jtriley commented 1 month ago

Just noting a couple of things we identified:

  1. The credentials and zone ids @larsks was using while investigating the issue were from the old MGHPCC-hosted route53 instance, from before we delegated nerc.mghpcc.org to the Harvard URC route53 instance.
  2. The obs cluster is missing the aws-route53-credentials ExternalSecret. This was likely configured manually at cluster creation time, and we never added that ExternalSecret to the nerc-ocp-config manifests.

We'll need to manually update the credentials to fix the API cert, then open a PR against the nerc-ocp-config repo to add the route53 credentials ExternalSecret.

larsks commented 1 month ago

> The credentials and zone ids @larsks was using while investigating the issue were from the old MGHPCC-hosted route53 instance...

...because these are the credentials that were in the vault. Justin has updated the vault with appropriate credentials, and I have manually edited the route53 secret on the cluster. It looks like the API certificate is now valid:

$ k -n openshift-config get certificate
NAME                      READY   SECRET                    AGE
default-api-certificate   True    default-api-certificate   243d
$ curl https://api.obs.nerc.mghpcc.org:6443/healthz
ok
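The curl check above only succeeds if the certificate validates. As a complementary sketch, the snippet below shows one way to see how many days a served certificate has left, using only the Python standard library. The function names are hypothetical, the date in the example is made up, and port 6443 is taken from the curl command above.

```python
import socket
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter string."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host: str, port: int = 6443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served at host:port.

    Default verification is used, so an expired or untrusted certificate
    raises ssl.SSLCertVerificationError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    # Offline example with a made-up expiry date; no network access needed.
    made_up_not_after = "Jan  5 09:34:43 2018 GMT"
    one_day_before = ssl.cert_time_to_seconds(made_up_not_after) - 86400
    print(days_until_expiry(made_up_not_after, now=one_day_before))
```

Against the live cluster this would be called as `days_until_expiry(fetch_not_after("api.obs.nerc.mghpcc.org"), time.time())` after importing `time`, assuming the host is reachable and its certificate chain is trusted by the local system.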

jtriley commented 1 month ago

> Justin has updated the vault with appropriate credentials, and I have manually edited the route53 secret on the cluster. It looks like the API certificate is now valid:

Just noting that I updated those credentials a while back, when we first delegated nerc.mghpcc.org to the Harvard URC route53 instance. As I mentioned above, there is currently no ExternalSecret defined on the OBS cluster for aws-route53-credentials, so the secret on the cluster never updated automatically. I'm drafting a PR now to fix that.