computate closed this issue 1 month ago
@larsks or @jtriley any idea why we are seeing lots of these errors in this pod: https://console-openshift-console.apps.obs.nerc.mghpcc.org/k8s/ns/openshift-operators/pods/cert-manager-7b86568cb8-hdl69/logs
E0930 16:55:51.155876 1 sync.go:190] "propagation check failed" err="DNS record for \"api.obs.nerc.mghpcc.org\" not yet propagated" logger="cert-manager.controller" resource_name="default-api-certificate-5-2133370942-1414786744" resource_namespace="openshift-config" resource_kind="Challenge" resource_version="v1" dnsName="api.obs.nerc.mghpcc.org" type="DNS-01"
@computate no idea, but I'll see if I can figure it out. It looks as if cert-manager is attempting to create a DNS record to respond to the DNS-01 challenge, but that record never becomes resolvable.
@computate Just checked the cert via firefox and it looks like it updated:
Oh, sorry that's the ingress controller. API is indeed expired.
I double checked the IAM policy for OBS and it looks OK to me. The OBS cluster was configured with two domains (obs.nerc and apps.obs.nerc) instead of just one for obs.nerc.mghpcc.org. I wonder if it's selecting the wrong zone for some reason?
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "route53:GetChange",
      "Resource": "arn:aws:route53:::change/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "route53:ChangeResourceRecordSets",
        "route53:ListResourceRecordSets"
      ],
      "Resource": [
        "arn:aws:route53:::hostedzone/Z01584463FARBPKKZ6GLA",
        "arn:aws:route53:::hostedzone/Z01587362S8JQQ08E8LB2"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "route53:ListHostedZonesByName",
      "Resource": "*"
    }
  ]
}
Those two zones are:
Z01587362S8JQQ08E8LB2 apps.obs.nerc.mghpcc.org. 3
Z01584463FARBPKKZ6GLA obs.nerc.mghpcc.org. 5
The two-zone setup was configured before we delegated the DNS to the Harvard-URC route53 instance. Moving forward, I'm configuring a single zone per cluster.
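To make the "wrong zone" question concrete, here is a minimal Python sketch of picking a hosted zone for a challenge record by longest matching suffix. This is an illustration only: the function name `pick_hosted_zone` is hypothetical, and the longest-suffix rule is an assumption, not a confirmed description of how cert-manager chooses a zone.

```python
def pick_hosted_zone(fqdn, zone_names):
    """Return the zone whose name is the longest suffix of fqdn, or None.

    Assumption: longest-suffix matching. This is a generic rule used for
    illustration, not cert-manager's verified behavior.
    """
    name = fqdn.rstrip(".").lower()
    best = None
    for zone in zone_names:
        z = zone.rstrip(".").lower()
        # A zone matches if the name equals it or ends with ".<zone>".
        if name == z or name.endswith("." + z):
            if best is None or len(z) > len(best):
                best = z
    return best


# The two zone names listed in this thread.
zones = ["apps.obs.nerc.mghpcc.org.", "obs.nerc.mghpcc.org."]

# DNS-01 challenges are published under the _acme-challenge label.
print(pick_hosted_zone("_acme-challenge.api.obs.nerc.mghpcc.org", zones))
print(pick_hosted_zone("_acme-challenge.apps.obs.nerc.mghpcc.org", zones))
```

Under this rule the API name falls in the obs zone and anything under apps falls in the apps zone, so having two zones would not by itself send the API challenge to the wrong place.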
@jtriley it looks like there may be a domain configuration issue; according to AWS, these are the nameservers for apps.obs.nerc.mghpcc.org:
{
  "Name": "apps.obs.nerc.mghpcc.org.",
  "Type": "NS",
  "TTL": 172800,
  "ResourceRecords": [
    {
      "Value": "ns-329.awsdns-41.com."
    },
    {
      "Value": "ns-1207.awsdns-22.org."
    },
    {
      "Value": "ns-1949.awsdns-51.co.uk."
    },
    {
      "Value": "ns-966.awsdns-56.net."
    }
  ]
},
But DNS tells us something different:
$ host -t ns apps.obs.nerc.mghpcc.org
apps.obs.nerc.mghpcc.org name server ns-1220.awsdns-24.org.
apps.obs.nerc.mghpcc.org name server ns-44.awsdns-05.com.
apps.obs.nerc.mghpcc.org name server ns-1712.awsdns-22.co.uk.
apps.obs.nerc.mghpcc.org name server ns-1003.awsdns-61.net.
And indeed, if I create a TXT record for foo.apps.obs.nerc.mghpcc.org, querying the nameservers configured in DNS shows:
$ host -t txt foo.apps.obs.nerc.mghpcc.org ns-1220.awsdns-24.org.
Using domain server:
Name: ns-1220.awsdns-24.org.
Address: 205.251.196.196#53
Aliases:
foo.apps.obs.nerc.mghpcc.org has no TXT record
But querying one of the servers that AWS reports returns the record successfully:
$ host -t txt foo.apps.obs.nerc.mghpcc.org ns-329.awsdns-41.com.
Using domain server:
Name: ns-329.awsdns-41.com.
Address: 205.251.193.73#53
Aliases:
foo.apps.obs.nerc.mghpcc.org descriptive text "This is a test"
What's puzzling is that both sets of servers respond for the apps.obs.nerc.mghpcc.org domain (but only the ones reported by the AWS API seem to get updates).
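The mismatch between the two nameserver sets can be checked mechanically. Below is a small Python sketch that compares the set reported by the API with the set returned by DNS; the function names are hypothetical, and the inputs are the values pasted above rather than live lookups.

```python
def normalize(names):
    """Lowercase names and drop trailing dots so the two sources compare cleanly."""
    return {n.strip().rstrip(".").lower() for n in names if n.strip()}


def compare_nameservers(api_ns, dns_ns):
    """Report which nameservers appear in only one of the two sources."""
    api, dns = normalize(api_ns), normalize(dns_ns)
    return {
        "match": api == dns,
        "only_in_api": sorted(api - dns),
        "only_in_dns": sorted(dns - api),
    }


# Values copied from the API output and the `host -t ns` output in this thread.
api_ns = [
    "ns-329.awsdns-41.com.",
    "ns-1207.awsdns-22.org.",
    "ns-1949.awsdns-51.co.uk.",
    "ns-966.awsdns-56.net.",
]
dns_ns = [
    "ns-1220.awsdns-24.org.",
    "ns-44.awsdns-05.com.",
    "ns-1712.awsdns-22.co.uk.",
    "ns-1003.awsdns-61.net.",
]

result = compare_nameservers(api_ns, dns_ns)
print(result["match"])
print(result["only_in_api"])
print(result["only_in_dns"])
```

A fully disjoint result like this one suggests the API credentials and the public delegation are pointing at two different hosted zones rather than at one zone with a stale record.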
Do we have any route53 instance that has those nameservers? It almost sounds like that domain is attached to the wrong route53 instance or something (I guess that's a dns conf issue if so but I'm not sure how that is managed currently)
The zone delegation looks OK to me according to cli53:
$ cli53 export nerc.mghpcc.org | grep -i ^obs
obs 300 IN NS ns-1347.awsdns-40.org.
obs 300 IN NS ns-160.awsdns-20.com.
obs 300 IN NS ns-777.awsdns-33.net.
obs 300 IN NS ns-1729.awsdns-24.co.uk.
$ cli53 export obs.nerc.mghpcc.org
$ORIGIN obs.nerc.mghpcc.org.
@ 900 IN SOA ns-1347.awsdns-40.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
@ 172800 IN NS ns-1347.awsdns-40.org.
@ 172800 IN NS ns-160.awsdns-20.com.
@ 172800 IN NS ns-777.awsdns-33.net.
@ 172800 IN NS ns-1729.awsdns-24.co.uk.
api-int 300 IN A 10.30.9.15
api 300 IN A 199.94.63.7
apps 300 IN NS ns-1220.awsdns-24.org.
apps 300 IN NS ns-44.awsdns-05.com.
apps 300 IN NS ns-1712.awsdns-22.co.uk.
apps 300 IN NS ns-1003.awsdns-61.net.
$ cli53 export apps.obs.nerc.mghpcc.org
$ORIGIN apps.obs.nerc.mghpcc.org.
@ 900 IN SOA ns-1220.awsdns-24.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
@ 172800 IN NS ns-1220.awsdns-24.org.
@ 172800 IN NS ns-44.awsdns-05.com.
@ 172800 IN NS ns-1712.awsdns-22.co.uk.
@ 172800 IN NS ns-1003.awsdns-61.net.
* 300 IN A 199.94.63.8
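The delegation check done by eye above (parent NS records for apps versus the apex NS records of the apps zone) can be sketched in Python as follows. The helper `ns_records` is hypothetical and assumes the five-column `name ttl IN NS target` layout shown in the cli53 exports.

```python
def ns_records(export_text, label):
    """Collect NS targets for one label from zone-file style export text."""
    targets = set()
    for line in export_text.splitlines():
        fields = line.split()
        # Expect: <label> <ttl> IN NS <target>
        if len(fields) == 5 and fields[0] == label and fields[2:4] == ["IN", "NS"]:
            targets.add(fields[4].rstrip(".").lower())
    return targets


# Excerpts copied from the two exports above.
parent_export = """\
apps 300 IN NS ns-1220.awsdns-24.org.
apps 300 IN NS ns-44.awsdns-05.com.
apps 300 IN NS ns-1712.awsdns-22.co.uk.
apps 300 IN NS ns-1003.awsdns-61.net.
"""
child_export = """\
@ 172800 IN NS ns-1220.awsdns-24.org.
@ 172800 IN NS ns-44.awsdns-05.com.
@ 172800 IN NS ns-1712.awsdns-22.co.uk.
@ 172800 IN NS ns-1003.awsdns-61.net.
* 300 IN A 199.94.63.8
"""

delegated = ns_records(parent_export, "apps")
apex = ns_records(child_export, "@")
print(delegated == apex)
```

For these exports the delegated set and the apex set agree, which is consistent with the delegation looking OK from this account's point of view.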
Also querying the soa/ns from dig against google DNS:
$ dig +short obs.nerc.mghpcc.org soa @8.8.8.8
ns-1347.awsdns-40.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
$ dig +short obs.nerc.mghpcc.org ns @8.8.8.8
ns-1729.awsdns-24.co.uk.
ns-1347.awsdns-40.org.
ns-160.awsdns-20.com.
ns-777.awsdns-33.net.
$ dig +short apps.obs.nerc.mghpcc.org soa @8.8.8.8
ns-1220.awsdns-24.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
$ dig +short apps.obs.nerc.mghpcc.org ns @8.8.8.8
ns-1220.awsdns-24.org.
ns-1003.awsdns-61.net.
ns-44.awsdns-05.com.
ns-1712.awsdns-22.co.uk.
Check out this aws page
Specifically, the last section on that page, about having two hosted zones with the same name, describes behavior that looks the same as our issue.
@hakasapl I had that thought too, but it doesn't look like that's the case:
$ aws route53 list-hosted-zones-by-name | jq '.HostedZones[]|.Name' -r | grep apps
apps.obs.nerc.mghpcc.org.
apps.shift.nerc.mghpcc.org.
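The same duplicate-name check can be done on the full JSON instead of grepping names. This is a rough Python sketch that assumes the `HostedZones` / `Name` / `Id` layout that the `aws route53 list-hosted-zones-by-name` output uses; the function name and the zone IDs in the sample are hypothetical placeholders.

```python
import json
from collections import defaultdict


def duplicate_zone_names(list_zones_json):
    """Return {zone name: [zone ids]} for names that occur more than once."""
    zones = json.loads(list_zones_json)["HostedZones"]
    ids_by_name = defaultdict(list)
    for zone in zones:
        ids_by_name[zone["Name"].lower()].append(zone["Id"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Sample input shaped like the CLI output; the IDs are made-up placeholders.
sample = json.dumps({
    "HostedZones": [
        {"Id": "/hostedzone/ZEXAMPLE1", "Name": "apps.obs.nerc.mghpcc.org."},
        {"Id": "/hostedzone/ZEXAMPLE2", "Name": "apps.shift.nerc.mghpcc.org."},
    ]
})

print(duplicate_zone_names(sample))
```

Note that this only sees zones in the account the credentials belong to, so an empty result does not rule out a same-named zone in a different account.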
@larsks ah, okay. There goes that idea
@jtriley the output of cli53 export is less interesting because that's just querying the API, and should thus return the same results I saw using the aws cli. E.g., for the apps domain, we see the nameserver records I reported in my earlier comment:
$ aws route53 list-resource-record-sets --hosted-zone-id Z064955314WLRARFU1D54 | jq -r '.ResourceRecordSets[] | select(.Type == "NS")|.ResourceRecords[].Value'
ns-329.awsdns-41.com.
ns-1207.awsdns-22.org.
ns-1949.awsdns-51.co.uk.
ns-966.awsdns-56.net.
The problem is that we see something different when asking DNS. Taking your dig command as an example, but for the apps domain:
$ dig +short apps.obs.nerc.mghpcc.org ns @8.8.8.8
ns-1003.awsdns-61.net.
ns-1712.awsdns-22.co.uk.
ns-44.awsdns-05.com.
ns-1220.awsdns-24.org.
It looks like we see the same discrepancy for the obs domain. Querying the API, we see:
$ aws route53 list-resource-record-sets --hosted-zone-id Z06267822LLEMV27APZA0 | jq -r '.ResourceRecordSets[] | select(.Type == "NS")|select(.Name == "obs.nerc.mghpcc.org.")|.ResourceRecords[].Value'
ns-1296.awsdns-34.org.
ns-1963.awsdns-53.co.uk.
ns-460.awsdns-57.com.
ns-760.awsdns-31.net.
But dig tells us:
$ dig +short apps.obs.nerc.mghpcc.org ns @8.8.8.8
ns-44.awsdns-05.com.
ns-1712.awsdns-22.co.uk.
ns-1220.awsdns-24.org.
ns-1003.awsdns-61.net.
I notice that we're getting identical results here for apps.obs.nerc.mghpcc.org and obs.nerc.mghpcc.org.
If I create a new record for bar.obs.nerc.mghpcc.org in the obs.nerc.mghpcc.org domain:
$ aws route53 list-resource-record-sets --hosted-zone-id Z06267822LLEMV27APZA0 | jq '.ResourceRecordSets[]|select(.Name=="bar.obs.nerc.mghpcc.org.")'
{
  "Name": "bar.obs.nerc.mghpcc.org.",
  "Type": "A",
  "TTL": 300,
  "ResourceRecords": [
    {
      "Value": "1.2.3.4"
    }
  ]
}
We see that even after several minutes it hasn't shown up in those servers:
$ dig +short bar.obs.nerc.mghpcc.org a @ns-44.awsdns-05.com.
$
But it does show up in the servers listed in the api:
$ dig +short bar.obs.nerc.mghpcc.org a @ns-1296.awsdns-34.org.
1.2.3.4
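The test above amounts to a per-nameserver propagation check. Here is a minimal Python sketch of that idea, fed with the answers observed in this thread instead of live queries. The function `propagation_status` is hypothetical, and requiring the value on every listed server is an assumption made for illustration, not a description of cert-manager's exact check.

```python
def propagation_status(expected, answers_by_server):
    """Treat a record as propagated only if every server returns the expected value."""
    missing = sorted(
        server
        for server, answers in answers_by_server.items()
        if expected not in answers
    )
    return {"propagated": not missing, "missing_on": missing}


# Answers observed above for bar.obs.nerc.mghpcc.org (type A).
observed = {
    "ns-44.awsdns-05.com.": [],             # server found via public DNS
    "ns-1296.awsdns-34.org.": ["1.2.3.4"],  # server listed by the API
}

status = propagation_status("1.2.3.4", observed)
print(status["propagated"])
print(status["missing_on"])
```

As long as the record is written to a zone that the publicly delegated servers do not serve, a check of this kind can never succeed, which matches the repeated "not yet propagated" errors.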
Just noting a couple of things we identified:
The OBS cluster has no aws-route53-credentials externalsecret. Likely this was configured manually at cluster creation time and we missed a spot in the nerc-ocp-config manifests to include that externalsecret.
We'll need to manually update the credentials to fix the API cert and then make a PR to the nerc-ocp-config repo to include the route53 credentials externalsecret.
The credentials and zone ids @larsks was using while investigating the issue were from the old MGHPCC-hosted route53 instance...
...because these are the credentials that were in the vault. Justin has updated the vault with appropriate credentials, and I have manually edited the route53 secret on the cluster. It looks like the API certificate is now valid:
$ k -n openshift-config get certificate
NAME READY SECRET AGE
default-api-certificate True default-api-certificate 243d
$ curl https://api.obs.nerc.mghpcc.org:6443/healthz
ok
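To catch the next expiry before it happens, the certificate's notAfter timestamp can be turned into days remaining. This is a small Python sketch using the standard library's `ssl.cert_time_to_seconds`; the helper name `days_until_expiry` and the dates in the example are hypothetical, and fetching the certificate from the API endpoint is left out.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days until a certificate's notAfter time (negative if already expired).

    not_after uses the format found in certificate dumps,
    e.g. "Oct  1 00:00:00 2024 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


# Hypothetical dates, fixed so the example is reproducible.
reference = ssl.cert_time_to_seconds("Sep 30 00:00:00 2024 GMT")
print(days_until_expiry("Oct  1 00:00:00 2024 GMT", now=reference))
print(days_until_expiry("Sep 29 00:00:00 2024 GMT", now=reference))
```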
Justin has updated the vault with appropriate credentials, and I have manually edited the route53 secret on the cluster. It looks like the API certificate is now valid:
Just noting I updated those credentials a while back when we first delegated nerc.mghpcc.org to the Harvard URC route53 instance. As I mentioned above, there is currently no external secret defined on the OBS cluster for aws-route53-credentials so it never updated automatically. I'm drafting a PR now to fix that.
Today the obs cluster API certificate is expired.