erwbgy closed this issue 2 years ago
An example (using Contour 1.21.0) where the Status says that the Secret does not exist but it does:
$ kubectl describe httpproxy global | grep Message
Message: Spec.VirtualHost.TLS Secret "newtls" is invalid: Secret not found
Message: At least one error present, see Errors for details
$ kubectl get secret "newtls"
NAME TYPE DATA AGE
newtls kubernetes.io/tls 2 3h24m
According to #3496 this can also be because the Secret is considered invalid but I don't see anything obviously wrong with it:
$ kubectl describe secret "newtls"
Name: newtls
Namespace: ...
Labels: <none>
Annotations: cert-manager.io/alt-names:
...
cert-manager.io/certificate-name: global
cert-manager.io/common-name: ...
cert-manager.io/ip-sans:
cert-manager.io/issuer-group: cert-manager.io
cert-manager.io/issuer-kind: ClusterIssuer
cert-manager.io/issuer-name: ...
cert-manager.io/uri-sans:
Type: kubernetes.io/tls
Data
====
tls.crt: 6245 bytes
tls.key: 1679 bytes
$ kubectl get secret newtls -o json | jq -r '.data["tls.crt"]' | base64 -d | showcerts2 /dev/stdin
---
X.509v3 TLS Certificate (RSA 2048) [Serial: 6690...9970]
...
Valid from: 2022-06-23T21:40:49Z
to: 2024-06-22T21:40:49Z
---
X.509v3 Intermediate CA Certificate (RSA 2048) [Serial: 1605...1715]
...
Valid from: 2015-10-07T15:22:23Z
to: 2025-10-07T15:32:23Z
---
X.509v3 Root CA Certificate (RSA 4096) [Serial: 2177...9169]
...
Valid from: 2015-10-07T14:06:52Z
to: 2035-10-07T14:11:56Z
Contour logs:
$ stern contour- | grep -i newtls
+ contour-786b48f464-hz9h4 › contour
+ contour-786b48f464-hfppz › contour
contour-786b48f464-hfppz contour time="2022-06-23T21:50:56Z" level=error msg="invalid TLS certificate: failed to locate certificate" context=KubernetesCache kind=Secret name=newtls namespace=... version=v1
contour-786b48f464-hfppz contour time="2022-06-24T09:17:25Z" level=error msg="stream terminated" connection=1114 context=xds error="context canceled" node_id=envoy-d6845759f-bxjrl node_version=v1.22.0 resource_names="[.../newtls/ced75e6a4e]" response_nonce=1634 type_url=type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret version_info=1634
Can you share tls.crt here as an attachment? Somehow the validation fails on it, since Contour prints "failed to locate certificate".
I can't share that specific tls.crt, but after restarting the Contour Deployment it validated fine. The Common Name and subjectAltNames were both set, and the certificate was valid. Is there anything else I need to look for?
I was searching cert-manager issues to find out if the certificate could be missing during enrollment, but it seems cert-manager creates the Secret only after enrollment succeeds.
Just in case, I tested creating a Secret with an empty tls.crt file, which results in "failed to locate certificate" as expected. Updating the certificate back into the Secret did trigger the HTTPProxy status update, and everything worked again. So it seems the problem is not that simple.
tbh it sounds like we might have a problem with caches going stale somewhere. If this happens again, could you see if making some minor change to the Secret (like adding a dummy annotation) fixes it?
We made some changes a while ago to be more aggressive about dropping Secrets from the caches, this will help us see if there's a bug there. (Poking the Secret should generate a change event from the apiserver, and if Contour has stale info for it, then that should result in it being fixed. If poking the Secret doesn't do anything, then it's not that.)
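For anyone following along, the "poke" suggested above can be a one-line annotate. This is a sketch: the Secret name is the one from this thread, and the annotation key is made up for illustration:

```shell
# Add/update a dummy annotation so the apiserver emits a change event
# for the Secret. If Contour only has stale cached data for it, this
# event should make it re-validate and fix the HTTPProxy status.
# "newtls" is the Secret from this issue; the annotation key is arbitrary.
kubectl annotate secret newtls \
  poked-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite
```

If the status stays wrong after this, the problem is not simply a stale cache entry waiting for an update event.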
Thanks @youngnick, will do.
We just had a live one:
$ kubectl get httpproxy
NAME FQDN TLS SECRET STATUS STATUS DESCRIPTION
app app.ns.domain.com app-tls invalid At least one error present, see Errors for details
app-alias preprod.app.domain.com app-tls invalid At least one error present, see Errors for details
prometheus prometheus.ns.domain.com valid Valid HTTPProxy
$ kubectl get secret app-tls
NAME TYPE DATA AGE
app-tls kubernetes.io/tls 2 50m
Adding an annotation and a label to the Secret did not make any difference, so after 5 minutes we restarted the contour Deployment and the HTTPProxy objects immediately became valid:
$ kubectl get httpproxy
NAME FQDN TLS SECRET STATUS STATUS DESCRIPTION
app app.ns.domain.com app-tls valid Valid HTTPProxy
app-alias preprod.app.domain.com app-tls valid Valid HTTPProxy
prometheus prometheus.ns.domain.com valid Valid HTTPProxy
Key and certificate look okay:
$ kubectl get secret app-tls -o json | jq -r '.data["tls.key"]' | base64 -d | openssl rsa -in /dev/stdin -check -noout
RSA key ok
$ kubectl get secret app-tls -o json | jq -r '.data["tls.crt"]' | base64 -d | step certificate inspect /dev/stdin
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 646... (0x1dcw...)
Signature Algorithm: SHA256-RSA
Issuer: O=CA
Validity
Not Before: Jul 7 06:33:49 2022 UTC
Not After : Jul 6 06:33:49 2024 UTC
Subject: CN=preprod.app.domain.com
Subject Public Key Info:
Public Key Algorithm: RSA
Public-Key: (2048 bit)
Modulus:
af:fc:fc:d2:49:be:2b:df:c6:30:4f:16:a0:9a:50:
74:57:e3:24:7a:76:57:20:f6:31:52:62:d7:39:fa:
03:ce:04:14:ff:1c:8f:ff:84:de:3f:54:27:91:e0:
89:a4:6c:95:75:9e:b7:4d:65:6f:58:58:09:8c:06:
11:42:ae:a0:33:62:6f:7c:5a:c9:13:10:57:2d:6b:
a6:05:ae:6b:7b:38:bc:d4:a6:45:4b:1a:dd:85:d7:
9f:fb:bf:36:13:d7:f1:df:79:e9:70:13:9c:94:cd:
d3:94:7e:f0:60:c9:32:67:67:ca:e0:6f:9b:f0:d9:
dc:00:cf:03:1c:7d:88:48:68:48:ae:83:46:e5:fc:
f4:a5:d8:51:78:08:b8:d6:c5:8c:43:7a:aa:fd:2c:
fa:e9:e6:88:03:c6:48:bd:81:e2:31:e9:20:8a:70:
2a:72:f1:a4:78:b8:21:e4:03:d1:d7:cd:65:b4:52:
48:74:3e:8f:93:1a:32:76:2f:5d:c8:db:8c:48:b6:
c9:19:67:65:c1:7e:fc:c6:4f:c1:ee:e7:33:c7:3a:
92:1f:7a:4a:26:e3:76:69:91:40:c1:9d:29:0f:47:
67:f9:a7:5f:11:c8:44:5c:5f:63:1a:ad:ac:07:1e:
61:e9:14:08:fa:d1:d5:7c:d5:f1:7e:ab:f1:4b:f4:
f9
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Key Usage:
Digital Signature, Key Encipherment
X509v3 Subject Alternative Name:
DNS:preprod.app.domain.com, DNS:app.ns.domain.com
X509v3 Subject Key Identifier:
98:7A:25:01:4E:89:A0:F3:35:E4:C1:E8:F0:23:BE:36:20:2E:F3:58
X509v3 Authority Key Identifier:
keyid:91:D1:46:82:DB:2A:92:F8:E4:39:27:78:16:36:11:F9:FB:0D:0B:16
X509v3 CRL Distribution Points:
Full Name:
URI:http://...
Authority Information Access:
OCSP - URI:http://...
CA Issuers - URI:http://...
1.3.6.1.4.1.311.21.7:
0..&+.....7....C...b.ř=.......].0...n...:..d...
X509v3 Extended Key Usage:
Server Authentication
1.3.6.1.4.1.311.21.10:
0.0...+.......
Signature Algorithm: SHA256-RSA
96:2c:60:02:d9:f1:b0:89:ea:23:ae:b3:c7:d7:00:96:fe:d7:
aa:b7:3f:ed:2c:c8:c6:79:9c:af:01:90:e0:7f:0b:bb:87:e5:
ab:19:0c:a5:b7:49:12:46:5f:b0:ee:58:e3:9f:b0:75:dd:c3:
96:e9:a0:33:d9:ca:86:7e:1e:21:e2:b5:21:a8:a9:1f:c9:bf:
8d:ad:1d:98:23:38:98:aa:e6:65:e4:15:b7:a0:8b:a2:e5:25:
a6:dd:15:4e:c2:5e:32:dd:96:31:d6:e2:6c:ed:07:48:38:78:
73:d3:6f:35:ba:26:e7:69:4b:b6:b9:61:16:8c:7c:0e:50:01:
b3:c8:57:e6:3b:8f:31:05:1a:be:78:13:eb:6a:b1:6e:82:ed:
6c:7f:4f:d9:5d:27:9e:01:29:42:ea:ec:e0:a3:58:3d:a2:c2:
5d:47:97:84:4b:6f:61:8d:7d:7e:2a:94:b7:35:56:e2:f5:35:
6d:3a:5e:c1:f2:ea:62:2e:ae:f1:be:b1:0e:a4:e9:ef:4a:df:
03:e8:e6:2c:30:25:a2:1b:1e:6d:bb:4c:c0:c4:89:8a:f2:3e:
36:af:ce:3c:49:d1:18:c3:e6:9e:aa:d2:f3:39:f7:c0:13:6d:
e0:79:e6:7f:b2:c5:3e:fa:b0:93:78:5e:9f:c0:a2:dc:bb:dd:
a0:68:36:8a
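One more check that is easy to miss: verifying the private key actually matches the certificate. A minimal sketch with plain openssl, using a freshly generated self-signed pair in place of the real Secret contents (which can't be shared here):

```shell
# Work in a scratch directory with a throwaway key/cert pair standing
# in for the Secret's tls.key / tls.crt.
cd "$(mktemp -d)"
openssl req -x509 -newkey rsa:2048 -nodes -keyout tls.key -out tls.crt \
  -subj "/CN=preprod.app.domain.com" -days 1 2>/dev/null

# A key belongs to a certificate exactly when their public keys match.
cert_pub=$(openssl x509 -in tls.crt -noout -pubkey)
key_pub=$(openssl pkey -in tls.key -pubout 2>/dev/null)

if [ "$cert_pub" = "$key_pub" ]; then
  echo "key and certificate match"
else
  echo "MISMATCH: key does not belong to this certificate"
fi
```

Against a live Secret, feed the real files in instead, e.g. `kubectl get secret app-tls -o json | jq -r '.data["tls.crt"]' | base64 -d > tls.crt` (and likewise for tls.key).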
Are we possibly being hit by kubernetes #109115?
If so, Contour v1.21.1 uses client-go 0.23.1 (16 Dec 2021), which does not include the fix for this (29 Mar 2022). Looks like we'd need to update to 0.23.8 or 0.24.2 (17 Jun 2022) - main has 0.24.1.
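For reference, the dependency bump itself would be a routine go.mod change in a Contour checkout. A sketch; the exact set of companion k8s.io modules that need bumping may differ:

```shell
# Bump client-go to a version containing the informer fix, together
# with the usual matching k8s.io modules (versions from the discussion
# above; verify against Contour's actual go.mod before relying on this).
go get k8s.io/client-go@v0.24.2 \
       k8s.io/api@v0.24.2 \
       k8s.io/apimachinery@v0.24.2
go mod tidy
```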
That would certainly seem to fit with what you're seeing - if we bump main, can you check with that? I'd like to avoid cutting a 1.21.2 unless we are pretty sure that the client-go upgrade will fix the issue.
That would certainly seem to fit with what you're seeing - if we bump main, can you check with that?
Yes, we can certainly test with main.
I'd like to avoid cutting a 1.21.2 unless we are pretty sure that the client-go upgrade will fix the issue.
Makes sense.
Thanks for merging @skriss. Initial testing using the image from main indicates the issue is resolved, but we'll wait for several more hours of testing before declaring victory.
Yesterday I removed the Secret object referenced by an HTTPProxy and the status changed to invalid as expected:
$ kubectl get httpproxy explorer -o json | jq -r '.status.conditions'
[
{
"errors": [
{
"message": "Spec.VirtualHost.TLS Secret \"explorer-tls\" is invalid: Secret not found",
"reason": "SecretNotValid",
"status": "True",
"type": "TLSError"
}
],
"lastTransitionTime": "2022-07-08T15:44:37Z",
"message": "At least one error present, see Errors for details",
"observedGeneration": 1,
"reason": "ErrorPresent",
"status": "False",
"type": "Valid"
}
]
Today I recreated the Secret object but the status has not updated:
$ kubectl get secret explorer-tls -o yaml | grep -P '\s+name:|^type:|creationTimestamp:'
creationTimestamp: "2022-07-09T08:54:20Z"
name: explorer-tls
type: kubernetes.io/tls
We're also seeing status updates on newly created HTTPProxy objects taking a really long time - for example:
time="2022-07-09T10:17:34Z" level=info msg="[...] PASS: waiting for HTTPProxy to become valid, took 4m30.005565285s"
time="2022-07-09T10:19:05Z" level=info msg="[...] PASS: waiting for HTTPProxy to become valid, took 1m40.005384188s"
time="2022-07-09T10:21:31Z" level=info msg="[...] PASS: waiting for HTTPProxy to become valid, took 2m10.004783598s"
time="2022-07-09T10:24:22Z" level=info msg="[...] PASS: waiting for HTTPProxy to become valid, took 3m0.006428992s"
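A poll loop like the following reproduces that kind of measurement; this is a sketch, with an illustrative proxy name and poll interval rather than the actual test harness:

```shell
# Time how long an HTTPProxy takes to report Valid.
proxy=explorer              # illustrative name
start=$(date +%s)
while :; do
  status=$(kubectl get httpproxy "$proxy" \
    -o jsonpath='{.status.conditions[?(@.type=="Valid")].status}')
  if [ "$status" = "True" ]; then
    echo "PASS: HTTPProxy became valid after $(( $(date +%s) - start ))s"
    break
  fi
  sleep 5
done
```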
Some more data points:
Example of an HTTPProxy in a bad state that never recovers even if the HTTPProxy and Secret are deleted and recreated:
$ kubectl get httpproxy --no-headers
new-explorer new-explorer.domain1 new-explorer-tls invalid At least one error present, see Errors for details
new-explorer-preprod new-explorer.domain2 new-explorer-tls invalid At least one error present, see Errors for details
$ kubectl get secret new-explorer-tls
NAME TYPE DATA AGE
new-explorer-tls kubernetes.io/tls 3 14h
$ kubectl delete -f httpproxy-explorer.yaml
httpproxy.projectcontour.io "new-explorer" deleted
$ kubectl get httpproxy
NAME FQDN TLS SECRET STATUS STATUS DESCRIPTION
new-explorer-preprod new-explorer.domain2 new-explorer-tls invalid At least one error present, see Errors for details
$ kubectl apply -f httpproxy-explorer.yaml
httpproxy.projectcontour.io/new-explorer created
$ kubectl get httpproxy --no-headers
new-explorer new-explorer.domain1 new-explorer-tls NotReconciled Waiting for controller
new-explorer-preprod new-explorer.domain2 new-explorer-tls invalid At least one error present, see Errors for details
$ kubectl get httpproxy --no-headers -w
new-explorer new-explorer.domain1 new-explorer-tls NotReconciled Waiting for controller
new-explorer-preprod new-explorer.domain2 new-explorer-tls invalid At least one error present, see Errors for details
new-explorer new-explorer.domain1 new-explorer-tls invalid At least one error present, see Errors for details
$ kubectl delete -f certificate-explorer.yaml
certificate.cert-manager.io "new-explorer" deleted
$ kubectl delete secret new-explorer-tls
secret "new-explorer-tls" deleted
$ kubectl apply -f certificate-explorer.yaml
certificate.cert-manager.io/new-explorer created
$ kubectl get secret new-explorer-tls
NAME TYPE DATA AGE
new-explorer-tls kubernetes.io/tls 3 7s
$ kubectl get httpproxy --no-headers -w
new-explorer new-explorer.domain1 new-explorer-tls invalid At least one error present, see Errors for details
new-explorer-preprod new-explorer.domain2 new-explorer-tls invalid At least one error present, see Errors for details
$ kubectl describe httpproxy new-explorer | grep -w Message:
Message: Spec.VirtualHost.TLS Secret "new-explorer-tls" is invalid: Secret not found
Message: At least one error present, see Errors for details
Then, after restarting the contour Deployment, the older HTTPProxy's status updated okay:
$ kubectl --as system:admin rollout restart deployment/contour
deployment.apps/contour restarted
$ kubectl rollout status deployment/contour
Waiting for deployment "contour" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "contour" rollout to finish: 1 old replicas are pending termination...
deployment "contour" successfully rolled out
$ kubectl get secret new-explorer-tls
NAME TYPE DATA AGE
new-explorer-tls kubernetes.io/tls 3 15m
$ kubectl get httpproxy -w
NAME FQDN TLS SECRET STATUS STATUS DESCRIPTION
new-explorer new-explorer.domain1 new-explorer-tls invalid At least one error present, see Errors for details
new-explorer-preprod new-explorer.domain2 new-explorer-tls valid Valid HTTPProxy
It seems like there's something weird going on with Secret resources falling out of reconciliation and staying there. That's the only thing I can think of that would explain this. It's likely that the changes we made to more aggressively drop Secrets from the cache have something to do with it, but we'll need to review the code pretty closely in the absence of a reliable reproduction.
That's a long and indirect way of saying that one of us is going to have to go up to our eyeballs in that Secret handling code and see if we can figure out why some Secrets are being marked as definitely not relevant even across Kubernetes update events.
To try to rule out an issue like https://github.com/projectcontour/contour/pull/4266, I scaled down the number of contour replicas from our normal 3 to 1 and added the --disable-leader-election command-line option. But, as usual, HTTPProxy status updates were really slow, and eventually calls to HTTPProxy FQDNs started failing, with status updates not happening at all.
While looking through the debug logs I also noticed that there were bursts of status updates for Ingress objects (not HTTPProxy where we have been seeing the issue) - for example:
time="2022-07-21T07:11:17Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:17Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:17Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:17Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:18Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:18Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:19Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:19Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:19Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:19Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:21Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:21Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:21Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
And looking at the status field of this Ingress:
status:
loadBalancer:
ingress:
- hostname: contour
- hostname: router-default.apps.cluster.domain
We set "--ingress-status-address=contour" so the "hostname: contour" is expected and managedFields shows:
- apiVersion: networking.k8s.io/v1
fieldsType: FieldsV1
fieldsV1:
f:status:
f:loadBalancer:
f:ingress: {}
manager: openshift-controller-manager
operation: Update
time: "2022-07-21T07:17:28Z"
Watching this object, I see the resourceVersion change continuously, so both contour and openshift-controller-manager are modifying the object. The reason the OpenShift Ingress controller is making changes is that the Ingress object only has the projectcontour.io/ingress.class annotation, and it is the default. Adding kubernetes.io/ingress.class stops the continuous updates. There were only 6 of these Ingress objects, but they were generating a lot of updates, so this should help.
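The fix amounts to pinning each Ingress to one controller explicitly. A sketch using the Ingress name and namespace from the logs above; the class value "contour" is an assumption, so substitute whatever class your other controller should ignore:

```shell
# Declare the ingress class explicitly so openshift-controller-manager
# stops treating this Ingress as its own and rewriting its status.
# "contour" is an assumed class value, not confirmed by this thread.
kubectl -n n1234 annotate ingress envoy-ingress \
  kubernetes.io/ingress.class=contour --overwrite
```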
Fixing the annotations, so the two ingress controllers stopped fighting over the Ingress status fields and flooding the status update channel, resolved the issue.
In summary, no problem with Contour. Thank you to @tsaarni, @youngnick, @skriss and others for looking into this.
Wow, thanks for the update @erwbgy, that's really interesting. I've made a mental note to ask people in the future to check whether the resourceVersion is changing rapidly.
@youngnick The surprising thing here was that it was Ingress object updates that were causing the issue but it was HTTPProxy objects that were affected. I spent a long time looking at the latter :-)
I think the cause here is that a lot of event churn, regardless of the source, can cause problems for Contour as it stands. It's definitely something that Envoy Gateway is keeping in mind as well.
What steps did you take and what happened:
Create an HTTPProxy, Secret, Service and Deployment - the HTTPProxy status is valid, as expected. A few days later, delete the HTTPProxy, Secret, Service and Deployment, then recreate only the HTTPProxy, Service and Deployment - the HTTPProxy status goes to Valid, which is NOT expected, as the Secret object no longer exists. It appears that the previous Secret state is cached and not updated.
We have also seen numerous cases of the opposite problem where an HTTPProxy shows as invalid complaining about a missing Secret when it does actually exist.
This does not always happen, but in one relatively busy cluster we see issues consistently after a few days. Restarting Contour is the only way to resolve them.
What did you expect to happen:
HTTPProxy Status should accurately reflect the existence of the Secret object.
Anything else you would like to add:
Contour logs show an XDS error for the Secret where the status is incorrect:
Not sure if it is related but running contour with --debug I see a continuous stream of messages from the leader:
Environment:
- Kubernetes version (use kubectl version): v1.21.8+ed4d8fd
- OS (e.g. from /etc/os-release): Red Hat Enterprise Linux CoreOS 48.84.202204202010-0