erwbgy closed this issue 2 years ago
An example (using Contour 1.21.0) where the Status says that the Secret does not exist but it does:
$ kubectl describe httpproxy global | grep Message
Message: Spec.VirtualHost.TLS Secret "newtls" is invalid: Secret not found
Message: At least one error present, see Errors for details
$ kubectl get secret "newtls"
NAME TYPE DATA AGE
newtls kubernetes.io/tls 2 3h24m
According to #3496 this can also be because the Secret is considered invalid but I don't see anything obviously wrong with it:
$ kubectl describe secret "newtls"
Name: newtls
Namespace: ...
Labels: <none>
Annotations: cert-manager.io/alt-names:
...
cert-manager.io/certificate-name: global
cert-manager.io/common-name: ...
cert-manager.io/ip-sans:
cert-manager.io/issuer-group: cert-manager.io
cert-manager.io/issuer-kind: ClusterIssuer
cert-manager.io/issuer-name: ...
cert-manager.io/uri-sans:
Type: kubernetes.io/tls
Data
====
tls.crt: 6245 bytes
tls.key: 1679 bytes
$ kubectl get secret newtls -o json | jq -r '.data["tls.crt"]' | base64 -d | showcerts2 /dev/stdin
---
X.509v3 TLS Certificate (RSA 2048) [Serial: 6690...9970]
...
Valid from: 2022-06-23T21:40:49Z
to: 2024-06-22T21:40:49Z
---
X.509v3 Intermediate CA Certificate (RSA 2048) [Serial: 1605...1715]
...
Valid from: 2015-10-07T15:22:23Z
to: 2025-10-07T15:32:23Z
---
X.509v3 Root CA Certificate (RSA 4096) [Serial: 2177...9169]
...
Valid from: 2015-10-07T14:06:52Z
to: 2035-10-07T14:11:56Z
Contour logs:
$ stern contour- | grep -i newtls
+ contour-786b48f464-hz9h4 › contour
+ contour-786b48f464-hfppz › contour
contour-786b48f464-hfppz contour time="2022-06-23T21:50:56Z" level=error msg="invalid TLS certificate: failed to locate certificate" context=KubernetesCache kind=Secret name=newtls namespace=... version=v1
contour-786b48f464-hfppz contour time="2022-06-24T09:17:25Z" level=error msg="stream terminated" connection=1114 context=xds error="context canceled" node_id=envoy-d6845759f-bxjrl node_version=v1.22.0 resource_names="[.../newtls/ced75e6a4e]" response_nonce=1634 type_url=type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret version_info=1634
Can you share tls.crt here as an attachment? Somehow the validation fails on it, since Contour prints "failed to locate certificate".
I can't share that specific tls.crt, but after restarting the Contour Deployment it validated fine. The Common Name and subjectAltNames were both set, and the certificate was valid. Is there anything else I need to look for?
I was searching cert-manager issues to find out if the certificate could be missing during enrollment, but it seems cert-manager creates the Secret only after enrollment succeeds.
Just in case, I tested creating a Secret with an empty tls.crt file, which results in "failed to locate certificate" as expected. Updating the certificate back into the Secret did trigger the HTTPProxy status update, and everything worked again. So it seems the problem is not that simple.
tbh it sounds like we might have a problem with caches going stale somewhere. If this happens again, could you see if making some minor change to the Secret (like adding a dummy annotation) fixes it?
We made some changes a while ago to be more aggressive about dropping Secrets from the caches, this will help us see if there's a bug there. (Poking the Secret should generate a change event from the apiserver, and if Contour has stale info for it, then that should result in it being fixed. If poking the Secret doesn't do anything, then it's not that.)
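For anyone following along, the "poke" suggested above can be a one-line annotate. This is a sketch: the Secret name is the one from this thread, and the annotation key is made up for illustration:

```shell
# Add/update a dummy annotation so the apiserver emits a change event
# for the Secret. If Contour only has stale cached data for it, this
# event should make it re-validate and fix the HTTPProxy status.
# "newtls" is the Secret from this issue; the annotation key is arbitrary.
kubectl annotate secret newtls \
  poked-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite
```

If the status stays wrong after this, the problem is not simply a stale cache entry waiting for an update event.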
Thanks @youngnick, will do.
We just had a live one:
$ kubectl get httpproxy
NAME FQDN TLS SECRET STATUS STATUS DESCRIPTION
app app.ns.domain.com app-tls invalid At least one error present, see Errors for details
app-alias preprod.app.domain.com app-tls invalid At least one error present, see Errors for details
prometheus prometheus.ns.domain.com valid Valid HTTPProxy
$ kubectl get secret app-tls
NAME TYPE DATA AGE
app-tls kubernetes.io/tls 2 50m
Adding an annotation and a label to the Secret did not make any difference, so after 5 minutes we restarted the contour Deployment and the HTTPProxy objects immediately became valid:
$ kubectl get httpproxy
NAME FQDN TLS SECRET STATUS STATUS DESCRIPTION
app app.ns.domain.com app-tls valid Valid HTTPProxy
app-alias preprod.app.domain.com app-tls valid Valid HTTPProxy
prometheus prometheus.ns.domain.com valid Valid HTTPProxy
Key and certificate look okay:
$ kubectl get secret app-tls -o json | jq -r '.data["tls.key"]' | base64 -d | openssl rsa -in /dev/stdin -check -noout
RSA key ok
$ kubectl get secret app-tls -o json | jq -r '.data["tls.crt"]' | base64 -d | step certificate inspect /dev/stdin
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 646... (0x1dcw...)
Signature Algorithm: SHA256-RSA
Issuer: O=CA
Validity
Not Before: Jul 7 06:33:49 2022 UTC
Not After : Jul 6 06:33:49 2024 UTC
Subject: CN=preprod.app.domain.com
Subject Public Key Info:
Public Key Algorithm: RSA
Public-Key: (2048 bit)
Modulus:
af:fc:fc:d2:49:be:2b:df:c6:30:4f:16:a0:9a:50:
74:57:e3:24:7a:76:57:20:f6:31:52:62:d7:39:fa:
03:ce:04:14:ff:1c:8f:ff:84:de:3f:54:27:91:e0:
89:a4:6c:95:75:9e:b7:4d:65:6f:58:58:09:8c:06:
11:42:ae:a0:33:62:6f:7c:5a:c9:13:10:57:2d:6b:
a6:05:ae:6b:7b:38:bc:d4:a6:45:4b:1a:dd:85:d7:
9f:fb:bf:36:13:d7:f1:df:79:e9:70:13:9c:94:cd:
d3:94:7e:f0:60:c9:32:67:67:ca:e0:6f:9b:f0:d9:
dc:00:cf:03:1c:7d:88:48:68:48:ae:83:46:e5:fc:
f4:a5:d8:51:78:08:b8:d6:c5:8c:43:7a:aa:fd:2c:
fa:e9:e6:88:03:c6:48:bd:81:e2:31:e9:20:8a:70:
2a:72:f1:a4:78:b8:21:e4:03:d1:d7:cd:65:b4:52:
48:74:3e:8f:93:1a:32:76:2f:5d:c8:db:8c:48:b6:
c9:19:67:65:c1:7e:fc:c6:4f:c1:ee:e7:33:c7:3a:
92:1f:7a:4a:26:e3:76:69:91:40:c1:9d:29:0f:47:
67:f9:a7:5f:11:c8:44:5c:5f:63:1a:ad:ac:07:1e:
61:e9:14:08:fa:d1:d5:7c:d5:f1:7e:ab:f1:4b:f4:
f9
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Key Usage:
Digital Signature, Key Encipherment
X509v3 Subject Alternative Name:
DNS:preprod.app.domain.com, DNS:app.ns.domain.com
X509v3 Subject Key Identifier:
98:7A:25:01:4E:89:A0:F3:35:E4:C1:E8:F0:23:BE:36:20:2E:F3:58
X509v3 Authority Key Identifier:
keyid:91:D1:46:82:DB:2A:92:F8:E4:39:27:78:16:36:11:F9:FB:0D:0B:16
X509v3 CRL Distribution Points:
Full Name:
URI:http://...
Authority Information Access:
OCSP - URI:http://...
CA Issuers - URI:http://...
1.3.6.1.4.1.311.21.7:
0..&+.....7....C...b.ř=.......].0...n...:..d...
X509v3 Extended Key Usage:
Server Authentication
1.3.6.1.4.1.311.21.10:
0.0...+.......
Signature Algorithm: SHA256-RSA
96:2c:60:02:d9:f1:b0:89:ea:23:ae:b3:c7:d7:00:96:fe:d7:
aa:b7:3f:ed:2c:c8:c6:79:9c:af:01:90:e0:7f:0b:bb:87:e5:
ab:19:0c:a5:b7:49:12:46:5f:b0:ee:58:e3:9f:b0:75:dd:c3:
96:e9:a0:33:d9:ca:86:7e:1e:21:e2:b5:21:a8:a9:1f:c9:bf:
8d:ad:1d:98:23:38:98:aa:e6:65:e4:15:b7:a0:8b:a2:e5:25:
a6:dd:15:4e:c2:5e:32:dd:96:31:d6:e2:6c:ed:07:48:38:78:
73:d3:6f:35:ba:26:e7:69:4b:b6:b9:61:16:8c:7c:0e:50:01:
b3:c8:57:e6:3b:8f:31:05:1a:be:78:13:eb:6a:b1:6e:82:ed:
6c:7f:4f:d9:5d:27:9e:01:29:42:ea:ec:e0:a3:58:3d:a2:c2:
5d:47:97:84:4b:6f:61:8d:7d:7e:2a:94:b7:35:56:e2:f5:35:
6d:3a:5e:c1:f2:ea:62:2e:ae:f1:be:b1:0e:a4:e9:ef:4a:df:
03:e8:e6:2c:30:25:a2:1b:1e:6d:bb:4c:c0:c4:89:8a:f2:3e:
36:af:ce:3c:49:d1:18:c3:e6:9e:aa:d2:f3:39:f7:c0:13:6d:
e0:79:e6:7f:b2:c5:3e:fa:b0:93:78:5e:9f:c0:a2:dc:bb:dd:
a0:68:36:8a
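One more check that is easy to miss: verifying the private key actually matches the certificate. A minimal sketch with plain openssl, using a freshly generated self-signed pair in place of the real Secret contents (which can't be shared here):

```shell
# Work in a scratch directory with a throwaway key/cert pair standing
# in for the Secret's tls.key / tls.crt.
cd "$(mktemp -d)"
openssl req -x509 -newkey rsa:2048 -nodes -keyout tls.key -out tls.crt \
  -subj "/CN=preprod.app.domain.com" -days 1 2>/dev/null

# A key belongs to a certificate exactly when their public keys match.
cert_pub=$(openssl x509 -in tls.crt -noout -pubkey)
key_pub=$(openssl pkey -in tls.key -pubout 2>/dev/null)

if [ "$cert_pub" = "$key_pub" ]; then
  echo "key and certificate match"
else
  echo "MISMATCH: key does not belong to this certificate"
fi
```

Against a live Secret, feed the real files in instead, e.g. `kubectl get secret app-tls -o json | jq -r '.data["tls.crt"]' | base64 -d > tls.crt` (and likewise for tls.key).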
Are we possibly being hit by kubernetes #109115?
If so, Contour v1.21.1 uses client-go 0.23.1 (16 Dec 2021), which does not include the fix for this (29 Mar 2022). Looks like we'd need to update to 0.23.8 or 0.24.2 (17 Jun 2022) - main has 0.24.1.
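For reference, the dependency bump itself would be a routine go.mod change in a Contour checkout. A sketch; the exact set of companion k8s.io modules that need bumping may differ:

```shell
# Bump client-go to a version containing the informer fix, together
# with the usual matching k8s.io modules (versions from the discussion
# above; verify against Contour's actual go.mod before relying on this).
go get k8s.io/client-go@v0.24.2 \
       k8s.io/api@v0.24.2 \
       k8s.io/apimachinery@v0.24.2
go mod tidy
```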
That would certainly seem to fit with what you're seeing - if we bump main, can you check with that? I'd like to avoid cutting a 1.21.2 unless we are pretty sure that the client-go upgrade will fix the issue.
That would certainly seem to fit with what you're seeing - if we bump main, can you check with that?
Yes, we can certainly test with main.
I'd like to avoid cutting a 1.21.2 unless we are pretty sure that the client-go upgrade will fix the issue.
Makes sense.
Thanks for merging @skriss. Initial testing using the image from main indicates the issue is resolved, but we'll wait for several more hours of testing before declaring victory.
Yesterday I removed the Secret object referenced by an HTTPProxy and the status changed to invalid as expected:
$ kubectl get httpproxy explorer -o json | jq -r '.status.conditions'
[
{
"errors": [
{
"message": "Spec.VirtualHost.TLS Secret \"explorer-tls\" is invalid: Secret not found",
"reason": "SecretNotValid",
"status": "True",
"type": "TLSError"
}
],
"lastTransitionTime": "2022-07-08T15:44:37Z",
"message": "At least one error present, see Errors for details",
"observedGeneration": 1,
"reason": "ErrorPresent",
"status": "False",
"type": "Valid"
}
]
Today I recreated the Secret object but the status has not updated:
$ kubectl get secret explorer-tls -o yaml | grep -P '\s+name:|^type:|creationTimestamp:'
creationTimestamp: "2022-07-09T08:54:20Z"
name: explorer-tls
type: kubernetes.io/tls
We're also seeing status updates on newly created HTTPProxy objects taking a really long time - for example:
time="2022-07-09T10:17:34Z" level=info msg="[...] PASS: waiting for HTTPProxy to become valid, took 4m30.005565285s"
time="2022-07-09T10:19:05Z" level=info msg="[...] PASS: waiting for HTTPProxy to become valid, took 1m40.005384188s"
time="2022-07-09T10:21:31Z" level=info msg="[...] PASS: waiting for HTTPProxy to become valid, took 2m10.004783598s"
time="2022-07-09T10:24:22Z" level=info msg="[...] PASS: waiting for HTTPProxy to become valid, took 3m0.006428992s"
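A poll loop like the following reproduces that kind of measurement; this is a sketch, with an illustrative proxy name and poll interval rather than the actual test harness:

```shell
# Time how long an HTTPProxy takes to report Valid.
proxy=explorer              # illustrative name
start=$(date +%s)
while :; do
  status=$(kubectl get httpproxy "$proxy" \
    -o jsonpath='{.status.conditions[?(@.type=="Valid")].status}')
  if [ "$status" = "True" ]; then
    echo "PASS: HTTPProxy became valid after $(( $(date +%s) - start ))s"
    break
  fi
  sleep 5
done
```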
Some more data points:
Example of an HTTPProxy in a bad state that never recovers even if the HTTPProxy and Secret are deleted and recreated:
$ kubectl get httpproxy --no-headers
new-explorer new-explorer.domain1 new-explorer-tls invalid At least one error present, see Errors for details
new-explorer-preprod new-explorer.domain2 new-explorer-tls invalid At least one error present, see Errors for details
$ kubectl get secret new-explorer-tls
NAME TYPE DATA AGE
new-explorer-tls kubernetes.io/tls 3 14h
$ kubectl delete -f httpproxy-explorer.yaml
httpproxy.projectcontour.io "new-explorer" deleted
$ kubectl get httpproxy
NAME FQDN TLS SECRET STATUS STATUS DESCRIPTION
new-explorer-preprod new-explorer.domain2 new-explorer-tls invalid At least one error present, see Errors for details
$ kubectl apply -f httpproxy-explorer.yaml
httpproxy.projectcontour.io/new-explorer created
$ kubectl get httpproxy --no-headers
new-explorer new-explorer.domain1 new-explorer-tls NotReconciled Waiting for controller
new-explorer-preprod new-explorer.domain2 new-explorer-tls invalid At least one error present, see Errors for details
$ kubectl get httpproxy --no-headers -w
new-explorer new-explorer.domain1 new-explorer-tls NotReconciled Waiting for controller
new-explorer-preprod new-explorer.domain2 new-explorer-tls invalid At least one error present, see Errors for details
new-explorer new-explorer.domain1 new-explorer-tls invalid At least one error present, see Errors for details
$ kubectl delete -f certificate-explorer.yaml
certificate.cert-manager.io "new-explorer" deleted
$ kubectl delete secret new-explorer-tls
secret "new-explorer-tls" deleted
$ kubectl apply -f certificate-explorer.yaml
certificate.cert-manager.io/new-explorer created
$ kubectl get secret new-explorer-tls
NAME TYPE DATA AGE
new-explorer-tls kubernetes.io/tls 3 7s
$ kubectl get httpproxy --no-headers -w
new-explorer new-explorer.domain1 new-explorer-tls invalid At least one error present, see Errors for details
new-explorer-preprod new-explorer.domain2 new-explorer-tls invalid At least one error present, see Errors for details
$ kubectl describe httpproxy new-explorer | grep -w Message:
Message: Spec.VirtualHost.TLS Secret "new-explorer-tls" is invalid: Secret not found
Message: At least one error present, see Errors for details
Then, after restarting the contour Deployment, the older HTTPProxy's status updated okay:
$ kubectl --as system:admin rollout restart deployment/contour
deployment.apps/contour restarted
$ kubectl rollout status deployment/contour
Waiting for deployment "contour" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "contour" rollout to finish: 1 old replicas are pending termination...
deployment "contour" successfully rolled out
$ kubectl get secret new-explorer-tls
NAME TYPE DATA AGE
new-explorer-tls kubernetes.io/tls 3 15m
$ kubectl get httpproxy -w
NAME FQDN TLS SECRET STATUS STATUS DESCRIPTION
new-explorer new-explorer.domain1 new-explorer-tls invalid At least one error present, see Errors for details
new-explorer-preprod new-explorer.domain2 new-explorer-tls valid Valid HTTPProxy
It seems like there's something weird going on with Secret resources falling out of reconciliation and staying there. That's the only thing I can think of that would explain this. It's likely that the changes we made to more aggressively drop Secrets from the cache have something to do with it, but we'll need to review the code pretty closely in the absence of a reliable reproduction.
That's a long and indirect way of saying that one of us is going to have to go up to our eyeballs in that Secret handling code and see if we can figure out why some Secrets are being marked as definitely not relevant even across Kubernetes update events.
To try to rule out an issue like https://github.com/projectcontour/contour/pull/4266, I scaled down the number of contour replicas from our normal 3 to 1 and added the --disable-leader-election command-line option. But, as usual, HTTPProxy status updates were really slow, and eventually calls to HTTPProxy FQDNs started failing, with status updates not happening at all.
While looking through the debug logs I also noticed that there were bursts of status updates for Ingress objects (not HTTPProxy where we have been seeing the issue) - for example:
time="2022-07-21T07:11:17Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:17Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:17Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:17Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:18Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:18Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:19Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:19Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:19Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:19Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:21Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:21Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
time="2022-07-21T07:11:21Z" level=debug msg="received a status update" context=StatusUpdateHandler name=envoy-ingress namespace=n1234
And looking at the status field of this Ingress:
status:
loadBalancer:
ingress:
- hostname: contour
- hostname: router-default.apps.cluster.domain
We set "--ingress-status-address=contour" so the "hostname: contour" is expected and managedFields shows:
- apiVersion: networking.k8s.io/v1
fieldsType: FieldsV1
fieldsV1:
f:status:
f:loadBalancer:
f:ingress: {}
manager: openshift-controller-manager
operation: Update
time: "2022-07-21T07:17:28Z"
Watching this object, I see the resourceVersion change continuously, so both contour and openshift-controller-manager are modifying the object. The reason the OpenShift Ingress controller is making changes is that the Ingress object only has the projectcontour.io/ingress.class annotation, and it is the default. Adding kubernetes.io/ingress.class stops the continuous updates. There were only 6 of these Ingress objects, but they were generating a lot of updates, so this should help.
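The fix amounts to pinning each Ingress to one controller explicitly. A sketch using the Ingress name and namespace from the logs above; the class value "contour" is an assumption, so substitute whatever class your other controller should ignore:

```shell
# Declare the ingress class explicitly so openshift-controller-manager
# stops treating this Ingress as its own and rewriting its status.
# "contour" is an assumed class value, not confirmed by this thread.
kubectl -n n1234 annotate ingress envoy-ingress \
  kubernetes.io/ingress.class=contour --overwrite
```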
Fixing the annotations, so the two ingress controllers stopped fighting over the Ingress status fields and flooding the status update channel, resolved the issue.
In summary, no problem with Contour. Thank you to @tsaarni, @youngnick, @skriss and others for looking into this.
Wow, thanks for the update @erwbgy, that's really interesting. I've made a mental note to ask people in the future to check whether the resourceVersion is changing rapidly.
@youngnick The surprising thing here was that it was Ingress object updates that were causing the issue but it was HTTPProxy objects that were affected. I spent a long time looking at the latter :-)
I think the cause here is that a lot of event churn, regardless of the source, can cause problems for Contour as it stands. It's definitely something that Envoy Gateway is keeping in mind as well.
What steps did you take and what happened:
Create an HTTPProxy, Secret, Service and Deployment - the HTTPProxy status is valid, as expected. A few days later, delete the HTTPProxy, Secret, Service and Deployment, then recreate only the HTTPProxy, Service and Deployment - the HTTPProxy status goes to Valid, which is NOT expected, as the Secret object no longer exists. It appears that the previous Secret state is cached and not updated.
We have also seen numerous cases of the opposite problem where an HTTPProxy shows as invalid complaining about a missing Secret when it does actually exist.
This does not always happen, but in one relatively busy cluster we see issues consistently after a few days. Restarting Contour is the only way to resolve them.
What did you expect to happen:
HTTPProxy Status should accurately reflect the existence of the Secret object.
Anything else you would like to add:
Contour logs show an XDS error for the Secret where the status is incorrect:
Not sure if it is related but running contour with --debug I see a continuous stream of messages from the leader:
Environment:
- Kubernetes version (use kubectl version): v1.21.8+ed4d8fd
- OS (e.g. from /etc/os-release): Red Hat Enterprise Linux CoreOS 48.84.202204202010-0