projectcontour / contour

Contour is a Kubernetes ingress controller using Envoy proxy.
https://projectcontour.io
Apache License 2.0

HTTPProxy Status updates slow when there are lots of objects #5001

Closed: erwbgy closed this issue 1 year ago

erwbgy commented 1 year ago

What steps did you take and what happened: HTTPProxy status takes >5 minutes (sometimes more than 10 minutes) to update in a multi-tenant cluster with ~1500 Namespaces, ~1500 HTTPProxy/Ingress objects, ~5000 Services and ~30000 Secrets. Of the Secrets, ~6000 are not of type kubernetes.io/dockercfg or kubernetes.io/service-account-token.
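
For reference, counts like these can be gathered with plain kubectl (a sketch; not necessarily how the numbers above were produced):

# Rough per-kind object counts
kubectl get namespaces --no-headers | wc -l
kubectl get httpproxies --all-namespaces --no-headers | wc -l
kubectl get ingresses --all-namespaces --no-headers | wc -l
kubectl get services --all-namespaces --no-headers | wc -l
kubectl get secrets --all-namespaces --no-headers | wc -l
# Secrets that are neither dockercfg nor service-account tokens
kubectl get secrets --all-namespaces --no-headers \
  --field-selector 'type!=kubernetes.io/dockercfg,type!=kubernetes.io/service-account-token' | wc -l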

HTTPProxy objects have the NotReconciled status with the Waiting for controller status message for a very long time. Interestingly, the .status.loadBalancer.ingress.hostname field is set almost immediately.

What did you expect to happen: HTTPProxy status to be updated in less than 3 minutes.

Anything else you would like to add: We were running Contour 1.21.3 in this cluster until a week ago with no issues, so we seem to have hit some sort of threshold. Initially, giving the Contour Pod more resources helped speed up the status updates but did not fix the issue. We switched to Contour 1.23.2, since newer versions generally include performance improvements, but we still see slow HTTPProxy updates.

We briefly tried switching to 1.24.0-rc1, which has several changes that looked like they might help (#4827, #4846, #4912, #4792), but we still saw slow HTTPProxy updates, and the .status.loadBalancer.ingress.hostname field was not being updated on Ingress or HTTPProxy objects. We need to spend more time investigating this, though.

Environment:

erwbgy commented 1 year ago

Deleting almost 700 unused Secret objects (created before cert-manager owner references were configured) didn't initially help, but after restarting the Contour Deployment, status updates started taking around 1 minute or less.

erwbgy commented 1 year ago

After an hour, status update times sometimes spike to 2-3 minutes but are normally still ~1 minute or less, so the number of Secret objects appears to be the trigger.

Interestingly, the leader Contour Pod is using 59m of CPU and ~2GB of memory, whereas the other two Pods in the Deployment are using ~3500m and ~4GB, which is different from Contour 1.21.

sunjayBhatia commented 1 year ago

Interestingly, the leader Contour Pod is using 59m of CPU and ~2GB of memory, whereas the other two Pods in the Deployment are using ~3500m and ~4GB, which is different from Contour 1.21.

Hm, that could be related to how many Envoy instances are connected to each Contour server. You could use the gRPC server metrics each Contour reports to see how many RPCs each is handling.

sunjayBhatia commented 1 year ago

It might be good to chart the rate of contour_eventhandler_operation_total (per kind/op) in this environment, to see what is changing and possibly causing the status update queue to fill up.
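
For example, a rate query over that counter, or a raw scrape of it (a sketch; the kind/op label names and Contour's default metrics port 8000 are assumptions):

# PromQL (e.g. in Grafana): per-kind/op event rate over the last 5 minutes
#   sum by (kind, op) (rate(contour_eventhandler_operation_total[5m]))
# Or read the raw counter straight off a Contour pod:
kubectl -n projectcontour port-forward deploy/contour 8000:8000 &
curl -s http://localhost:8000/metrics | grep '^contour_eventhandler_operation_total'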

skriss commented 1 year ago

We briefly tried switching to 1.24.0-rc1, which has several changes that looked like they might help (https://github.com/projectcontour/contour/pull/4827, https://github.com/projectcontour/contour/pull/4846, https://github.com/projectcontour/contour/pull/4912, https://github.com/projectcontour/contour/pull/4792), but we still saw slow HTTPProxy updates, and the .status.loadBalancer.ingress.hostname field was not being updated on Ingress or HTTPProxy objects. We need to spend more time investigating this, though.

I would definitely expect some of these changes that are coming in 1.24 to help here, though hard to say how much. Will take a look at the potential issue with hostname not updating -- I'm definitely seeing ingress.ip being updated in my test cluster FWIW.

erwbgy commented 1 year ago

We briefly tried switching to 1.24.0-rc1, which has several changes that looked like they might help (#4827, #4846, #4912, #4792), but we still saw slow HTTPProxy updates, and the .status.loadBalancer.ingress.hostname field was not being updated on Ingress or HTTPProxy objects. We need to spend more time investigating this, though.

I would definitely expect some of these changes that are coming in 1.24 to help here, though hard to say how much. Will take a look at the potential issue with hostname not updating -- I'm definitely seeing ingress.ip being updated in my test cluster FWIW.

Thanks @skriss. The loadBalancer field updated just fine in my much smaller dev cluster, so it does happen. In 1.23 it updated regardless of the other status update slowness, while on 1.24.0-rc1 it appeared to be slower.

erwbgy commented 1 year ago

It might be good to chart the rate of contour_eventhandler_operation_total (per kind/op) in this environment, to see what is changing and possibly causing the status update queue to fill up.

Thanks @sunjayBhatia, I'll take a look at this metric and the others, which I see are nicely documented at https://projectcontour.io/docs/v1.23.2/guides/prometheus/.

sunjayBhatia commented 1 year ago

It might be good to chart the rate of contour_eventhandler_operation_total (per kind/op) in this environment, to see what is changing and possibly causing the status update queue to fill up.

Thanks @sunjayBhatia, I'll take a look at this metric and the others, which I see are nicely documented at https://projectcontour.io/docs/v1.23.2/guides/prometheus/.

These gRPC metrics might be useful as well: https://github.com/grpc-ecosystem/go-grpc-prometheus/blob/82c243799c991a7d5859215fba44a81834a52a71/server_metrics.go#L30-L55
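
For example, comparing the gRPC server counters across the Contour pods shows which instance is serving the most xDS traffic (a sketch; the app=contour label and port 8000 are assumptions based on the example deployment):

for pod in $(kubectl -n projectcontour get pods -l app=contour -o jsonpath='{.items[*].metadata.name}'); do
  echo "== ${pod}"
  # Read the pod's metrics endpoint via the API server proxy
  kubectl get --raw "/api/v1/namespaces/projectcontour/pods/${pod}:8000/proxy/metrics" \
    | grep -E '^grpc_server_(started_total|handled_total)'
done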

skriss commented 1 year ago

@erwbgy one thing we did today is merge https://github.com/projectcontour/contour/pull/5003, which will let you increase the Kubernetes API client QPS/burst. As mentioned in the PR, the client-go defaults are 5 QPS with a burst of 10, which is probably too low for a large cluster. Interestingly, controller-runtime (via a code path that we are not currently using) defaults these to 20/30. We'd definitely be interested to hear if increasing them helps with your issue. The PR is merged into main and will be available in the upcoming 1.24.
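
A sketch of how that might be used once it ships (the flag names below are assumptions for illustration; check the merged PR or contour serve --help for the actual names and defaults):

# Hypothetical serve invocation with raised client rate limits
contour serve --incluster \
  --kubernetes-client-qps=50 \
  --kubernetes-client-burst=100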

erwbgy commented 1 year ago

OK @skriss, I will try the main image and report back.

One thing I do notice is that the memory usage of the leader Contour Pod grows (currently at 13GB), and after a while Ingress status does not get updated within 10 minutes. Restarting Contour resolves the issue for a few hours. The Contour logs do show errors for some invalid Ingress objects that we haven't yet got users to fix, but many of these have been there for a long time, so they are perhaps not related.

erwbgy commented 1 year ago

We tried ghcr.io/projectcontour/contour@sha256:63d97ce8fecc10c24a353083562ac3dd31abef1421df226d64c4b5d738a3bb8f but had to abandon the test as Ingress object status was not getting updated (after 10 minutes). Switching back to v1.23.2, the status was updated immediately, though HTTPProxy status is still slow.

sunjayBhatia commented 1 year ago

had to abandon the test as Ingress object status was not getting updated (after 10 minutes)

Could you describe in a little more detail what is happening? Are you creating a new Ingress and not seeing status added, modifying an existing one, etc.?

erwbgy commented 1 year ago

We have a node-check application that regularly creates a Pod on every node in the cluster, then a Service, Ingress and HTTPProxy to access it to perform various checks.

Before doing the checks via the Ingress, we wait for the .status.loadBalancer.ingress[0].hostname field to be set. In the cluster having issues, when we switch to the v1.24.0-rc1 or main image the Ingress status.loadBalancer field is not being updated (after 10 minutes), while with v1.23.2 this field is updated almost immediately.

Before doing the checks via the HTTPProxy, we wait for its status to be Valid. In the cluster having issues, the status takes much longer to be updated; elsewhere it takes a few seconds at most. Right now, using v1.23.2, I'm seeing anywhere from 40 seconds to 10 minutes, which is our maximum timeout.
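
For reference, the two checks described above roughly boil down to polling these fields (a sketch; my-check and its namespace are hypothetical names):

# Ingress: wait for the load balancer hostname to be published
kubectl -n node-check get ingress my-check \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

# HTTPProxy: wait for the proxy to be reconciled as valid
kubectl -n node-check get httpproxy my-check \
  -o jsonpath='{.status.currentStatus}'    # expect "valid", not "NotReconciled"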

sunjayBhatia commented 1 year ago

As a sanity check, when you switch to the newer version of Contour, could you check the Lease objects to make sure there isn't any funny business around leader election? Have you customized the leader election lease duration at all, or any other leader election parameters? Only the leader Contour instance writes status, so if there's something going on with a leader not being elected, that could contribute.
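
For example (the lease name and namespace below match the log lines quoted in the next comment):

# Who currently holds the leader-election lease, and when it was last renewed
kubectl -n projectcontour get lease leader-elect -o yaml
kubectl -n projectcontour get lease leader-elect \
  -o jsonpath='{.spec.holderIdentity} {.spec.renewTime}{"\n"}'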

sunjayBhatia commented 1 year ago

When Contour starts up, you should see:

time="2023-01-26T21:00:33Z" level=info msg="Watching Service for Ingress status" envoy-service-name=envoy envoy-service-namespace=projectcontour
...
time="2023-01-26T21:00:33Z" level=info msg="attempting to acquire leader lease projectcontour/leader-elect...\n" caller="leaderelection.go:248" context=kubernetes

following that, in the leader:

time="2023-01-26T21:00:33Z" level=info msg="successfully acquired lease projectcontour/leader-elect\n" caller="leaderelection.go:258" context=kubernetes
time="2023-01-26T21:00:33Z" level=info msg="started status update handler" context=StatusUpdateHandler
time="2023-01-26T21:00:33Z" level=info msg="received a new address for status.loadBalancer" context=loadBalancerStatusWriter loadbalancer-address=172.18.255.200

It would be interesting to see if there are any significant time gaps between these log lines.
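
One way to pull just those lines out for comparison across pods (a sketch; assumes the app=contour label from the example deployment):

kubectl -n projectcontour logs -l app=contour --prefix --tail=-1 \
  | grep -E 'leader lease|status update handler|received a new address'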

sunjayBhatia commented 1 year ago

You can also turn on debug logging to see if there is a delay in actually writing status updates. If you see the debug log received a status update (with name + namespace in the log fields) but that resource is not actually updated for a while, then the issue is likely in the actual write to the API server. If that log does not show up reasonably quickly after the resource is added/updated, then it's likely that the DAG rebuild process is the bottleneck.
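
For example, with debug logging enabled (assuming the serve command's debug flag; the grep target is the message quoted above):

# Enable debug logs by adding the flag to the contour serve args, e.g.:
#   contour serve --incluster --debug ...
# Then compare when a resource changed vs. when its status update was queued:
kubectl -n projectcontour logs -l app=contour --tail=-1 \
  | grep 'received a status update'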

erwbgy commented 1 year ago

When deploying the new Contour version:

With both Contour 1.23.2 and 1.24-rc1/main, the memory usage of the leader Contour Pod continues to grow over time, as high as 16Gi, which is what requests.memory is now set to.

erwbgy commented 1 year ago

The Contour requests.cpu was set to 16 with GOMAXPROCS=16. When I changed this back to requests.cpu=5 with GOMAXPROCS=5 to match the other clusters, Ingress status updates with 1.24-rc1/main became very quick, as with 1.23.2.
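
For reference, the equivalent change with kubectl (deployment name and namespace as in the example manifests):

kubectl -n projectcontour set resources deployment/contour --requests=cpu=5
kubectl -n projectcontour set env deployment/contour GOMAXPROCS=5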

HTTPProxy objects are still stuck in the NotReconciled state though (after waiting 10 minutes).

erwbgy commented 1 year ago

Downgrading to v1.22.3 shows a significant improvement for the first 10 minutes: Ingress status updates within 2 seconds and HTTPProxy within less than a minute with requests.cpu=5 and GOMAXPROCS=6. Then the HTTPProxy update time starts to increase until we get timeouts.

Using v1.22.0 is good for 30 minutes until Ingress status updates start timing out.

erwbgy commented 1 year ago

As the petunias said: "Oh no, not again".

This was a repeat of https://github.com/projectcontour/contour/issues/4608: multiple ingress controllers were continuously updating an Ingress object's status.loadBalancer field, because the Ingress had no kubernetes.io/ingress.class annotation (so the default ingress controller processed it) but did have a projectcontour.io/ingress.class annotation (so Contour processed it too).
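
One way to find Ingress objects in that state, i.e. carrying projectcontour.io/ingress.class but no kubernetes.io/ingress.class annotation (a sketch using jq):

kubectl get ingress --all-namespaces -o json | jq -r '
  .items[]
  | select(.metadata.annotations["projectcontour.io/ingress.class"] != null
           and .metadata.annotations["kubernetes.io/ingress.class"] == null)
  | "\(.metadata.namespace)/\(.metadata.name)"'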

For anyone who tries to debug anything like this in future, the --kubernetes-debug=6 contour command-line option really helped (as suggested above). I expected the contour_eventhandler_operation_total metric to be really high for Ingress Update, but it does not include status updates (https://github.com/projectcontour/contour/issues/5007 will help with this, and https://github.com/projectcontour/contour/issues/5005 will help rule out DAG issues).

Thank you @sunjayBhatia and @skriss for your help in diagnosing this and apologies for the repeat issue.

erwbgy commented 1 year ago

It is worth noting that even after fixing the annotations, restarting Contour was required to resolve the issue, as there were thousands of pending PUT requests to update the Ingress status, so no other updates happened at all. There should probably be some logic to ensure that updates to one object cannot monopolise the entire update queue.