projectcontour / contour

Contour is a Kubernetes ingress controller using Envoy proxy.
https://projectcontour.io
Apache License 2.0

High memory after upgrading to Contour v1.2 #2306

Closed · stevesloka closed this 2 years ago

stevesloka commented 4 years ago

Reported in the k8s Slack: a user saw higher memory usage after upgrading from Contour v1.1 --> v1.2:

[screenshot: cadvisor/kubelet memory usage graph for the contour pods]

(https://kubernetes.slack.com/archives/C8XRH2R4J/p1583162745109500)

//cc @bgagnon

davecheney commented 4 years ago

Is this the contour process or the envoy process?

bgagnon commented 4 years ago

Thanks for opening this issue for me @stevesloka!

@davecheney this is the contour process; we run Envoy in a separate DaemonSet.

davecheney commented 4 years ago

Thanks for confirming. What is the size of this Contour install: how many services, secrets, ingresses, ingressroutes (if used), and httpproxies (if used) are in scope for this Contour?

bgagnon commented 4 years ago

I used some kubectl -o json and jq magic to come up with these:

Do you think the orphaned TLS secrets could be at fault here? Is Contour keeping those in memory?
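
(For reference, a minimal sketch of doing the same kind of counting with client-go instead of the kubectl -o json and jq approach mentioned above; this is an assumed alternative, not the exact tooling used here.)

```go
// Minimal sketch, not the commands used in this thread: count a couple of the
// resource types in question cluster-wide using client-go.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	secrets, err := cs.CoreV1().Secrets(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	services, err := cs.CoreV1().Services(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("secrets: ", len(secrets.Items))
	fmt.Println("services:", len(services.Items))
}
```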

davecheney commented 4 years ago

Nice. Thanks for those numbers.

davecheney commented 4 years ago

How are you monitoring memory usage? Is this from the metrics reported by Contour's /metrics endpoint?

bgagnon commented 4 years ago

The screenshot at the top of this thread is from cadvisor/kubelet. It includes memory from the contour container as well as a sidecar we deploy.

Here are the go_memstats_ metrics values for the two contour containers (one for each pod):

```
go_memstats_sys_bytes 637655288
go_memstats_sys_bytes 568054008
go_memstats_heap_sys_bytes 481263616
go_memstats_heap_sys_bytes 454492160
```
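
(For context, a minimal sketch, not Contour code: these go_memstats_* gauges are exported by the Prometheus Go collector from the runtime's memory statistics, roughly as below.)

```go
// Minimal sketch (not Contour code) of where the go_memstats_* gauges come
// from: the Prometheus Go collector reads runtime.MemStats, so
// go_memstats_sys_bytes corresponds to MemStats.Sys and
// go_memstats_heap_sys_bytes to MemStats.HeapSys.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Println("sys_bytes:     ", m.Sys)     // total bytes obtained from the OS
	fmt.Println("heap_sys_bytes:", m.HeapSys) // heap bytes obtained from the OS
}
```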

Our sidecars are problematic in themselves -- consuming 634Mi and 465Mi of memory (go_memstats_sys_bytes), respectively. The sidecars are simple xDS gRPC proxies.

We didn't touch these sidecars, but it's entirely possible they are experiencing memory bloat as a side effect of the Contour upgrade.

Unfortunately I don't have all the individual numbers from before the upgrade. But now I think I should maybe roll back from 1.2.0 and see whether the sidecar numbers change.

davecheney commented 4 years ago

> Our sidecars are problematic in themselves

Can you tell me more about these sidecars?

bgagnon commented 4 years ago

The sidecars intercept xDS responses to inject an ALS configuration. This is our solution for #1691, which is not (yet) supported in Contour. We also plan to implement #1690 via this xDS proxy (not done yet).

The xDS proxy is a gRPC server in Go that implements the same services as Contour. All RPCs are pass-throughs, except for one, which swaps in the ALS config.

We have ongoing issues where we appear to be leaking goroutines for gRPC streams. If something changed on that front in Contour 1.2, it might have exacerbated our bug.
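
(As a hedged aside, not the sidecar's actual code: one cheap way to confirm a stream-goroutine leak like the one described above is to watch the runtime goroutine count and check whether it climbs in step with open gRPC streams. The go_goroutines gauge from the Prometheus Go collector exposes the same number if the proxy already serves /metrics.)

```go
// Minimal sketch, assuming nothing about the real sidecar: periodically log
// the goroutine count so a leak of per-stream goroutines shows up as a
// steadily climbing number rather than a plateau.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	for range time.Tick(30 * time.Second) {
		log.Printf("goroutines=%d", runtime.NumGoroutine())
	}
}
```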

davecheney commented 4 years ago

Note that Contour 1.2.0 also upgraded Envoy to 1.13.0

davecheney commented 4 years ago

@bgagnon thank you for the information you provided. I'm trying to build a reproduction in my test lab now. Based on your comments that both contour and your gRPC sidecar have the same memory usage characteristics, my initial line of investigation will be to correlate gRPC streams with memory usage.

davecheney commented 4 years ago

@bgagnon I wanted to give a quick update. After a bit of screwing around where I managed to b0rk my GKE cluster, I've got a setup which resembles the environment you describe.

At the moment the rough numbers I have for the per-envoy cost, that is, one Envoy as a client of one Contour, are around 20-30 MB (the numbers are rubbery because I'm observing a garbage-collected process externally). This is heap alloc (go_memstats_heap_alloc_bytes), not _sys_bytes, by the way.

Given you said you have 2 contours and 30 envoys, and assuming a roughly equal distribution (about 15 envoys per contour), that's roughly 15 x 30 MB ≈ 450 MB of per-envoy cost per contour, which is in the ballpark of the figures you've reported.

That's about all I have so far. I don't know if this usage is unexpected, but given that your sidecars -- which are going to be interposing on those gRPC conversations -- are showing memory usage in the same order of magnitude, my next area of investigation will be to try to determine a per gRPC stream cost.

davecheney commented 4 years ago

Some indicative numbers from a single contour servicing 30 envoys and 300 virtual hosts:

```
# HeapAlloc = 507501408
# HeapSys = 975273984
# HeapIdle = 413835264
# HeapInuse = 561438720
# HeapReleased = 354091008
# HeapObjects = 9893915
```
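
(For anyone reproducing this: the "# HeapAlloc = ..." lines above are the MemStats footer printed by Go's heap profile in debug mode. A minimal sketch of exposing that endpoint in a Go service is below; this is an assumed setup, not necessarily how these numbers were captured. With it in place, GET /debug/pprof/heap?debug=1 prints a footer in this format.)

```go
// Minimal sketch, not Contour's actual debug wiring: importing net/http/pprof
// registers the /debug/pprof/* handlers on http.DefaultServeMux, and
// /debug/pprof/heap?debug=1 ends with "# HeapAlloc = ..." lines like the
// ones quoted above.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```
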
stevesloka commented 4 years ago

@bgagnon hey just a quick follow-up on this, are you still seeing the memory issues? Sorry this seems to have gotten lost, but want to check in and verify how you're doing.

bgagnon commented 4 years ago

Our latest stats on Contour memory consumption:

We've seen a drop in memory consumption since we scaled Envoy down from 20 pods to 3 pods. That seems to be the biggest contributing factor, though the effect was not linear.

Meanwhile, every other quantity (proxies, secrets, services) has increased without a big impact on Contour memory consumption.

Since our Contour is now so far behind, I think it's best if I report back after we've upgraded to the latest release.

skriss commented 2 years ago

Closing this out as obsolete.