
Memory leak in TCP services (kafka, presto, etc) #52374

Open · owais opened this issue 3 months ago

owais commented 3 months ago


Bug Description

We started rolling out istio more broadly and noticed the sidecar leaking memory for some TCP services such as kafka and presto.

We had previously experienced similar leaks and fixed them by disabling custom dimensions on TCP metrics. Ref: #48023 and #48028

[image attachment]

I tried disabling all istio_ metrics using an Envoy bootstrap override that drops any metric with an istio_ prefix via stats_matcher, but so far it doesn't look like it is helping much.
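(For reference, a bootstrap-level exclusion like this boils down to an Envoy stats_config fragment merged into the sidecar bootstrap. The snippet below is only a sketch of the matcher itself; the delivery mechanism, e.g. a ConfigMap referenced via the sidecar.istio.io/bootstrapOverride annotation, is an assumption and not taken from the original report.)

```yaml
# Envoy bootstrap fragment: drop every stat whose name starts with "istio_".
# How this fragment reaches the proxy (bootstrap-override ConfigMap, custom
# bootstrap template, ...) depends on the setup and is assumed here, not quoted.
stats_config:
  stats_matcher:
    exclusion_list:
      patterns:
      - prefix: "istio_"
```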

Attaching heap profile report profile001.pdf

Version

❯ istioctl version
client version: 1.20.6
control plane version: 1.18.7
data plane version: 1.18.7 (739 proxies)

I've also tested this with 1.20.6 in another lab cluster and seen the same behavior.

Additional Information

No response

owais commented 3 months ago

We still had two custom dimensions on one HTTP metric (requests_total), a metric these services barely emit since they mostly use TCP. I decided to try removing those metric dimensions from the sidecar anyway, and although it did not fix the leak, sidecar memory consumption dropped from close to 100% to around 20% for all kafka pods.

It is interesting that the sidecar freed so much memory when dropping custom dimensions for a service that wasn't even emitting the metric. This makes me think it wasn't the custom dimensions themselves but some kind of internal reset of objects held on to by the stats system. My assumption is that any change to the metrics config will result in a similar drop, followed by a gradual increase again.

My change:

[image attachment]
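(The screenshot of the change isn't preserved here. As a hedged sketch only: a change of this shape, expressed through the Telemetry API and with placeholder dimension names, would look roughly like the following; whether the Telemetry API was the actual mechanism used is an assumption.)

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-custom-request-dimensions   # illustrative name
  namespace: istio-system
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: REQUEST_COUNT            # istio_requests_total
      tagOverrides:
        custom_dim_1:                    # placeholder dimension names
          operation: REMOVE
        custom_dim_2:
          operation: REMOVE
```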

Effect on sidecar memory:

[image attachment]

owais commented 3 months ago

I also tried setting METRIC_ROTATION_INTERVAL to 6h and 10m in different clusters, but it seemingly had no effect. I'd have expected it to release a bunch of objects, resulting in memory usage dropping every 6 hours, but I'm not seeing any such pattern. I'm still using telemetry v2 (1.18), so either this setting is not helping or telemetry v2 does not support it.
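(For context: this is a proxy environment variable. One common way to set it, assumed here rather than quoted from the comment above, is through defaultConfig.proxyMetadata in MeshConfig.)

```yaml
# MeshConfig fragment (istio ConfigMap / IstioOperator meshConfig section).
# proxyMetadata entries are injected as environment variables into the sidecar;
# the value below is illustrative.
defaultConfig:
  proxyMetadata:
    METRIC_ROTATION_INTERVAL: "6h"
```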

kyessenov commented 3 months ago

Interesting that the memory is held by OPENSSL objects, which suggests it's some TLS metadata cached somewhere.

owais commented 3 months ago

I don't think this is a bug in istio. I think this is a connection leak caused by services not closing connections properly. Still collecting data. Will share the report once it is conclusive.

kyessenov commented 3 months ago

The default buffers are pretty large in Istio AFAIR. I think if you have a bad / leaky application it'd make sense to reduce these buffers - the connection buffers, the flow control windows, and the inspector buffer sizes are all adjustable.
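(For readers looking for these knobs: at the Envoy level they map to things like per_connection_buffer_limit_bytes on listeners and clusters and the HTTP/2 window sizes, which can be patched with an EnvoyFilter. The sketch below is illustrative only; names and values are assumptions, not recommendations.)

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: shrink-connection-buffers    # illustrative name
  namespace: istio-system            # root namespace => applies mesh-wide
spec:
  configPatches:
  # Cap the per-connection read/write buffer on inbound sidecar listeners.
  - applyTo: LISTENER
    match:
      context: SIDECAR_INBOUND
    patch:
      operation: MERGE
      value:
        per_connection_buffer_limit_bytes: 32768   # illustrative value
  # Cap the buffer used for upstream (outbound) connections as well.
  - applyTo: CLUSTER
    match:
      context: SIDECAR_OUTBOUND
    patch:
      operation: MERGE
      value:
        per_connection_buffer_limit_bytes: 32768
```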

owais commented 3 months ago

Could you please point me to the documentation for these? Thanks

owais commented 3 months ago

For one of our services, I noticed that the proxy had a ton (tens of thousands) of connections open in CLOSE_WAIT state, and the number kept increasing over time. Setting ISTIO_META_IDLE_TIMEOUT to 1h (the default) made a huge difference for that application and apparently fixed the leak (a sketch of how that setting is typically applied follows below). We had to disable the timeout earlier due to another issue, but that is another story.

[image attachment]

Are the buffer and control window options documented anywhere by the istio project?
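(For reference, a hedged sketch of setting this per workload: the proxy.istio.io/config annotation with proxyMetadata is one common mechanism and is assumed here, not quoted from the thread.)

```yaml
# Pod template annotation on the workload (e.g. the kafka Deployment/StatefulSet).
# proxyMetadata entries become environment variables on the injected sidecar.
metadata:
  annotations:
    proxy.istio.io/config: |
      proxyMetadata:
        ISTIO_META_IDLE_TIMEOUT: "1h"
```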

aayush-harwani commented 2 months ago

@owais, does this really fix the issue? I am setting ISTIO_META_IDLE_TIMEOUT to 1s but memory usage is still high.

owais commented 2 months ago

If your issue is services leaking connections and, as a result, increasing memory, then this will mitigate the issue in the sidecar. However, memory can increase consistently for any number of reasons, and high memory usage is very different from consistently increasing memory consumption. I'd suggest you open a new ticket if you think it is a bug, or ask on Istio Slack/Discourse or Stack Overflow if you have questions or need support.

istio-policy-bot commented 1 week ago

🧭 This issue or pull request has been automatically marked as stale because it has not had activity from an Istio team member since 2024-08-01. It will be closed on 2024-11-14 unless an Istio team member takes action. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.