istio / istio

Connect, secure, control, and observe services.
https://istio.io

Memory leak in TCP services (kafka, presto, etc) #52374

Open · owais opened 1 month ago

owais commented 1 month ago


Bug Description

We started rolling out Istio more broadly and noticed the sidecar leaking memory for some TCP services such as Kafka and Presto.

We had previously experienced similar leaks and fixed them by disabling custom dimensions on TCP metrics. Ref: #48023 and #48028

[screenshot]
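
For reference, disabling a custom dimension on the TCP metrics can be expressed with the Telemetry API roughly as follows. This is only a sketch: my_custom_dimension stands in for whatever dimension was actually configured, and the resource name is made up.

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-tcp-custom-dimensions   # placeholder name
  namespace: istio-system            # root namespace = applies mesh-wide
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    # Remove the custom tag from the TCP byte-count metrics so the proxy no
    # longer tracks a separate time series per tag value.
    - match:
        metric: TCP_SENT_BYTES
        mode: CLIENT_AND_SERVER
      tagOverrides:
        my_custom_dimension:         # placeholder tag name
          operation: REMOVE
    - match:
        metric: TCP_RECEIVED_BYTES
        mode: CLIENT_AND_SERVER
      tagOverrides:
        my_custom_dimension:
          operation: REMOVE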

I tried disabling all istio_ metrics using an Envoy bootstrap override that drops any metric with an istio_ prefix via stats_matcher, but so far it doesn't look like it is helping much.
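
For context, the kind of override meant here would look roughly like the sketch below. It is not the exact config used: the ConfigMap name is made up, and it assumes the custom-bootstrap mechanism where the pod references the ConfigMap through the sidecar.istio.io/bootstrapOverride annotation.

apiVersion: v1
kind: ConfigMap
metadata:
  name: stats-exclusion-bootstrap    # placeholder name
data:
  # Partial Envoy bootstrap merged into the sidecar's generated bootstrap.
  custom_bootstrap.json: |
    {
      "stats_config": {
        "stats_matcher": {
          "exclusion_list": {
            "patterns": [
              { "prefix": "istio_" }
            ]
          }
        }
      }
    }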

Attaching heap profile report profile001.pdf

Version

❯ istioctl version
client version: 1.20.6
control plane version: 1.18.7
data plane version: 1.18.7 (739 proxies)

I've also tested this with 1.20.6 in another lab cluster and seen the same behavior.

Additional Information

No response

owais commented 1 month ago

We still had two custom dimensions on one HTTP metric (requests_total), a metric these services barely use since their traffic is mostly TCP. I decided to try removing those dimensions from the sidecar anyway, and although it did not fix the leak, sidecar memory consumption dropped from close to 100% to around 20% for all Kafka pods.

It is interesting that the sidecar freed so much memory when custom dimensions were dropped for a metric the service wasn't even emitting. This makes me think it wasn't really about the custom dimensions but about some kind of internal reset of objects held on to by the stats system. My assumption is that any change to the metrics config would result in a similar drop, followed by a gradual increase again.

My change:

[screenshot]
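
The screenshot is not reproduced here; removing two dimensions from requests_total would look something like the sketch below with the Telemetry API. The tag names and namespace are placeholders, not the actual values from the change.

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-http-custom-dimensions  # placeholder name
  namespace: kafka                   # placeholder namespace
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: REQUEST_COUNT        # the istio_requests_total metric
        mode: CLIENT_AND_SERVER
      tagOverrides:
        custom_dimension_a:          # placeholder tag names
          operation: REMOVE
        custom_dimension_b:
          operation: REMOVE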

Effect on sidecar memory:

[screenshot]

owais commented 1 month ago

I also tried setting METRIC_ROTATION_INTERVAL to 6h and 10m in different clusters, but it seemingly had no effect. I'd have expected it to release a bunch of objects, resulting in memory usage dropping every 6 hours, but I'm not seeing any such pattern. I'm still using telemetry v2 (1.18), so either this setting is not helping or telemetry v2 does not support it.
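
For anyone following along, this setting is normally injected through proxyMetadata, either mesh-wide or per pod. The snippets below are a sketch assuming that mechanism, with the 6h value as an example.

# Mesh-wide, e.g. under spec.meshConfig in an IstioOperator:
meshConfig:
  defaultConfig:
    proxyMetadata:
      METRIC_ROTATION_INTERVAL: "6h"

# Or per workload, as a pod template annotation:
metadata:
  annotations:
    proxy.istio.io/config: |
      proxyMetadata:
        METRIC_ROTATION_INTERVAL: "6h"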

kyessenov commented 1 month ago

Interesting that the memory is held by OPENSSL objects, which suggests it's some TLS metadata cached somewhere.

owais commented 1 month ago

I don't think this is a bug in Istio. I think it is a connection leak caused by services not closing connections properly. Still collecting data; I will share the report once it is conclusive.

kyessenov commented 1 month ago

The default buffers are pretty large in Istio, AFAIR. If you have a bad/leaky application, I think it would make sense to reduce these buffers: the connection buffers, the flow-control windows, and the inspector buffer sizes are all adjustable.
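
As an illustration of the kind of tuning meant here, the per-connection buffer limit can be lowered with an EnvoyFilter roughly like the sketch below. The name, namespace, selector, and the 32 KiB value are placeholders, and this only covers the connection buffers, not the flow-control windows or the inspector buffers.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: small-connection-buffers     # placeholder name
  namespace: kafka                   # placeholder namespace
spec:
  workloadSelector:
    labels:
      app: kafka                     # placeholder label
  configPatches:
  # Cap how much data Envoy buffers per downstream connection.
  - applyTo: LISTENER
    match:
      context: SIDECAR_INBOUND
    patch:
      operation: MERGE
      value:
        per_connection_buffer_limit_bytes: 32768
  # Same cap for upstream connections opened by the sidecar.
  - applyTo: CLUSTER
    match:
      context: SIDECAR_OUTBOUND
    patch:
      operation: MERGE
      value:
        per_connection_buffer_limit_bytes: 32768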

owais commented 1 month ago

Could you please point me to the documentation for these? Thanks

owais commented 1 month ago

For one of our services, I noticed that the proxy had tens of thousands of connections open in the CLOSE_WAIT state, and the number kept increasing over time. Setting ISTIO_META_IDLE_TIMEOUT back to 1h (the default) made a huge difference for that application and apparently fixed the leak (config sketch after the screenshot below). We had disabled the timeout earlier due to another issue, but that is another story.

[screenshot]
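
A sketch of how this is typically set per workload, assuming the proxy.istio.io/config annotation mechanism (the 1h value matches the default idle timeout):

# Pod template annotation on the affected Deployment/StatefulSet:
metadata:
  annotations:
    proxy.istio.io/config: |
      proxyMetadata:
        ISTIO_META_IDLE_TIMEOUT: "1h"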

Are the buffer and flow-control window options documented anywhere by the Istio project?

aayush-harwani commented 4 weeks ago

@owais, does this really fix the issue? I am setting ISTIO_META_IDLE_TIMEOUT to 1s, but memory usage is still high.

owais commented 4 weeks ago

If your issue is services leaking connections and, as a result, memory growing, then this will mitigate the problem in the sidecar. However, memory can increase consistently for any number of reasons, and high memory usage is very different from consistently increasing memory consumption. I'd suggest opening a new ticket if you think it is a bug, or asking on the Istio Slack/Discourse or Stack Overflow if you have questions or need support.