owais opened this issue 3 months ago
We still had two custom dimensions on one HTTP metric (requests_total), a metric these services don't even use since they are mainly TCP. I decided to try removing those metric dimensions from the sidecar anyway, and although it did not fix the leak, sidecar memory consumption dropped from close to 100% to around 20% for all Kafka pods.
It is interesting that the sidecar freed so much memory when dropping custom dimensions for a service that wasn't even emitting the metric. This makes me think it wasn't the custom dimensions themselves but some kind of internal reset of objects held on to by the stats system. My assumption is that any change to the metrics config will result in a similar drop followed by a gradual increase.
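For reference, dropping a dimension from a metric like istio_requests_total is typically done through the Telemetry API; the following is only a sketch with a placeholder tag name and namespace, not my actual change:

```yaml
# Sketch only: remove an assumed custom tag from requests_total.
# "my_custom_dimension" and the namespace are placeholders.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-custom-dimensions
  namespace: kafka                      # assumed workload namespace
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: REQUEST_COUNT           # corresponds to istio_requests_total
        mode: CLIENT_AND_SERVER
      tagOverrides:
        my_custom_dimension:            # placeholder tag name
          operation: REMOVE
```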
My change:
Effect on sidecar memory:
I also tried setting METRIC_ROTATION_INTERVAL to 6h and to 10m in different clusters, but it seemingly had no effect. I'd have expected it to release a bunch of objects, resulting in memory usage dropping every 6 hours, but I'm not seeing any such pattern. I'm still using telemetry v2 (1.18), so either this setting is not helping or telemetry v2 does not support it.
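For reference, one way to set this mesh-wide is via proxyMetadata in the IstioOperator; a sketch only, with one of the values tried above and nothing else changed:

```yaml
# Sketch: set the proxy's metric rotation interval mesh-wide via
# meshConfig.defaultConfig.proxyMetadata. The value is illustrative.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        METRIC_ROTATION_INTERVAL: "6h"
```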
It is interesting that the memory is held by OpenSSL objects, which suggests some TLS metadata is being cached somewhere.
I don't think this is a bug in Istio. I think this is a connection leak caused by services not closing connections properly. Still collecting data. Will share the report once it is conclusive.
The default buffers are pretty large in Istio AFAIR. I think if you have a bad / leaky application it'd make sense to reduce these buffers - the connection buffers, the flow control windows, and the inspector buffer sizes are all adjustable.
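For example, the connection buffer can be capped with an EnvoyFilter along these lines (a sketch only; names and values are illustrative, not a recommendation):

```yaml
# Sketch: cap outbound cluster connection buffers mesh-wide.
# Envoy's per-connection buffer limit defaults to 1MiB when unset.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: smaller-connection-buffers      # illustrative name
  namespace: istio-system               # root namespace, so it applies mesh-wide
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      context: SIDECAR_OUTBOUND
    patch:
      operation: MERGE
      value:
        per_connection_buffer_limit_bytes: 32768   # 32 KiB instead of the default
```

A similar MERGE patch with applyTo: LISTENER covers the inbound side.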
Could you please point me to the documentation for these? Thanks
For one of our services, I noticed that the proxy had tens of thousands of connections open in the CLOSE_WAIT state, and the number kept increasing over time. Setting ISTIO_META_IDLE_TIMEOUT to 1h (the default) made a huge difference to that application and apparently fixed the leak. We had to disable the timeout earlier due to another issue, but that is another story.
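For anyone else hitting this, the idle timeout can be set per workload through the proxy config annotation; a sketch (the 1h value matches the default mentioned above, everything else is placeholder):

```yaml
# Sketch: re-enable the 1h idle timeout for a single deployment via the
# proxy.istio.io/config pod annotation (pod template fragment, names omitted).
spec:
  template:
    metadata:
      annotations:
        proxy.istio.io/config: |
          proxyMetadata:
            ISTIO_META_IDLE_TIMEOUT: "1h"
```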
Are the buffer and flow control window options documented anywhere by the Istio project?
@owais, does this really fix the issue? I am setting ISTIO_META_IDLE_TIMEOUT to 1s but memory usage is still high.
If your issue is services leaking connections and, as a result, increasing memory, then this will mitigate the issue in the sidecar. However, memory can increase consistently for any number of reasons. Also, high memory usage is very different from consistently increasing memory consumption. I'd suggest you open a new ticket if you think it is a bug, or ask on Istio Slack/Discourse or Stack Overflow if you have questions or need support.
🧭 This issue or pull request has been automatically marked as stale because it has not had activity from an Istio team member since 2024-08-01. It will be closed on 2024-11-14 unless an Istio team member takes action. Please see this wiki page for more information. Thank you for your contributions.
Created by the issue and PR lifecycle manager.
Is this the right place to submit this?
Bug Description
We started rolling out Istio more broadly and noticed the sidecar leaking memory for some TCP services such as Kafka and Presto.
We had previously experienced similar leaks and fixed them by disabling custom dimensions on TCP metrics. Ref: #48023 and #48028
I tried disabling all istio_ metrics using an Envoy bootstrap override and dropping any metric with an istio_ prefix using stats_matcher, but so far it doesn't look like it is helping much.
Attaching heap profile report: profile001.pdf
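For context, a bootstrap override with a stats_matcher exclusion generally looks like the sketch below (names are placeholders and this is illustrative rather than my exact file); the ConfigMap is referenced from the pod via the sidecar.istio.io/bootstrapOverride annotation:

```yaml
# Illustrative sketch: a ConfigMap holding a partial Envoy bootstrap that
# excludes all stats with the istio_ prefix.
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-custom-bootstrap        # placeholder name
data:
  custom_bootstrap.json: |
    {
      "stats_config": {
        "stats_matcher": {
          "exclusion_list": {
            "patterns": [
              { "prefix": "istio_" }
            ]
          }
        }
      }
    }
```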
Version
I've also tested this with 1.20.6 in another lab cluster and seen the same behavior.
Additional Information
No response