swiatekm opened 1 year ago
I managed to reproduce this in a test K8s cluster. I've also written some benchmarks which demonstrate the problem, although this is somewhat challenging for reasons which I'll elaborate upon.
The test cluster was running a synthetic workload generating 10k log lines per second. The resource consumption, as reported by kubectl top, was respectively:
otelcol-logs-collector-5mdw2 593m 79Mi
otelcol-logs-collector-4jc6b 526m 469Mi
Nothing in the confighttp code indicates why there would be such a large difference, or why the zstd encoder would allocate so much memory.
I believe the root cause is a combination of the zstd encoder allocating a fair amount of memory by default, and our pooling mechanism for encoders just not working as expected. We put encoders in a sync.Pool, but in most practical circumstances we don't make requests frequently enough to keep the pool hot. As a result, we create a new encoder for each request, and the zstd encoder handles this particularly poorly.
I've sketched out and tested a solution using a different pooling mechanism here: https://github.com/open-telemetry/opentelemetry-collector/compare/main...swiatekm-sumo:opentelemetry-collector:fix/zstd-encoder-pooling. With this change, the memory usage is reasonable again.
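For illustration, here is a minimal sketch of one possible alternative pooling mechanism (not the code in the linked branch): a small, bounded channel of encoders that stays warm across infrequent requests, instead of a sync.Pool that the GC can empty between them. The names (boundedPool, newBoundedPool) and the use of zstd.WithEncoderConcurrency(1) are illustrative assumptions, not the collector's actual implementation.

```go
// Sketch of a bounded encoder pool; not the collector's actual code.
package main

import (
	"bytes"
	"fmt"
	"io"

	"github.com/klauspost/compress/zstd"
)

// boundedPool keeps up to cap(p.ch) encoders alive across requests, so
// infrequent requests still reuse encoders instead of allocating new ones.
type boundedPool struct {
	ch chan *zstd.Encoder
}

func newBoundedPool(size int) *boundedPool {
	return &boundedPool{ch: make(chan *zstd.Encoder, size)}
}

// get returns a pooled encoder reset to w, or creates a new one when the
// pool is empty. WithEncoderConcurrency(1) limits per-encoder goroutines.
func (p *boundedPool) get(w io.Writer) (*zstd.Encoder, error) {
	select {
	case enc := <-p.ch:
		enc.Reset(w)
		return enc, nil
	default:
		return zstd.NewWriter(w, zstd.WithEncoderConcurrency(1))
	}
}

// put returns an encoder to the pool; the caller is expected to have
// Closed it first. If the pool is full, the encoder is simply dropped.
func (p *boundedPool) put(enc *zstd.Encoder) {
	select {
	case p.ch <- enc:
	default:
	}
}

func main() {
	pool := newBoundedPool(4)
	var buf bytes.Buffer
	enc, err := pool.get(&buf)
	if err != nil {
		panic(err)
	}
	_, _ = enc.Write([]byte("payload"))
	_ = enc.Close() // flush the frame; the encoder can be Reset and reused
	pool.put(enc)
	fmt.Println("compressed bytes:", buf.Len())
}
```

The design point is simply that a bounded pool retains encoders indefinitely, so the relatively expensive zstd encoder setup is paid at most `size` times rather than once per request.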
Would you like to submit a PR to fix the issue?
@atoulme I'll try when I get the time. I still don't understand why the problem is as severe as it is. My fix is effective, so I must be roughly correct about the root cause, but I'd like to try to understand it better before submitting.
@swiatekm-sumo @atoulme here are some of my findings on this. I deployed the otel collector on a k8s cluster today with at least 50 nginx pods generating a new log every second. The difference in memory usage between the two compression types is below:
sumo-sumologic-otelcol-logs-collector-vqp5z 124m 74Mi
sumo-sumologic-otelcol-logs-collector-4r92s 138m 106Mi
The difference in memory usage wasn't 10x as seen previously. I used 0.94.0 for testing and let the collector run for at least an hour.
@swiatekm-sumo @atoulme Here are more findings and results from tests:
v0.92.0
zstd: sumo-sumologic-otelcol-logs-collector-2lk4k 245m 250Mi
gzip: sumo-sumologic-otelcol-logs-collector-x2wmp 291m 84Mi
v0.94.0
gzip: sumo-sumologic-otelcol-logs-collector-djwmz 199m 73Mi
zstd: sumo-sumologic-otelcol-logs-collector-v8pjr 198m 118Mi
There seems to be some improvement in 0.94.0, which uses a later version (v1.17.5) of the compress library that includes some zstd changes: https://github.com/klauspost/compress/releases/tag/v1.17.5
Apart from the above, there was a known memory leak when using zstd with a sync.Pool. It was addressed in v1.15.0 by adding an option to disable concurrency, which prevents spawning new goroutines, as explained here.
A possible change we could make here is to simply use zstd.WithDecoderConcurrency(1), which prevents goroutine leaks, as done in this PR: https://github.com/segmentio/kafka-go/pull/889
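For reference, here is a hedged sketch of what constructing a decoder with concurrency disabled looks like with the klauspost/compress API; this is illustrative usage, not the collector's confighttp code:

```go
// Sketch: zstd decoding with concurrency disabled, so no background
// goroutines are spawned for a decoder that is later dropped without Close.
package main

import (
	"bytes"
	"fmt"
	"io"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// Produce a small zstd frame to decode.
	var compressed bytes.Buffer
	enc, _ := zstd.NewWriter(&compressed)
	_, _ = enc.Write([]byte("hello zstd"))
	_ = enc.Close()

	// Decoder with concurrency disabled, as suggested in the linked PR.
	dec, err := zstd.NewReader(&compressed, zstd.WithDecoderConcurrency(1))
	if err != nil {
		panic(err)
	}
	defer dec.Close()

	out, err := io.ReadAll(dec)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```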
Thank you for this inquiry and providing data, I appreciate it. Do we have appropriate benchmarks for zstd usage that we could use to test the simple change you mention?
I can work on the benchmarks to test zstd compression with and without concurrency enabled. I do want to emphasize that, based on tests on a k8s cluster, the zstd memory usage in v1.17.5 of the compress package seems to be lower than in v1.17.4.
Ideally we'd have a benchmark showing the difference, though from my attempts to create one, this may not be easy to do. The behaviour is timing-sensitive due to the use of sync.Pool.
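To make the timing sensitivity concrete, here is a minimal sketch of a benchmark that simulates a cold pool, assuming encoders are kept in a sync.Pool as described earlier in the thread. Forcing two GC cycles clears the pool's primary and victim caches, so every iteration allocates a fresh encoder, which is the worst case for infrequent requests. The package and benchmark names are illustrative.

```go
// Sketch of a "cold pool" benchmark; names and setup are illustrative.
package zstdbench

import (
	"bytes"
	"runtime"
	"sync"
	"testing"

	"github.com/klauspost/compress/zstd"
)

var encoderPool = sync.Pool{
	New: func() any {
		enc, _ := zstd.NewWriter(nil)
		return enc
	},
}

func BenchmarkZstdColdPool(b *testing.B) {
	payload := bytes.Repeat([]byte("example log line for compression\n"), 512)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		// Simulate infrequent requests: two GC cycles empty the sync.Pool
		// (primary and victim caches), so Get builds a fresh encoder.
		runtime.GC()
		runtime.GC()
		enc := encoderPool.Get().(*zstd.Encoder)
		var buf bytes.Buffer
		enc.Reset(&buf)
		_, _ = enc.Write(payload)
		_ = enc.Close()
		encoderPool.Put(enc)
	}
}
```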
Created the following draft PR for this, to show the difference in memory allocation with concurrency disabled - https://github.com/open-telemetry/opentelemetry-collector/pull/9749
Resolved in #9749
Just to add: this still appears to be an issue in the latest version, even with the above MR, for server-side HTTP receivers. I'm seeing constant crash loops with zstd enabled, with almost all memory attributed to zstd code paths.
@djluck that's strange given that the server side appears to already have our client-side fix implemented. Can you open a new issue with your findings?
Sure thing @swiatekm
Describe the bug: We've added zstd support to the server side of confighttp in #7927. I rolled this out for OTLP traffic between an agent and a gateway in a K8s environment and saw a very significant increase in memory consumption on the client side.
Steps to reproduce: I have some memory profiles saved from this, and I'm planning to create a self-contained reproduction, ideally with just synthetic benchmarks.
What did you expect to see? Memory consumption in the ballpark of what other compression methods use.
What did you see instead? Memory consumption for the otlphttp exporter using zstd was more than 10x that of gzip.
What version did you use? Version: 0.82.0
What config did you use? The relevant part: the otlphttp exporter with compression set to zstd.
Environment: Kubernetes 1.24, EKS to be precise.
Additional context: I'm reporting this as-is so it doesn't get lost and to consolidate reports in case other users experience this. Will update with more data once I'm able to.