Open JoshuaC215 opened 3 years ago
Hey @JoshKCarroll, thanks for digging into this and for the great write-up. I'll keep an eye on it.
We are also experiencing similar issues related to high memory usage and pods running out of memory. We too use zstd compression.
@drurenia there's some good discussion in the linked issue above. I believe for us it partly had to do with producing from librdkafka and consuming from Benthos (Sarama), and some mismatch between their implementations. Tweaking some of the Kafka configurations around packet size definitely helped (this matters for both the producer and the consumer!). However, we ended up switching back to snappy for the time being, as we couldn't find a way to eliminate the problem entirely and preferred the less efficient compression over the sporadic OOM crashes.
Some changes were getting pushed up into zstd and Sarama that could help, but I don't know whether those have made it into Sarama or Benthos yet (nor do I know for sure whether they would fix this issue).
@JoshKCarroll, the discussion going on in the linked issue is rather interesting indeed.
I am going to run some tests using snappy instead of zstd to verify that my issues are indeed related to zstd, and afterwards I will share my findings here.
Thanks a lot for all the info you've provided.
I'll make a note to upgrade Sarama for the next release. It looks as though they've put tagged releases out with an updated version of klauspost/compress, so fingers crossed.
@Jeffail, that's good news. Thanks!
I can confirm now that my problem was indeed related to zstd. With snappy we have sane memory consumption and, most importantly, no OOM.
Thanks for the update @drurenia. Since this is quite a severe problem and we aren't necessarily sure we've got access to a fix yet, I think it's probably worth me adding a note to the docs pointing to this issue.
We found recently that when we switched from compressing with snappy to compressing with zstd on our Kafka topics, newer versions of Benthos with the kafka_balanced input started hitting OOMs when operating under a backlog. I believe this issue has to do with the Sarama library and/or the zstd compression library it uses (which changed between older and newer Benthos versions). I have opened a primary issue here with all the details: https://github.com/Shopify/sarama/issues/1831 but figured it was worth raising here for visibility and in case anyone had insight.
Slightly related: I wanted to try tuning some of the sarama consumer fetch configurations in benthos to see if it made the issue go away (although now that I understand it better, I'm not sure it will). I could not figure out a way to do it without forking Benthos or writing a library that used Benthos with a modified Sarama consumer client. I wondered if you had any suggestion on how to do this, and/or if you would accept pull requests to expose more arguments like this if we need them down the road.
Thank you!