redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/

Extremely high memory usage for `kafka_balanced` input reading zstd #552

Open JoshuaC215 opened 3 years ago

JoshuaC215 commented 3 years ago

We recently found that after we switched our Kafka topics from snappy to zstd compression, newer versions of Benthos using the `kafka_balanced` input started running out of memory (OOM) when working through a backlog.

I believe this issue has to do with the Sarama library and/or the zstd compression library it uses (which changed between older and newer Benthos versions). I have opened a primary issue with all the details here: https://github.com/Shopify/sarama/issues/1831, but figured it was worth raising here too for visibility and in case anyone had insight.

Slightly related: I wanted to try tuning some of the sarama consumer fetch configurations in benthos to see if it made the issue go away (although now that I understand it better, I'm not sure it will). I could not figure out a way to do it without forking Benthos or writing a library that used Benthos with a modified Sarama consumer client. I wondered if you had any suggestion on how to do this, and/or if you would accept pull requests to expose more arguments like this if we need them down the road.

Thank you!

Jeffail commented 3 years ago

Hey @JoshKCarroll, thanks for digging into this and for the great write-up. I'll keep an eye on it.

drurenia commented 3 years ago

We are also experiencing similar issues related to high memory usage and pods running out of memory. We too use zstd compression.

JoshuaC215 commented 3 years ago

@drurenia there's some good discussion in the linked issue above. I believe for us it partly had to do with producing from librdkafka and consuming from Benthos (Sarama), and some mismatch between their implementations. Tweaking some of the Kafka configurations around packet size definitely helped (this is important for both the producer and the consumer!). However, we ended up switching back to snappy for the time being, as we couldn't find a way to eliminate the problem entirely and preferred the less efficient compression over the sporadic OOM crashes.

Some changes were being pushed upstream into zstd and Sarama that could help. I don't know whether those have made it into Sarama or Benthos yet (nor do I know for sure whether they would fix this issue).

drurenia commented 3 years ago

@JoshKCarroll, the discussion going on in the linked issue is rather interesting indeed.

I am going to run some tests using snappy instead of zstd to verify that my issues are indeed related to zstd, and afterwards I will share my findings here.

Thanks a lot for all the info you've provided.

Jeffail commented 3 years ago

I'll make a note to upgrade Sarama for the next release. It looks as though they've put tagged releases out with an updated version of klauspost/compress, so fingers crossed.

drurenia commented 3 years ago

@Jeffail, that's good news. Thanks!

I can confirm now that my problem was indeed related to zstd. With snappy we have sane memory consumption and, most importantly, no OOM.

Jeffail commented 3 years ago

Thanks for the update @drurenia. Since this is quite a severe problem and we aren't sure we have access to a fix yet, I think it's probably worth me adding a note to the docs pointing to this issue.