Closed: franjo-piskur closed this issue 2 years ago
I'm not sure which cache exactly you have in mind. Is it the disk cache, or some internal Kafka cache?
I would say Kafka's internal cache, because after I executed the following on the application node:
echo 3 > /proc/sys/vm/drop_caches
I saw in Hawkular Metrics that memory usage on that broker dropped significantly (from 40+ GiB to 8 GiB), and that the memory was freed on the application node.
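For reference, a minimal sketch of how the reclaimable cache on the node could be checked before resorting to drop_caches; the exact /proc/meminfo fields depend on the kernel version, and no Hawkular or node-exporter metric names are assumed here.

```sh
# Overall memory usage on the node, including page cache and buffers
free -g

# Break the cached memory down further (values are in kB)
grep -E 'MemFree|Buffers|^Cached|Slab|SReclaimable' /proc/meminfo

# Only after confirming the memory is reclaimable cache, drop it
# (flushes page cache, dentries and inodes; safe, but the cache goes cold)
sync
echo 3 > /proc/sys/vm/drop_caches
```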
I see that some other people have similar issues when running Kafka on OpenShift:
https://stackoverflow.com/questions/45300985/openshift-resource-limit-and-pagecache
Yeah, to be honest I'm not sure I can help with this. This is controlled by the operating system. It is memory which is not assigned to any specific container, so you cannot really control it through the OpenShift memory requests. Normally, the kernel should free it automatically when this memory is needed by someone else. But it sometimes happens that applications fail to allocate new memory, and that can probably cause some pod restarts.
In your case ... do you have any memory requests / limits set on the Kafka brokers and on the other applications? Also, for Kafka and the other Java apps ... do you use the -Xmx and -Xms options?
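For illustration, a minimal sketch of how the broker heap and the container memory could be bounded in the Strimzi Kafka custom resource; the cluster name, API version and the concrete sizes are placeholders, not recommendations, and assume a reasonably recent Strimzi release.

```yaml
apiVersion: kafka.strimzi.io/v1beta2   # older Strimzi releases use v1beta1
kind: Kafka
metadata:
  name: my-cluster                     # example name
spec:
  kafka:
    replicas: 5
    resources:
      requests:
        memory: 8Gi
      limits:
        memory: 8Gi
    jvmOptions:
      "-Xms": "4g"                     # fixed heap so the JVM does not grow into the limit
      "-Xmx": "4g"
    # ... listeners, storage, config etc. omitted
  zookeeper:
    replicas: 3
    # ...
```

The idea is to keep the heap well below the container memory limit so there is headroom for off-heap memory; the page cache itself is still managed by the kernel, as noted above.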
@franjo-piskur Were you able to fix this issue? I am having the same problem: under heavy load Kafka uses all the available memory on the node, and the pod is getting evicted and recreated.
Thanks, Shiva
Hello,
We deployed Strimzi with 5 Kafka brokers. Over time we realized that the brokers consumed all free memory on the application node for caching. That caused some issues with other pods running on the same node. On OpenShift we didn't see any event messages, but we noticed that two brokers had a different age than the others (these two were 1 day old, the others 10 days old). After some troubleshooting we realized that at the moment those 2 pods were recreated (not restarted), all memory was used by the page cache and slab cache.
Some other pods running on that application node were also recreated, and in Prometheus the node was reported as down for a few minutes (we use node-exporter to get those metrics), as if it was unreachable from the pods on the other application nodes.
Is there some way in Kafka to limit the total cache/buffer memory without setting pod limits, which would restart the Kafka brokers?