redpanda-data / redpanda


Conntrack entries on Redpanda nodes constantly increasing in K8s #21412

Open · maksym-iv opened this issue 3 months ago

maksym-iv commented 3 months ago

Version & Environment

  • Redpanda version: tested on v23.3.3, v24.1.7, v24.1.9 (TLS enabled)
  • Redpanda Operator (controller image): redpanda-operator:v2.1.10-23.2.18, redpanda-operator:v2.1.25-24.1.9
  • Helm chart: 5.8.8, replicas: 3
  • K8s: GKE

What went wrong?

I have deployed Redpanda via the Helm chart and observed a strange issue: conntrack entries on the node constantly increase. After a bit of debugging I found that most of the connections are made from other Redpanda pods to port 9644 (admin). The connections are in the ESTABLISHED state in the conntrack table; a short example:

```
tcp 6 86391 ESTABLISHED src=10.104.23.5 dst=10.104.17.5 sport=37702 dport=9644 src=10.104.17.5 dst=10.104.23.5 sport=9644 dport=37702 [ASSURED] mark=0 use=1
tcp 6 86385 ESTABLISHED src=10.104.22.5 dst=10.104.17.5 sport=56562 dport=9644 src=10.104.17.5 dst=10.104.22.5 sport=9644 dport=56562 [ASSURED] mark=0 use=1
tcp 6 86389 ESTABLISHED src=10.104.23.5 dst=10.104.17.5 sport=51330 dport=9644 src=10.104.17.5 dst=10.104.23.5 sport=9644 dport=51330 [ASSURED] mark=0 use=1
tcp 6 86396 ESTABLISHED src=10.104.17.5 dst=10.104.23.5 sport=56806 dport=9644 src=10.104.23.5 dst=10.104.17.5 sport=9644 dport=56806 [ASSURED] mark=0 use=1
tcp 6 86396 ESTABLISHED src=10.104.17.5 dst=10.104.22.5 sport=49420 dport=9644 src=10.104.22.5 dst=10.104.17.5 sport=9644 dport=49420 [ASSURED] mark=0 use=1
tcp 6 86392 ESTABLISHED src=10.104.17.5 dst=10.104.22.5 sport=48348 dport=9644 src=10.104.22.5 dst=10.104.17.5 sport=9644 dport=48348 [ASSURED] mark=0 use=1
tcp 6 86395 ESTABLISHED src=10.104.17.5 dst=10.104.22.5 sport=60952 dport=9644 src=10.104.22.5 dst=10.104.17.5 sport=9644 dport=60952 [ASSURED] mark=0 use=1
tcp 6 86390 ESTABLISHED src=10.104.22.5 dst=10.104.17.5 sport=40118 dport=9644 src=10.104.17.5 dst=10.104.22.5 sport=9644 dport=40118 [ASSURED] mark=0 use=1
tcp 6 86396 ESTABLISHED src=10.104.17.5 dst=10.104.22.5 sport=49598 dport=9644 src=10.104.22.5 dst=10.104.17.5 sport=9644 dport=49598 [ASSURED] mark=0 use=1
tcp 6 86393 ESTABLISHED src=10.104.17.5 dst=10.104.23.5 sport=50504 dport=9644 src=10.104.23.5 dst=10.104.17.5 sport=9644 dport=50504 [ASSURED] mark=0 use=1
tcp 6 86399 ESTABLISHED src=10.104.17.5 dst=10.104.23.5 sport=53916 dport=9644 src=10.104.23.5 dst=10.104.17.5 sport=9644 dport=53916 [ASSURED] mark=0 use=1
tcp 6 86394 ESTABLISHED src=10.104.17.5 dst=10.104.22.5 sport=59580 dport=9644 src=10.104.22.5 dst=10.104.17.5 sport=9644 dport=59580 [ASSURED] mark=0 use=1
tcp 6 86398 ESTABLISHED src=10.104.17.5 dst=10.104.22.5 sport=51476 dport=9644 src=10.104.22.5 dst=10.104.17.5 sport=9644 dport=51476 [ASSURED] mark=0 use=1
tcp 6 86390 ESTABLISHED src=10.104.17.5 dst=10.104.23.5 sport=55314 dport=9644 src=10.104.23.5 dst=10.104.17.5 sport=9644 dport=55314 [ASSURED] mark=0 use=1
tcp 6 86388 ESTABLISHED src=10.104.17.5 dst=10.104.23.5 sport=46608 dport=9644 src=10.104.23.5 dst=10.104.17.5 sport=9644 dport=46608 [ASSURED] mark=0 use=2
```
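For reference, the entries above can be counted directly on a node (or from a privileged hostNetwork debug pod). This is a minimal sketch assuming the conntrack CLI and/or /proc/net/nf_conntrack are available on the node; exact tooling varies by node image:

```bash
# Count ESTABLISHED conntrack entries towards the admin API port (9644).
sudo conntrack -L -p tcp --dport 9644 2>/dev/null | grep -c ESTABLISHED

# Fallback if the conntrack CLI is not installed:
sudo grep -c 'dport=9644' /proc/net/nf_conntrack

# Watch the total entry count over time; on an affected node it only grows.
watch -n 60 'sudo conntrack -C'
```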

The node_nf_conntrack_entries graph for the Redpanda nodes looks like this:

[Screenshot 2024-07-15 17:01:08: node_nf_conntrack_entries graph]

I noticed the issue when Redpanda went down with "Connection error" messages; the node_nf_conntrack_entries graph looked like this at that point (a simple restart of the pods helped, of course):

[Screenshot 2024-07-15 17:03:37: node_nf_conntrack_entries graph at the time of the outage]
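A plausible mechanism for the outage is the conntrack table hitting its limit, at which point the kernel drops packets for new connections. A hedged way to check this on the node (standard netfilter sysctls; the dmesg message only appears if the table actually filled up):

```bash
# Compare the node's conntrack limit with the current entry count.
cat /proc/sys/net/netfilter/nf_conntrack_max
cat /proc/sys/net/netfilter/nf_conntrack_count

# If the table overflowed, the kernel logs "nf_conntrack: table full, dropping packet".
dmesg | grep -i 'nf_conntrack' | tail
```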

Additionally, after 6 days without restarting Redpanda, there are 37,338 connections in the ESTABLISHED state to port 9644 on one of the nodes (I haven't checked the other nodes, but I'm pretty sure the number is roughly the same).
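To see who actually holds the client side of those sockets, it helps to look inside the pod network namespace rather than on the host. A sketch, assuming the release is named redpanda in namespace redpanda (the chart's default naming) and that ss is available in the image; if it is not, an ephemeral debug container such as nicolaka/netshoot can be attached to the pod:

```bash
# Established connections to/from the admin port as seen from inside a broker pod.
kubectl -n redpanda exec redpanda-0 -c redpanda -- \
  ss -tnp state established '( sport = :9644 or dport = :9644 )' | head -n 20

# If ss is missing from the Redpanda image, use an ephemeral debug container,
# which shares the pod's network namespace.
kubectl -n redpanda debug -it redpanda-0 --image=nicolaka/netshoot -- \
  ss -tn state established '( dport = :9644 )'
```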

Side note

I noticed the issue a while ago, and multiple things had been changed in the Redpanda cluster itself in the meantime. I reverted all of them with no effect, except one change I had forgotten about: statefulset.sideCars.controllers.enabled had also been set to true. After disabling the controllers sidecar, the conntrack entries came back to normal. Additionally, with DEBUG logging enabled on Redpanda I noticed /v1/cluster/health_overview requests every few seconds, which vanished after disabling the sidecar. I suspect the operator image may be creating keep-alive connections without reusing them, or the Redpanda admin component may force keep-alive with a predefined timeout while the operator (controller sidecar image) does not honor it; however, I'm not sure about this, since I haven't dived into the operator code itself. The node_nf_conntrack_entries graph before/after disabling the sidecar:

[Screenshot 2024-07-15 17:21:19: node_nf_conntrack_entries before/after disabling the sidecar]
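For completeness, the /v1/cluster/health_overview endpoint seen in the DEBUG logs can be called manually against the admin API to confirm what the sidecar is polling. A sketch assuming a port-forward to one broker; with TLS enabled, -k skips certificate verification and is only appropriate for a quick check:

```bash
# Forward the admin API port of one broker locally, then poll the endpoint once.
kubectl -n redpanda port-forward redpanda-0 9644:9644 &
curl -sk https://localhost:9644/v1/cluster/health_overview
```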

Note: this was tested with multiple Redpanda and operator (controller) versions, as listed in the Version & Environment section.

What should have happened instead?

The conntrack table should not constantly fill up when statefulset.sideCars.controllers.enabled is set to true.

How to reproduce the issue?

  1. Run Redpanda via the Helm chart
  2. Enable the controllers sidecar (a hedged helm command is sketched below):
    • statefulset.sideCars.controllers.enabled: true
    • statefulset.sideCars.controllers.run: ["decommission"]
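A sketch of the reproduction with helm, assuming the upstream chart from https://charts.redpanda.com and placeholder release/namespace names:

```bash
# Install the chart with the controllers sidecar enabled (values from this report).
helm repo add redpanda https://charts.redpanda.com
helm upgrade --install redpanda redpanda/redpanda \
  --namespace redpanda --create-namespace \
  --version 5.8.8 \
  --set statefulset.replicas=3 \
  --set statefulset.sideCars.controllers.enabled=true \
  --set 'statefulset.sideCars.controllers.run={decommission}'

# Then watch conntrack on the nodes (see the commands earlier in this issue)
# and check whether the entry count keeps growing.
```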

Additional information

N/A, attached in previous sections

JIRA Link: CORE-5618

dotnwat commented 3 months ago

Thanks! Pinging someone about this.

github-actions[bot] commented 4 days ago

This issue hasn't seen activity in 3 months. If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.