I have deployed RedPanda via the Helm chart and observe a strange issue: conntrack entries on the node constantly increase.
After some debugging I found that most of the connections are made from other RedPanda pods to port 9644 (admin). The connections are in the ESTABLISHED state in the conntrack table, short example:
The graph of node_nf_conntrack_entries on the RedPanda nodes looks like this
I noticed the issue when RedPanda went down with "Connection error" messages. The node_nf_conntrack_entries graph looked like this when RedPanda went down (a simple restart of the pods helped, of course)
Additionally, after 6 days without restarting RedPanda there are 37338 connections in the ESTABLISHED state to port 9644 on one of the nodes (I haven't checked the other nodes, but I'm fairly sure the number is roughly the same).
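For anyone who wants to repeat the count, here is a minimal sketch of how such a number can be obtained from `conntrack -L` output saved to text. It is not the exact command I used; the function name and the sample lines are made up for illustration, and the parser only assumes the usual `key=value` layout where the first `src`/`dport` pair on a line describes the original direction.

```python
from collections import Counter


def established_to_port(lines, port=9644):
    """Count ESTABLISHED TCP conntrack entries per source IP for a destination port.

    `lines` is an iterable of text lines in `conntrack -L` format.
    """
    per_source = Counter()
    for line in lines:
        fields = line.split()
        # Expected layout for TCP: "tcp 6 <timeout> <STATE> key=value ..."
        if len(fields) < 4 or fields[0] != "tcp" or fields[3] != "ESTABLISHED":
            continue
        # Each key appears twice (original and reply tuple); keep the first one.
        original = {}
        for field in fields[4:]:
            key, sep, value = field.partition("=")
            if sep and key not in original:
                original[key] = value
        if original.get("dport") == str(port):
            per_source[original.get("src", "unknown")] += 1
    return per_source


# Made-up sample lines, only to show the expected input format.
SAMPLE = [
    "tcp 6 431990 ESTABLISHED src=10.0.0.1 dst=10.0.0.3 sport=40312 dport=9644 "
    "src=10.0.0.3 dst=10.0.0.1 sport=9644 dport=40312 [ASSURED] mark=0 use=1",
    "tcp 6 431980 ESTABLISHED src=10.0.0.2 dst=10.0.0.3 sport=51544 dport=9644 "
    "src=10.0.0.3 dst=10.0.0.2 sport=9644 dport=51544 [ASSURED] mark=0 use=1",
    "tcp 6 100 TIME_WAIT src=10.0.0.1 dst=10.0.0.3 sport=40400 dport=9644 "
    "src=10.0.0.3 dst=10.0.0.1 sport=9644 dport=40400 [ASSURED] mark=0 use=1",
    "tcp 6 431000 ESTABLISHED src=10.0.0.1 dst=10.0.0.3 sport=40500 dport=9093 "
    "src=10.0.0.3 dst=10.0.0.1 sport=9093 dport=40500 [ASSURED] mark=0 use=1",
]

if __name__ == "__main__":
    counts = established_to_port(SAMPLE)
    print(sum(counts.values()), dict(counts))
```

Grouping by source IP makes it easy to see which pods open the connections, which is how the other RedPanda pods showed up as the origin.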
Side note
I noticed the issue a while ago, and several things had been changed in the RedPanda cluster itself since then, like:
Changes to the PodMonitoring resource (the GKE analogue of the Prometheus ServiceMonitor)
Enabling Cilium network policies
...
I reverted all of them, with no change.
But there was one change I had forgotten about: statefulset.sideCars.controllers.enabled had also been set to true.
After disabling the sidecar controllers, the conntrack entries went back to normal.
Additionally, with DEBUG logging enabled on RedPanda I noticed /v1/cluster/health_overview requests every few seconds, which vanished after disabling the sidecar.
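As a rough cross-check, the 37338 established connections seen after 6 days can be turned into an average rate. This is only back-of-the-envelope arithmetic: it assumes the count grew linearly from about zero after the last restart and that no entries expired in between, neither of which I measured.

```python
# Figures taken from the report: 37338 ESTABLISHED entries after 6 days.
established = 37338
days = 6

seconds = days * 24 * 3600               # 518400 seconds in 6 days
per_day = established / days              # average new entries per day
interval = seconds / established          # average seconds between new entries

if __name__ == "__main__":
    print(f"{per_day:.0f} per day, one new entry every {interval:.1f} s")
```

That works out to roughly one leaked entry every 14 seconds on that node, which is in the same ballpark as health checks arriving every few seconds from the sidecars of the other pods.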
I suspect that the operator image creates keep-alive connections without reusing them, or that the RedPanda admin component forces keep-alive with a predefined timeout while the operator (controller sidecar image) does not honor it. I'm not sure about this, since I haven't dived into the operator code itself.
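To illustrate the kind of behavior I suspect (this is not the operator's code, which I haven't read), here is a small self-contained sketch using only the Python standard library. A local HTTP/1.1 server stands in for the admin API, and a hypothetical `poll` helper either reuses one keep-alive connection or opens a new one for every request and leaves it open. The server counts accepted TCP connections, each of which would correspond to a conntrack entry on a real node.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep connections alive between requests

    def do_GET(self):
        body = b"{}"
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


class CountingServer(ThreadingHTTPServer):
    """HTTP server that counts accepted TCP connections."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.accepted = 0

    def get_request(self):
        request = super().get_request()
        self.accepted += 1
        return request


def poll(port, count, reuse):
    """Send `count` health checks; return the connections still open afterwards."""
    shared = HTTPConnection("127.0.0.1", port, timeout=5) if reuse else None
    open_conns = [shared] if reuse else []
    for _ in range(count):
        if reuse:
            conn = shared
        else:
            # New connection per request, never closed: the suspected leak.
            conn = HTTPConnection("127.0.0.1", port, timeout=5)
            open_conns.append(conn)
        conn.request("GET", "/v1/cluster/health_overview")
        conn.getresponse().read()
    return open_conns


def run_demo(count=5):
    """Return {reuse_flag: number of TCP connections the server accepted}."""
    results = {}
    for reuse in (True, False):
        server = CountingServer(("127.0.0.1", 0), Handler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        conns = poll(server.server_address[1], count, reuse)
        results[reuse] = server.accepted
        for conn in conns:
            conn.close()
        server.shutdown()
        server.server_close()
    return results


if __name__ == "__main__":
    print(run_demo())
```

With five polls, the reusing client should show up as a single accepted connection and the non-reusing client as five. If the sidecar behaves like the second case, every health check would leave another ESTABLISHED entry behind until the socket is closed or times out.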
node_nf_conntrack_entries graph before/after disabling the sidecar
Note: tested with multiple RedPanda and operator (controller) versions, listed in the Version & Environment section
What should have happened instead?
The conntrack table does not constantly fill up with statefulset.sideCars.controllers.enabled: true
Version & Environment
Redpanda version: tested on v23.3.3, v24.1.7, v24.1.9
TLS enabled
Redpanda Operator (controller image): redpanda-operator:v2.1.10-23.2.18, redpanda-operator:v2.1.25-24.1.9
Helm chart: 5.8.8, replicas: 3
K8s: GKE
How to reproduce the issue?
statefulset.sideCars.controllers.enabled: true
statefulset.sideCars.controllers.run: ["decommission"]
Additional information
N/A, attached in previous sections
JIRA Link: CORE-5618