Open timckt opened 3 weeks ago
Update:
fluent-bit-vr8jd sits on node 172.20.118.19.
When we narrow the timeframe to 2024-08-19 21:00 through 2024-08-20 02:00, we can see that other pods were also restarting on node 172.20.118.19 during the same period in Kibana (KQL query: "ip-172-20-118-19.eu-west-2.compute.internal" and "restarted").
There were 35 restarts on 172.20.118.19 (the node where fluent-bit-vr8jd sits).
On 172.20.118.19, the pods restarted due to failed liveness probes.
During this time range, NF conntrack was high on 172.20.118.19 (grafana).
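To cross-check the Grafana reading directly on a node, the conntrack table usage can be computed from the kernel's counters. This is a minimal sketch, assuming the "High NF Conntrack" panel reflects the ratio of `nf_conntrack_count` to `nf_conntrack_max` (the issue does not say which metric the panel plots). The function names are hypothetical, and the `/proc` files only exist on Linux hosts with the conntrack module loaded.

```python
from pathlib import Path


def conntrack_usage(count: int, maximum: int) -> float:
    """Return the fraction of the conntrack table in use (0.0 to 1.0)."""
    if maximum <= 0:
        raise ValueError("maximum must be a positive integer")
    return count / maximum


def read_conntrack_usage(proc_dir: str = "/proc/sys/net/netfilter") -> float:
    """Read the current and maximum conntrack entries from procfs.

    Assumption: run on the node itself (or in a privileged pod that can
    see the host's /proc), not inside an ordinary container.
    """
    base = Path(proc_dir)
    count = int((base / "nf_conntrack_count").read_text().strip())
    maximum = int((base / "nf_conntrack_max").read_text().strip())
    return conntrack_usage(count, maximum)
```

A value close to 1.0 means the table is nearly full. When it is full, the kernel drops packets for new connections, which would be consistent with HTTP probes timing out on that node.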
Theory for triggering FluentbitIsCrashLoopBackoffing
During the high conntrack on node 172.20.56.77 on 21 Aug at around 22:50 (grafana here), there were some liveness/readiness probe failures in Kibana (KQL query: "ip-172-20-56-77.eu-west-2.compute.internal" and ("probe failed")).
For the fluentbit pod fluent-bit-c5pcg on node 172.20.56.77, there were:
2 readiness probe failures (Kibana, KQL query: "ip-172-20-56-77.eu-west-2.compute.internal" and ("fluent") and ("readiness"))
2 liveness probe failures (Kibana, KQL query: "ip-172-20-56-77.eu-west-2.compute.internal" and ("fluent") and ("liveness"))
These did not trigger a restart of the fluentbit pod, because the failures did not reach the failure threshold of 3.
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /api/v1/health
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
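
The threshold behaviour described above can be illustrated with a small simulation. This is a simplified sketch of how consecutive liveness failures are counted, not the kubelet's actual implementation: it ignores timing, `initialDelaySeconds` and restart back-off, and the function name `restarts_triggered` is hypothetical.

```python
from typing import Iterable


def restarts_triggered(probe_results: Iterable[bool], failure_threshold: int = 3) -> int:
    """Count the restarts a liveness probe would trigger.

    Each item in probe_results is True for a successful probe and False
    for a failed one. A restart is counted once failure_threshold
    consecutive failures are seen; any success resets the streak.
    """
    restarts = 0
    streak = 0
    for ok in probe_results:
        if ok:
            streak = 0  # a single success clears the failure streak
            continue
        streak += 1
        if streak >= failure_threshold:
            restarts += 1
            streak = 0  # the container restarts, so counting starts over
    return restarts


# Two liveness failures, as seen for fluent-bit-c5pcg: no restart.
print(restarts_triggered([True, False, False, True]))
# Three consecutive failures: one restart.
print(restarts_triggered([True, False, False, False, True]))
```

With `periodSeconds: 10` and `failureThreshold: 3`, a container has to fail its liveness probe for roughly 30 seconds in a row before it is restarted, so short conntrack spikes can produce probe failures without a restart, while longer ones can produce the restarts seen on 172.20.118.19.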
Background
We have seen FluentbitIsCrashLoopBackoffing keep coming up over the last few days. Recent: https://mojdt.slack.com/archives/C8QR5FQRX/p1724111122869289