Investigate FluentbitIsCrashLoopBackoffing alert

timckt commented 3 weeks ago

Background

The FluentbitIsCrashLoopBackoffing alert has fired repeatedly over the last few days.

Most recent occurrence: https://mojdt.slack.com/archives/C8QR5FQRX/p1724111122869289

Proposed user journey

Approach

Which part of the user docs does this impact

Communicate changes

[ ] post for #cloud-platform-update
[ ] Weeknotes item
[ ] Show the Thing/P&A All Hands/User CoP
[ ] Announcements channel

Questions / Assumptions

Definition of done

[ ] readme has been updated
[ ] user docs have been updated
[ ] another team member has reviewed
[ ] smoke tests are green
[ ] prepare demo for the team

Reference

How to write good user stories

timckt commented 3 weeks ago

https://docs.google.com/document/d/1FcnPtnxcpZDOiE9k5HTjbO3MHjvGJoH2TgPLO2rfGhI/edit

timckt commented 3 weeks ago

Update:

fluent-bit-vr8jd runs on node 172.20.118.19.
Narrowing the timeframe to 2024-08-19 21:00 through 2024-08-20 02:00, we can see other pods restarting on node 172.20.118.19 during the same period (kibana query: "ip-172-20-118-19.eu-west-2.compute.internal" and "restarted").
There were 35 restarts on 172.20.118.19 (the node hosting fluent-bit-vr8jd).
- On node 172.20.118.19, the pods restarted because of failed liveness probes.
During this time range, High NF Conntrack was showing for 172.20.118.19 (grafana).
Theory for what triggers FluentbitIsCrashLoopBackoffing:
- The node has a high number of conntrack entries.
- When the connection tracking table is full, new connections may not be tracked properly, which leads to network failures for those connections. This can cause dropped connections, timeouts, or packet loss, and affects the pods' ability to communicate.
- Connections for the liveness / readiness probes fail.
- The pod restarts.
- FluentbitIsCrashLoopBackoffing is triggered.
- So the root cause is likely the conntrack issue rather than fluentbit itself.
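
To check this theory directly on a node, conntrack table usage can be compared against its limit. The following is a minimal sketch, assuming shell access to the node and that the nf_conntrack module is loaded so the standard procfs counters exist. The 0.75 warning threshold and the function names are illustrative assumptions, not values taken from our alerting rules.

```python
from pathlib import Path

# Standard Linux procfs locations for the netfilter conntrack counters.
# They are only present when the nf_conntrack module is loaded.
CONNTRACK_COUNT = Path("/proc/sys/net/netfilter/nf_conntrack_count")
CONNTRACK_MAX = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_usage(count: int, maximum: int) -> float:
    """Return the fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def is_conntrack_high(count: int, maximum: int, threshold: float = 0.75) -> bool:
    """True when usage is at or above the warning threshold.

    The default threshold is a hypothetical value chosen for illustration.
    """
    return conntrack_usage(count, maximum) >= threshold


def read_counter(path: Path) -> int:
    """Read a single integer counter from a procfs file."""
    return int(path.read_text().strip())


def read_node_usage() -> float:
    """Read both counters from the local node and return the usage fraction."""
    return conntrack_usage(read_counter(CONNTRACK_COUNT), read_counter(CONNTRACK_MAX))
```

Calling `read_node_usage()` on the affected node while probes are failing would show whether the table is actually near its limit at that moment, which is what the theory predicts.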

timckt commented 3 weeks ago

For the high conntrack on node 172.20.56.77 on 21 Aug at around 22:50 (grafana here), there were some liveness/readiness probe failures (kibana query: "ip-172-20-56-77.eu-west-2.compute.internal" and ("probe failed")).

For fluentbit pod fluent-bit-c5pcg on node 172.20.56.77, there were:

2 readiness probe failures (kibana query: "ip-172-20-56-77.eu-west-2.compute.internal" and ("fluent") and ("readiness"))
2 liveness probe failures (kibana query: "ip-172-20-56-77.eu-west-2.compute.internal" and ("fluent") and ("liveness"))

These failures did not restart the fluentbit pod because they stayed below the failureThreshold, which is 3:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v1/health
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
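
The threshold reasoning above can be sketched as a small model. This is a simplified illustration, assuming the kubelet counts consecutive failures and that a single success resets the count (successThreshold: 1, as in the config). `ProbeConfig`, `max_consecutive_failures` and `triggers_restart` are hypothetical names for this example, not part of any Kubernetes API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeConfig:
    """Mirrors the probe settings shown in the config above."""
    failure_threshold: int = 3
    period_seconds: int = 10
    timeout_seconds: int = 1


def max_consecutive_failures(results: Iterable[bool]) -> int:
    """Return the longest run of failed probes (False) in the sequence."""
    longest = current = 0
    for ok in results:
        # A successful probe resets the run; a failure extends it.
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def triggers_restart(results: Iterable[bool], config: ProbeConfig = ProbeConfig()) -> bool:
    """True if the liveness results contain enough consecutive failures to restart."""
    return max_consecutive_failures(results) >= config.failure_threshold


# Two failed liveness probes, as seen for fluent-bit-c5pcg, stay under the threshold.
observed = [True, False, False, True]
print(triggers_restart(observed))
```

Under this model, two failures never reach the threshold of 3, which matches fluent-bit-c5pcg not restarting. A third consecutive failure would.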
