ministryofjustice / cloud-platform

Documentation on the MoJ cloud platform
MIT License

Investigate FluentbitIsCrashLoopBackoffing alert #6060

Open timckt opened 3 weeks ago

timckt commented 3 weeks ago

Background

The FluentbitIsCrashLoopBackoffing alert has fired repeatedly over the last few days.

Most recent occurrence: https://mojdt.slack.com/archives/C8QR5FQRX/p1724111122869289


timckt commented 3 weeks ago

https://docs.google.com/document/d/1FcnPtnxcpZDOiE9k5HTjbO3MHjvGJoH2TgPLO2rfGhI/edit

timckt commented 3 weeks ago

Update:

timckt commented 3 weeks ago

For the high conntrack usage on node 172.20.56.77 on 21 Aug at around 22:50 (see Grafana), Kibana shows some liveness/readiness probe failures (query: "ip-172-20-56-77.eu-west-2.compute.internal" and ("probe failed")).

For the fluentbit pod fluent-bit-c5pcg on node 172.20.56.77, these probe failures did not trigger a restart, because the number of consecutive failures did not reach the failureThreshold of 3.

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v1/health
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
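
To make the threshold reasoning concrete, here is a minimal Python sketch of how consecutive probe failures relate to a restart under the settings above. It is a simplified model for illustration, not kubelet code: the function names are hypothetical, it assumes probes run exactly every periodSeconds, and it assumes each failure is a timeout. Note that only liveness probe failures can restart a container; readiness failures just mark the pod as not ready.

```python
def liveness_restart_triggered(results, failure_threshold=3):
    """Return True if `results` contains `failure_threshold` consecutive failures.

    `results` is an ordered list of booleans, True meaning the probe succeeded.
    Any success resets the counter (successThreshold is 1 for liveness probes).
    """
    consecutive_failures = 0
    for ok in results:
        if ok:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
    return False


def approx_seconds_until_restart(failure_threshold=3, period_seconds=10, timeout_seconds=1):
    """Rough time from the start of the first failing probe to the restart decision.

    Assumes evenly spaced probes and that every failure is a timeout,
    so the last failure is only known after `timeout_seconds`.
    """
    return (failure_threshold - 1) * period_seconds + timeout_seconds


# Two failures followed by a success never reach the threshold of 3.
print(liveness_restart_triggered([False, False, True, False]))
# Three failures in a row do.
print(liveness_restart_triggered([True, False, False, False]))
print(approx_seconds_until_restart())
```

Under these assumptions the endpoint has to be unresponsive for roughly 20 seconds without a break before the liveness probe restarts the container, so short bursts of probe failures like the ones seen on this node would not do it on their own.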