ministryofjustice / cloud-platform

Documentation on the MoJ cloud platform
MIT License

Upgrade fluent-bit #5449

Open sj-williams opened 5 months ago

sj-williams commented 5 months ago

Background

A new major release of fluent-bit is available; we are currently on version 2.2.1.

https://fluentbit.io/announcements/v3.0.0/

https://fluentbit.io/announcements/v3.0.1/

The associated Helm chart release is v0.46.1: https://github.com/fluent/helm-charts/blob/fluent-bit-0.46.1/charts/fluent-bit/Chart.yaml#L8-L9

Approach

Fluent don't publish specifics on Kubernetes version compatibility, and the release and upgrade notes make no mention of breaking changes. However, this is a major release with new functionality, so the changelogs need thorough checking and the upgrade needs thorough testing.

Which part of the user docs does this impact

Communicate changes

Questions / Assumptions

Definition of done

Reference

How to write good user stories

sj-williams commented 4 months ago

logging: https://github.com/ministryofjustice/cloud-platform-terraform-logging/tree/bump-helm-version

ingress: https://github.com/ministryofjustice/cloud-platform-terraform-ingress-controller/tree/bump-fluent-bit

sj-williams commented 4 months ago

Upgrade applied; image runbook updated.

sj-williams commented 4 months ago

Rolling back the logging fluent-bit release to 2.2.1 via Helm, following frequent crashes (2-3 times daily) after the 3.0.2 upgrade:

[2024/05/16 08:26:03] [engine] caught signal (SIGSEGV)
#0  0x5580c05d5c49      in  flb_log_event_encoder_dynamic_field_flush_scopes() at src/flb_log_event_encoder_dynamic_field.c:210
#1  0x5580c05d5c49      in  flb_log_event_encoder_dynamic_field_reset() at src/flb_log_event_encoder_dynamic_field.c:240
#2  0x5580c05d3bac      in  flb_log_event_encoder_reset() at src/flb_log_event_encoder.c:33
#3  0x5580c0602f1f      in  ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:418
#4  0x5580c0602f1f      in  ml_flush_callback() at plugins/in_tail/tail_file.c:919
#5  0x5580c05b8757      in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1515
#6  0x5580c05b8eb5      in  flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#7  0x5580c05d6c1c      in  flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#8  0x5580c06036ac      in  flb_tail_file_remove() at plugins/in_tail/tail_file.c:1249
#9  0x5580c05ff405      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:242
#10 0x5580c0588894      in  flb_input_collector_fd() at src/flb_input.c:1949
#11 0x5580c05a2507      in  flb_engine_handle_event() at src/flb_engine.c:575
#12 0x5580c05a2507      in  flb_engine_start() at src/flb_engine.c:941
#13 0x5580c057e153      in  flb_lib_worker() at src/flb_lib.c:674
#14 0x7fdcebc0fea6      in  ???() at ???:0
#15 0x7fdceb4c3a6e      in  ???() at ???:0
#16 0xffffffffffffffff  in  ???() at ???:0

There's an open upstream issue for this crash: https://github.com/fluent/fluent-bit/issues/8779

Keep an eye on that issue for updates and any resolution.
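
When the upgrade is retried, it may help to measure crash frequency rather than eyeball it. Below is a minimal Python sketch that counts SIGSEGV crashes per day from saved fluent-bit log lines. It assumes the crash line keeps the format shown in the trace above; the function name `crashes_per_day` and the sample input are hypothetical, for illustration only.

```python
import re
from collections import Counter

# Assumed format, taken from the trace in this issue:
#   [2024/05/16 08:26:03] [engine] caught signal (SIGSEGV)
CRASH_RE = re.compile(
    r"\[(\d{4}/\d{2}/\d{2}) \d{2}:\d{2}:\d{2}\] \[engine\] caught signal \(SIGSEGV\)"
)


def crashes_per_day(lines):
    """Return a dict mapping 'YYYY/MM/DD' to the number of SIGSEGV crash lines."""
    counts = Counter()
    for line in lines:
        # search() rather than match(), so a prefix such as a pod name is tolerated
        found = CRASH_RE.search(line)
        if found:
            counts[found.group(1)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical sample input; real input would be lines read from saved pod logs.
    sample = [
        "[2024/05/16 08:26:03] [engine] caught signal (SIGSEGV)",
        "#0  0x5580c05d5c49      in  flb_log_event_encoder_dynamic_field_flush_scopes()",
        "[2024/05/16 19:02:41] [engine] caught signal (SIGSEGV)",
        "[2024/05/17 03:15:10] [engine] caught signal (SIGSEGV)",
    ]
    for day, count in sorted(crashes_per_day(sample).items()):
        print(day, count)
```

A day with no matching lines simply does not appear in the result, so an empty dict after a soak period would suggest this crash is no longer occurring.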