stevehipwell / helm-charts

Helm chart repository.
MIT License

Aggregator Not Sending Logs to outputs After Running for a Few Hours #789

Closed: Rmaabari closed this 2 weeks ago

Rmaabari commented 1 year ago

Issue Description:

Problem: After deploying the Fluent Bit Aggregator Helm Chart and running it for a few hours, it stops sending logs to Elasticsearch and Syslog, which are the intended destinations for log forwarding.

Expected Behavior: The Fluent Bit Aggregator should consistently and reliably forward logs to the specified Elasticsearch and Syslog destinations as configured in the Helm Chart.

Steps to Reproduce:

1. Deploy Fluent Bit Aggregator using the provided Helm Chart.
2. Monitor the log forwarding functionality for a few hours.
3. Observe that log forwarding to Elasticsearch and Syslog ceases after a certain period.

Actual Results: After an initial period of successful log forwarding, Fluent Bit Aggregator stops sending logs to Elasticsearch and Syslog without any apparent errors or warnings.

Environment Details:

Kubernetes Cluster Version: 1.26
Fluent Bit Agents Version: 2.1.8
Fluent Bit Aggregator Version: 2.1.9
Elasticsearch Version: 8.9

aggregator config:

[SERVICE]
    daemon false
    http_port 2020
    http_listen 0.0.0.0
    http_server true
    log_level debug
    parsers_file /fluent-bit/etc/parsers.conf
    storage.metrics true
    storage.path /fluent-bit/data

[INPUT]
    name forward
    listen 0.0.0.0
    port 24224

[FILTER]
    Name rewrite_tag
    Match kube.*
    Rule $syslog ^(true)$ syslog.* true
    Emitter_Name re_emitted

[OUTPUT]
    Name syslog
    Match syslog.*
    Host $HOST
    Port 514
    Retry_Limit false
    Mode tcp
    Syslog_Format rfc5424
    Syslog_MaxSize 65536
    Syslog_Hostname_Key hostname
    Syslog_Appname_Key appname
    Syslog_Procid_Key procid
    Syslog_Msgid_Key msgid
    Syslog_SD_Key uls@0
    Syslog_Message_Key msg

[OUTPUT]
    Name es
    Match kube.*
    HTTP_User $USER
    HTTP_Passwd $PASS
    tls Off
    tls.verify Off
    Host elastic-elasticsearch
    Port 9200
    Retry_Limit False
    Trace_Error On
    Trace_Output Off
    Suppress_Type_Name On
    Replace_Dots On
    Buffer_Size False
    Logstash_Prefix logstash
    Logstash_Format On
    Index logstash
    Generate_ID     On
    Write_Operation upsert

[OUTPUT]
    Name es
    Match host.*
    HTTP_User $USER
    HTTP_Passwd $PASS
    tls Off
    tls.verify Off
    Host elastic-elasticsearch
    Port 9200
    Retry_Limit False
    Trace_Error On
    Trace_Output Off
    Suppress_Type_Name On
    Replace_Dots On
    Buffer_Size False
    Logstash_Prefix logstash
    Logstash_Format On
    Index logstash
    Write_Operation upsert
    Generate_ID     On
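
A note on the storage settings above: storage.path in [SERVICE] only sets where filesystem chunks would be written; buffering stays in memory unless each input opts in. A minimal sketch of what disk buffering on the aggregator could look like (the 1G cap is illustrative, not taken from the config above):

[INPUT]
    name forward
    listen 0.0.0.0
    port 24224
    # opt this input into filesystem buffering; chunks are written under storage.path
    storage.type filesystem

[OUTPUT]
    name es
    match kube.*
    # cap how much disk the queued chunks for this output may use
    storage.total_limit_size 1G

With memory-only buffering and Retry_Limit false, a long ES outage lets the retry queue grow without bound in memory, which would be consistent with logs stopping after a few hours and a restart appearing to fix it.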

fluent-bit agents config:

custom_parsers.conf:
----
[PARSER]
    Name docker_no_time
    Format json
    Time_Keep Off
    Time_Key time
    Time_Format %Y-%m-%dT%H:%M:%S.%L

[FILTER]
    Name    grep
    Match   *
    Exclude log liveness

[FILTER]
    Name    grep
    Match   *
    Exclude log readiness

[SERVICE]
    Daemon Off
    Flush 5
    Log_Level debug
    Parsers_File /fluent-bit/etc/parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    Health_Check On

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Exclude_Path      /var/log/containers/*_monitoring_*.log
    multiline.parser docker, cri
    Tag kube.*
    Mem_Buf_Limit 50MB
    Buffer_Max_Size 1MB
    Skip_Long_Lines Off

[INPUT]
    Name systemd
    Tag host.*
    Systemd_Filter _SYSTEMD_UNIT=kubelet.service
    Read_From_Tail On

[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log On
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On

[OUTPUT]
    Name    forward
    Match   *
    Host    fluent-bit-aggregator
    Port    24224
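
On the agent side, the tail input is capped at Mem_Buf_Limit 50MB. If the aggregator stops accepting records, the forward output applies backpressure; once that limit is hit the tail input pauses until memory frees, and lines in files that rotate away while paused can be lost. A sketch of opting the agent's tail input into filesystem buffering instead (this assumes storage.path is also set in the agent's [SERVICE] section):

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Tag kube.*
    # buffer to disk instead of pausing at Mem_Buf_Limit when the aggregator is slow
    storage.type filesystem
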
stevehipwell commented 1 year ago

@Rmaabari this repo just hosts Helm charts and the Fluent Bit Aggregator chart is a convenient way to run Fluent Bit as a StatefulSet. Your actual configuration is input into the chart and isn't part of the chart logic.

If you're having trouble with Fluent Bit, have turned on debug logs, and think there is an issue, your best course of action would be to look at the existing issues and, if none match, open a new issue at fluent/fluent-bit.

Rmaabari commented 1 year ago

Hi @stevehipwell, thanks for your reply. I am running the Fluent Bit agents from the original fluent-bit (DaemonSet) Helm chart repo, and your aggregator Helm chart as a StatefulSet.

Regarding the logs, nothing seems unusual; attaching some of them:

[2023/09/17 12:07:58] [debug] [out flush] cb_destroy coro_id=7942
[2023/09/17 12:07:58] [debug] [retry] re-using retry for task_id=1959 attempts=19
[2023/09/17 12:07:58] [ warn] [engine] failed to flush chunk '1-1694939682.183824748.flb', retry in 1069 seconds: task_id=1959, input=forward.0 > output=es.1 (out_id=1)
[2023/09/17 12:07:59] [debug] [output:es:es.1] task_id=1354 assigned to thread #1
[2023/09/17 12:07:59] [debug] [output:es:es.1] task_id=1642 assigned to thread #0
[2023/09/17 12:07:59] [debug] [output:es:es.1] task_id=685 assigned to thread #1
[2023/09/17 12:07:59] [debug] [upstream] KA connection #96 to elastic-elasticsearch:9200 has been assigned (recycled)
[2023/09/17 12:07:59] [debug] [upstream] KA connection #91 to elastic-elasticsearch:9200 has been assigned (recycled)
[2023/09/17 12:07:59] [debug] [out_es] converted_size is 0
    (last line repeated 15 times)
[2023/09/17 12:07:59] [debug] [http_client] not using http_proxy for header
[2023/09/17 12:07:59] [debug] [out_es] converted_size is 0
    (last line repeated 12 times)
[2023/09/17 12:07:59] [debug] [http_client] not using http_proxy for header
[2023/09/17 12:07:59] [debug] [upstream] KA connection #89 to elastic-elasticsearch:9200 has been assigned (recycled)
[2023/09/17 12:07:59] [debug] [out_es] converted_size is 0
    (last line repeated 22 times)
[2023/09/17 12:07:59] [debug] [http_client] not using http_proxy for header
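
The "retry in 1069 seconds" line above is worth noting: with Retry_Limit false the ES output retries failed chunks forever, and the scheduler's exponential backoff can push individual retries out by many minutes, which can look like logs silently stopping. If that is the issue, the backoff can be capped in the service section; a minimal sketch (values are illustrative):

[SERVICE]
    # base and cap, in seconds, for the retry scheduler's exponential backoff
    scheduler.base 5
    scheduler.cap 300
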
stevehipwell commented 1 year ago

@Rmaabari the interesting logs would be from when the output to ES failed. But unless it's caused by a defect in the chart you're going to need to open an issue on the Fluent Bit repo to figure out if this is a bug or a configuration issue.

If you provide me with the chart values you used and the steps you take to resolve a failure, I can take a look. Also, do you lose logs as part of this?

Have you checked the logs on the ES side to see if there is an issue there? If ES is erroring and FB has no persistence, a restart fixing the issue would indicate that there is an issue with the configuration and/or log content.

Rmaabari commented 1 year ago

@stevehipwell thanks again for the response! I will gladly supply you with the Helm chart values.

    values:
      service:
        type: NodePort
        annotations: {}
        httpPort: 2020
        additionalPorts:
          - name: http-forward
            port: 24224
            containerPort: 24224
            protocol: TCP
      config:
        log_level: debug
        http_listen: "0.0.0.0"
        pipeline: |-
          [INPUT]
              name forward
              listen 0.0.0.0
              port 24224

          [FILTER]
              Name rewrite_tag
              Match kube.*
              Rule $syslog ^(true)$ syslog.* false
              Emitter_Name re_emitted

          [OUTPUT]
              Name syslog
              Match syslog.*
              Host $SYSLOG_SERVER
              Port 514
              Retry_Limit false
              Mode tcp
              Syslog_Format rfc5424
              Syslog_MaxSize 65536
              Syslog_Hostname_Key hostname
              Syslog_Appname_Key appname
              Syslog_Procid_Key procid
              Syslog_Msgid_Key msgid
              Syslog_SD_Key uls@0
              Syslog_Message_Key msg

          [OUTPUT]
              Name es
              Match kube.*
              HTTP_User $USER
              HTTP_Passwd $PASS
              tls Off
              tls.verify Off
              Host elastic-elasticsearch
              Port 9200
              Retry_Limit False
              Trace_Error On
              Trace_Output Off
              Suppress_Type_Name On
              Replace_Dots On
              Buffer_Size False
              Logstash_Prefix logstash
              Logstash_Format On
              Index logstash

          [OUTPUT]
              Name es
              Match host.*
              HTTP_User $USER
              HTTP_Passwd $PASS
              tls Off
              tls.verify Off
              Host elastic-elasticsearch
              Port 9200
              Retry_Limit False
              Trace_Error On
              Trace_Output Off
              Suppress_Type_Name On
              Replace_Dots On
              Buffer_Size False
              Logstash_Prefix logstash
              Logstash_Format On
              Index logstash
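
One difference from the aggregator config pasted earlier: here the rewrite_tag Rule ends in false, while the first paste used true. The last field of a rewrite_tag Rule is the keep flag: with true a copy is emitted under the new syslog.* tag and the original kube.* record is kept (so it still reaches the ES output); with false the original is dropped and matching records go to syslog only. A minimal sketch of the keep variant:

[FILTER]
    Name rewrite_tag
    Match kube.*
    # Rule is KEY REGEX NEW_TAG KEEP: 'true' keeps the original kube.* record as well
    Rule $syslog ^(true)$ syslog.* true
    Emitter_Name re_emitted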

Since the log level is set to debug, I am unable to pinpoint exactly when logs stopped being sent to Elastic. I have observed, however, that after a couple of hours without logs being sent, a very small number of logs (around 20 documents) arrive within a single minute, none of them carrying the Kubernetes filter metadata, and then delivery stops again.

The only thing that resolves this issue is restarting the StatefulSet, after which logs are sent to all expected outputs again.

I will also submit an issue to the original fluent-bit Helm chart repository.

Rmaabari commented 1 year ago

Here is a screenshot of the logs in the Kibana view (image attached).

stevehipwell commented 1 year ago

@Rmaabari how have you configured the persistence?

I'm not sure your ES output configuration is correct; it looks like you're not constraining retries or the buffer?

I'm currently on annual leave so can't get everything on a screen to review how you've got this set up. Please add a link to the FB issue you open in this issue.
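
For reference, constraining the retries and the buffer on the es outputs could look something like this (values are illustrative, not recommendations):

[OUTPUT]
    Name es
    Match kube.*
    Host elastic-elasticsearch
    Port 9200
    # bound retries per chunk; 'false' means retry forever
    Retry_Limit 5
    # bound the HTTP response buffer; 'False' means unlimited
    Buffer_Size 512KB

Note the tradeoff: a bounded Retry_Limit drops a chunk after its final failed retry, so it is worth pairing with filesystem storage if losing logs matters.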

stevehipwell commented 2 months ago

@Rmaabari is this still an issue or did you manage to resolve it?