Fleuntd stops sending logs and overflowing the buffer or queued_limit_size

Problem

Fluentd is working fine for hours and then it gets one of the two, either the buffer total_limit_size get reached and fleuntd stops working (even after I have set overflow_action to drop_oldest_chunk) or the queued_chunks_limit_size gets reached and again fleuntd stops sending. I have tried a lot of different configuration including default ones and using memory as a buffer. In any of the cases I am facing one of the two issues. Using the above configuration (this is my latest test) I got queued_chunk_limit_size reached (over 8200 file with the .meta files). The only errors i see in logs are for slow_flush_threshold from time to time. During the failure I am not observing extensive memory or CPU usage in either side Fleuntd nor Elasticsearch. Seems like the connection is just lost and never regained.

Steps to replicate

Input config on the forwarders:

        <source>
      @type tail
      path /var/log/containers/*.log
      # exclude Fluentd logs
      exclude_path [\"/var/log/containers/*fluentd*.log\", \"/var/log/containers/*kube-dash*.log\"]
      pos_file /opt/bitnami/fluentd/logs/buffers/fluentd-docker.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        keep_time_key true
      </parse>
    </source>

Output config on aggregator:

        <match **>
          @type copy
          <store>
            @type elasticsearch
            hosts hostname
            request_timeout 30s
            resurrect_after 5
            # avoid https://discuss.elastic.co/t/elasitcsearch-ruby-raises-cannot-get
            # -new-connection-from-pool-error/36252/6
            reload_connections false
            reconnect_on_error true
            reload_on_failure true
            logstash_format true
            logstash_prefix logs-eks-s-test-1
            logstash_dateformat %Y.%m.%d
            # @timestamp: use event time, not time of indexing
            time_key time
            include_tag_key true
            include_timestamp true
            <buffer>
              @type file
              path /opt/bitnami/fluentd/logs/buffers
              flush_interval 1s
              flush_thread_count 20
              chunk_limit_size 16m
              total_limit_size 2048m
              queued_chunks_limit_size 4096
              overflow_action drop_oldest_chunk
              retry_forever true
            </buffer>
          </store>

Expected Behavior or What you need to ask

Constantly running fleuntd and sending logs to elasticsearch without interruptions

Using Fluentd and ES plugin versions

OS version:

[ec2-user@ip~]$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

Bare Metal or within Docker or Kubernetes or others?:

Kubernetes

Fluentd v0.12 or v0.14/v1.0: fluentd-0:/opt/bitnami/fluentd$ fluentd --version fluentd 1.13.0
ES plugin 3.x.y/2.x.y or 1.x.y

fluentd 09:25:12.45 Welcome to the Bitnami fluentd container
fluentd 09:25:12.45 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-fluentd
fluentd 09:25:12.45 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-fluentd/issues
fluentd 09:25:12.45 
fluentd 09:25:12.45 INFO  ==> ** Starting Fluentd setup **
fluentd 09:25:12.47 INFO  ==> ** Fluentd setup finished! **
fluentd 09:25:12.48 INFO  ==> ** Starting Fluentd **
2021-07-29 09:25:13 +0000 [info]: parsing config file is succeeded path="/opt/bitnami/fluentd/conf/fluentd.conf"
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-concat' version '2.5.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-detect-exceptions' version '0.0.13'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '5.0.3'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-grafana-loki' version '1.2.16'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-kafka' version '0.16.3'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '2.7.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-multi-format-parser' version '1.0.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-prometheus' version '2.0.1'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.4.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-s3' version '1.6.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-systemd' version '1.0.5'
2021-07-29 09:25:13 +0000 [info]: gem 'fluentd' version '1.13.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluentd' version '1.12.4'
2021-07-29 09:25:14 +0000 [warn]: define <match fluent.**> to capture fluentd logs in top level is deprecated. Use <label @FLUENT_LOG> instead
2021-07-29 09:25:14 +0000 [info]: using configuration file: <ROOT>
  <match fluent.**>
    @type stdout
    @log_level "debug"
  </match>
  <source>
    @type forward
    @log_level "debug"
    bind "0.0.0.0"
    port 24224
  </source>
  <source>
    @type http
    bind "0.0.0.0"
    port 9880
  </source>
  <match fluentd.healthcheck>
    @type stdout
  </match>
  <match **>
    @type copy
    <store>
      @type "elasticsearch"
      hosts "https://search.*.es.amazonaws.com"
      request_timeout 30s
      resurrect_after 5
      reload_connections false
      reconnect_on_error true
      reload_on_failure true
      logstash_format true
      logstash_prefix "logs-eks-s-test-1"
      logstash_dateformat "%Y.%m.%d"
      time_key "time"
      include_tag_key true
      include_timestamp true
      <buffer>
        @type "file"
        path "/opt/bitnami/fluentd/logs/buffers"
        flush_interval 1s
        flush_thread_count 20
        chunk_limit_size 16m
        total_limit_size 2048m
        queued_chunks_limit_size 4096
        overflow_action drop_oldest_chunk
        retry_forever true
      </buffer>
    </store>
  </match>
</ROOT>
2021-07-29 09:25:14 +0000 [info]: starting fluentd-1.13.0 pid=1 ruby="2.6.7"
2021-07-29 09:25:14 +0000 [info]: spawn command to main:  cmdline=["/opt/bitnami/ruby/bin/ruby", "-Eascii-8bit:ascii-8bit", "/opt/bitnami/fluentd/bin/fluentd", "--config", "/opt/bitnami/fluentd/conf/fluentd.conf", "--plugin", "/opt/bitnami/fluentd/plugins", "--under-supervisor"]
2021-07-29 09:25:15 +0000 [info]: adding match pattern="fluent.**" type="stdout"
2021-07-29 09:25:15 +0000 [info]: adding match pattern="fluentd.healthcheck" type="stdout"
2021-07-29 09:25:15 +0000 [info]: adding match pattern="**" type="copy"
2021-07-29 09:25:16 +0000 [info]: #0 Detected ES 6.x: ES 7.x will only accept `_doc` in type_name.
2021-07-29 09:25:16 +0000 [info]: adding source type="forward"
2021-07-29 09:25:16 +0000 [info]: adding source type="http"
2021-07-29 09:25:16 +0000 [warn]: #0 define <match fluent.**> to capture fluentd logs in top level is deprecated. Use <label @FLUENT_LOG> instead
2021-07-29 09:25:16 +0000 [info]: #0 starting fluentd worker pid=29 ppid=1 worker=0
2021-07-29 09:25:16 +0000 [info]: #0 listening port port=24224 bind="0.0.0.0"
2021-07-29 09:25:16 +0000 [info]: #0 fluentd worker is now running worker=0
2021-07-29 09:25:16.547156525 +0000 fluent.info: {"pid":29,"ppid":1,"worker":0,"message":"starting fluentd worker pid=29 ppid=1 worker=0"}
2021-07-29 09:25:16.558448292 +0000 fluent.info: {"port":24224,"bind":"0.0.0.0","message":"listening port port=24224 bind=\"0.0.0.0\""}
2021-07-29 09:25:16.559035109 +0000 fluent.info: {"worker":0,"message":"fluentd worker is now running worker=0"}

I'm encountering a similar problem on a kubernetes AKS installation using fluentd deamonset. Configuration:

Kubernetes AKS v1.20.7
Fluentd Daemonset image fluent/fluentd-kubernetes-daemonset:v1.14-debian-elasticsearch7-1
ES plugin v5.1.5
Elastic cloud managed v7.15.1

Fluent plugin configuration

2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-concat' version '2.5.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-dedot_filter' version '1.0.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-detect-exceptions' version '0.0.14'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '5.1.5'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-grok-parser' version '2.6.2'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-json-in-json-2' version '1.0.2'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '2.9.4'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-multi-format-parser' version '1.0.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-parser-cri' version '0.1.1'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-prometheus' version '2.0.2'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.4.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-systemd' version '1.0.5'
2022-03-02 09:03:34 +0000 [info]: gem 'fluentd' version '1.14.5'

Output Configuration

   <match **>
       @type elasticsearch_data_stream
      @id out_es
       @log_level "info"
       include_tag_key true
       request_timeout 15s
       data_stream_name "logs-cluster-test"
       host "my-elastic-cloud.com"
       port 9243
       path ""
       scheme https
       ssl_verify true
       ssl_version TLSv1_2
       user "elastic"
       password xxxxxx
       <buffer>
         flush_thread_count 8
         flush_interval 5s
         chunk_limit_size 2M
         queue_limit_length 32
         retry_max_interval 30
         retry_forever true
       </buffer>
     </match>

After some hours this error occurs:

2022-03-02 02:18:42 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_times=5 next_retry_time=2022-03-02 02:18:58 +0000 chunk="5d932e3443b0395a4f70f649ed37c03e" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster (logs-cluster-test): connect_write timeout reached"
  2022-03-02 02:18:42 +0000 [warn]: #0 suppressed same stacktrace
2022-03-02 02:18:42 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_times=5 next_retry_time=2022-03-02 02:18:58 +0000 chunk="5d932e52a843cbf6d6be5aa934c9f0be" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster (logs-cluster-test): connect_write timeout reached"
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluent-plugin-elasticsearch-5.1.5/lib/fluent/plugin/out_elasticsearch_data_stream.rb:249:in 'rescue in write'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluent-plugin-elasticsearch-5.1.5/lib/fluent/plugin/out_elasticsearch_data_stream.rb:243:in 'write'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:1179:in 'try_flush'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:1495:in 'flush_thread_run'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:499:in 'block (2 levels) in start'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.5/lib/fluent/plugin_helper/thread.rb:78:in 'block in thread_create

After this error occurs fluentd keeps trying to connect to ES but no more logs are send to ES

 2022-03-02 02:21:32 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_times=10 next_retry_time=2022-03-02 02:22:02 +0000 chunk="5d932e5a693fa1544cc32d138971f5f8" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster (logs-cluster-test): connect_write timeout reached"
 2022-03-02 02:21:32 +0000 [warn]: #0 suppressed same stacktrace

Eventually the buffer reach it's limit and fluientd start logging errors of type BufferOverflowError. Metrics don't show anything noticeable, there are no spikes in communication between fluentd and ES before the error. After the error there in no output traffic and the buffer starts to get filled. There is no abnormal usage of cpu by the pods, neither before nor after the error occurs.

uken / fluent-plugin-elasticsearch