uken / fluent-plugin-elasticsearch

fluentd does not flush buffer for unknown reason #413

Open dmnfortytwo opened 6 years ago

dmnfortytwo commented 6 years ago

I have a configuration with multiple workers:

<source>
  @type multiprocess
  <process>
    cmdline -c /etc/td-agent/child_0.conf --log /var/log/td-agent/child_0.log
    sleep_before_start 1s
    sleep_before_shutdown 5s
  </process>
  <process>
    cmdline -c /etc/td-agent/child_1.conf --log /var/log/td-agent/child_1.log
    sleep_before_start 1s
    sleep_before_shutdown 5s
  </process>
  <process>
    cmdline -c /etc/td-agent/child_2.conf --log /var/log/td-agent/child_2.log
    sleep_before_start 1s
    sleep_before_shutdown 5s
  </process>
...
</source>

The child configurations are similar to this:

<source>
  @type forward
  tag forwarded
  port 34221
  bind 0.0.0.0
  source_address_key forwarder_ip
  source_hostname_key forwarder_hostname
  <transport tls>
    ca_cert_path /etc/td-agent/server.pem
    ca_private_key_path /etc/td-agent/server.key
  </transport>
  <security>
    shared_key key
    self_hostname harvester-gce-be-1
  </security>
</source>
<filter **>
  @type record_transformer
  enable_ruby true
  <record>
    pushed_by harvester-gce-be-1:1
    target_index ${if record['project']; sprintf('%s--%s',record['project'],time.strftime('%Y.%m.%d')); else sprintf('fallback--%s',time.strftime('%Y.%m.%d')); end}
    severity ${record["severity"].downcase if record["severity"]}
  </record>
</filter>
<match **>
  @type copy
  <store>
    @type elasticsearch
    host localhost
    port 9200
    target_type_key @target_key
    type_name fluentd
    target_index_key target_index
    logstash_format true
    logstash_prefix invalid--
    time_key @timestamp
    include_timestamp true
    reconnect_on_error true
    reload_on_failure true
    reload_connections false
    request_timeout 120s
    <buffer>
      @type file
      flush_interval 10s
      retry_type periodic
      retry_forever true
      retry_wait 10s
      chunk_limit_size 16Mb
      queue_limit_length 4096
      total_limit_size 60Gb
      path /var/lib/td-agent/buffers/output_elasticsearch-1
    </buffer>
  </store>
</match>

After the ES maintenance there was some data in the buffers; after a while some workers pushed it to ES and others did not:

220K    /var/lib/td-agent/buffers/output_elasticsearch
329M    /var/lib/td-agent/buffers/output_elasticsearch-1
624K    /var/lib/td-agent/buffers/output_elasticsearch-10
184K    /var/lib/td-agent/buffers/output_elasticsearch-11
52K     /var/lib/td-agent/buffers/output_elasticsearch-12
2.2M    /var/lib/td-agent/buffers/output_elasticsearch-13
1.5M    /var/lib/td-agent/buffers/output_elasticsearch-14
4.2M    /var/lib/td-agent/buffers/output_elasticsearch-15
166M    /var/lib/td-agent/buffers/output_elasticsearch-2
336M    /var/lib/td-agent/buffers/output_elasticsearch-3
236M    /var/lib/td-agent/buffers/output_elasticsearch-4
291M    /var/lib/td-agent/buffers/output_elasticsearch-5
244M    /var/lib/td-agent/buffers/output_elasticsearch-6
346M    /var/lib/td-agent/buffers/output_elasticsearch-7
216K    /var/lib/td-agent/buffers/output_elasticsearch-8
1.6M    /var/lib/td-agent/buffers/output_elasticsearch-9
2.0G    total
root@harvester-gce-be-1:/var/log/td-agent# date; ls -la /var/lib/td-agent/buffers/output_elasticsearch-4 | head
Sun Apr 29 10:24:26 UTC 2018
total 240552
drwxr-xr-x  2 td-agent td-agent 1228800 Apr 29 10:24 .
drwxr-x--- 19 td-agent adm         4096 Apr 28 18:23 ..
-rw-r--r--  1 td-agent td-agent 4020703 Apr 28 17:35 buffer.b56aec0b33cfe06d6a058c9e15a8ec234.log
-rw-r--r--  1 td-agent td-agent      84 Apr 28 17:35 buffer.b56aec0b33cfe06d6a058c9e15a8ec234.log.meta
-rw-r--r--  1 td-agent td-agent 4056767 Apr 28 17:37 buffer.b56aec1403175b6e7606b64e393b75ab9.log
-rw-r--r--  1 td-agent td-agent      84 Apr 28 17:37 buffer.b56aec1403175b6e7606b64e393b75ab9.log.meta
-rw-r--r--  1 td-agent td-agent 4000618 Apr 28 17:40 buffer.b56aec1d3e0db793dcbfd939a9120a90f.log
-rw-r--r--  1 td-agent td-agent      84 Apr 28 17:40 buffer.b56aec1d3e0db793dcbfd939a9120a90f.log.meta
-rw-r--r--  1 td-agent td-agent 3990132 Apr 28 17:41 buffer.b56aec217a37b5affa8209f739833d5e9.log

There is nothing suspicious in the logs except two things:

2018-04-29 08:34:29 +0000 [warn]: #0 chunk bytes limit exceeds for an emitted event stream: 17434337bytes
2018-04-29 08:58:31 +0000 [warn]: #0 chunk bytes limit exceeds for an emitted event stream: 32323183bytes
2018-04-29 08:59:33 +0000 [error]: #0 unexpected error on reading data host="haproxy-gce-be-2" port=36334 error_class=NoMethodError error="undefined method `[]=' for nil:NilClass"

The warning about chunk size appeared because I had been playing with chunk_limit_size, but I've set it back to the previous value and restarted td-agent. Nothing changed though: there are no error messages at the moment and the buffer is still not empty.

dmnfortytwo commented 6 years ago

More info: children 8-15 were added after the ES maintenance. /var/lib/td-agent/buffers/output_elasticsearch is used by the child without the in_forward plugin (for local logs).

cosmo0920 commented 6 years ago

Please fill in the issue template. The ES plugin has two maintenance lines -- one for Fluentd v0.12 and one for v1.

cosmo0920 commented 6 years ago

For v1, you shouldn't use fluent-plugin-multiprocess. Instead, you should use the workers parameter, like this:

<system>
  workers 3
</system>
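
For reference, a minimal sketch of how the per-child configs above could be folded into one multi-worker config. The port, paths, and worker count here are only illustrative (not taken from the original setup), and whether in_forward can share a single listener across workers depends on the Fluentd and plugin versions in use:

<system>
  workers 3
</system>

<source>
  # one shared in_forward listener instead of one port per child process
  @type forward
  port 34221
  bind 0.0.0.0
</source>

<match **>
  @type elasticsearch
  host localhost
  port 9200
  <buffer>
    @type file
    flush_interval 10s
    # file-buffer path; check how your Fluentd version separates it per worker
    path /var/lib/td-agent/buffers/output_elasticsearch
  </buffer>
</match>
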
dmnfortytwo commented 6 years ago

Sorry.

2018-04-29 10:11:02 +0000 [info]: starting fluentd-1.0.2 pid=16683 ruby="2.4.2"
2018-04-29 10:11:02 +0000 [info]: spawn command to main:  cmdline=["/opt/td-agent/embedded/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/sbin/td-agent",
 "--log", "/var/log/td-agent/td-agent.log", "--under-supervisor"]
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '2.4.0'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-flowcounter' version '1.3.0'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-kafka' version '0.6.5'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-multiprocess' version '0.2.2'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-prometheus' version '1.0.1'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.0.1'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-s3' version '1.1.0'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-td' version '1.0.0'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-td-monitoring' version '0.2.3'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-webhdfs' version '1.2.2'
2018-04-29 10:11:02 +0000 [info]: gem 'fluentd' version '1.0.2'
dmnfortytwo commented 6 years ago

I ran into some issues with workers, but I don't remember the details exactly. If I'm not mistaken, it did not work with the in_forward plugin.

cosmo0920 commented 6 years ago

After Fluentd complains 2018-04-29 08:59:33 +0000 [error]: #0 unexpected error on reading data host="haproxy-gce-be-2" port=36334 error_class=NoMethodError error="undefined method `[]=' for nil:NilClass", the ES plugin cannot continue to process events.

dmnfortytwo commented 6 years ago

Why? Could you explain this error?

cosmo0920 commented 6 years ago

This error is generated by the Fluentd core, which handles buffers.

dmnfortytwo commented 6 years ago

Is it expected behavior? What shall I do here?

cosmo0920 commented 6 years ago

I have no idea, but it seems to be an unknown issue in Fluentd core, I guess.

dmnfortytwo commented 6 years ago

Shall I create an issue in the core repo? @repeatedly, could you kindly take a look?

jcantrill commented 6 years ago

@cosmo0920 Might this be related to the recently merged changes where this plugin was unable to flush buffers if deserialization from msgpack failed? We fixed that case by introducing an error handler and pushing off those messages.

cosmo0920 commented 6 years ago

Hmm..., I see. I've uplifted @jcantrill's patch in #414. It might reduce this issue.

jcantrill commented 6 years ago

I was actually referring to #402, but #414 will help this plugin during retries.

dmnfortytwo commented 6 years ago

Any updates here? I have a few GBs of logs stuck in buffers and this makes me cry.

dmnfortytwo commented 6 years ago

I've just realized that fluentd pushes part of the buffer after a restart. I've written a script which restarts fluentd in an endless loop, and it slowly pushes the buffer into Elasticsearch:

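For reference, a minimal sketch of such a restart loop (the exact service commands, path, and sleep interval are illustrative, not the actual r.sh):

#!/bin/bash
# sketch of a restart loop that nudges td-agent to flush the stalled file buffer
while true; do
  # print the remaining buffer size, then bounce td-agent
  du -s /var/lib/td-agent/buffers/output_elasticsearch-1
  service td-agent stop
  service td-agent start
  # give the worker time to flush a few chunks before the next restart
  sleep 120
done
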
root@harvester-gce-be-1:~# ./r.sh 2>&1 | tee r.log
312680  /var/lib/td-agent/buffers/output_elasticsearch-1
td-agent stop/waiting
td-agent start/running, process 7355
304848  /var/lib/td-agent/buffers/output_elasticsearch-1
td-agent stop/waiting
td-agent start/running, process 7948
300888  /var/lib/td-agent/buffers/output_elasticsearch-1
td-agent stop/waiting
td-agent start/running, process 8473
296916  /var/lib/td-agent/buffers/output_elasticsearch-1
td-agent stop/waiting
td-agent start/running, process 9048
293008  /var/lib/td-agent/buffers/output_elasticsearch-1
td-agent stop/waiting
td-agent start/running, process 9698
289108  /var/lib/td-agent/buffers/output_elasticsearch-1
td-agent stop/waiting
td-agent start/running, process 10242
285132  /var/lib/td-agent/buffers/output_elasticsearch-1
chicofranchico commented 6 years ago

starting fluentd-1.2.2 pid=50 ruby="2.4.4"
spawn command to main: cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-c", "/home/fluentd/fluent.conf", "-p", "/home/fluentd/plugins", "--under-supervisor"]
gem 'fluent-plugin-cloudwatch-ingest' version '1.7.0.rc4'
gem 'fluent-plugin-elasticsearch' version '2.11.1'
gem 'fluent-plugin-flowcounter' version '1.3.0'
gem 'fluent-plugin-multiprocess' version '0.2.2'
gem 'fluent-plugin-prometheus' version '1.0.1'
gem 'fluent-plugin-remote_cef' version '1.0.0'
gem 'fluentd' version '1.2.2'
gem 'fluentd' version '0.14.25'

I have the same issue when using fluentd as an aggregator before sending to ES:

#plugin for monitoring fluent-agent itself with prometheus
<source>
  @type prometheus
  port 24220
</source>
# input plugin that collects metrics from MonitorAgent
<source>
  @type prometheus_monitor
</source>
# input plugin that collects metrics for output plugin
<source>
  @type prometheus_output_monitor
</source>
# input plugin that collects metrics for in_tail plugin
<source>
  @type prometheus_tail_monitor
</source>
<source>
  @type  forward
  @id    forward
  @label @mainstream
  port  24224
</source>
<source>
  @type  http
  @id    http
  @label @mainstream
  port  24225
</source>
<match **>
  @type stdout
</match>
<label @mainstream>
    <match mesos.slave>
      @type copy
      <store>
        @type elasticsearch
        host host.example.com
        port 9200
        time_key_format   %Y%m%dT%H%M%S.%N%z
        logstash_format true
        time_key time
        request_timeout 30s
        logstash_prefix mesos_slave
        <buffer>
          @type file
          flush_at_shutdown false
          flush_interval 45s
          chunk_limit_size 8M
          queued_chunks_limit_size 256
          path /fluentd/log/backlog/mesos-slave.log
        </buffer>
      </store>
    </match>
    <match system.auth>
      @type copy
      <store>
        @type elasticsearch
        host host.example.com
        port 9200
        time_key_format   %Y%m%dT%H%M%S.%N%z
        logstash_format true
        time_key time
        request_timeout 30s
        logstash_prefix auth
        <buffer>
          @type file
          flush_at_shutdown false
          flush_interval 45s
          chunk_limit_size 8M
          queued_chunks_limit_size 256
          path /fluentd/log/backlog/auth.log
        </buffer>
      </store>
    </match>
    <match system.syslog>
      @type copy
      <store>
        @type elasticsearch
        host host.example.com
        port 9200
        time_key_format   %Y%m%dT%H%M%S.%N%z
        logstash_format true
        time_key time
        request_timeout 30s
        logstash_prefix syslog
        <buffer>
          @type file
          flush_at_shutdown false
          flush_interval 45s
          chunk_limit_size 8M
          queued_chunks_limit_size 256
          path /fluentd/log/backlog/syslog
        </buffer>
      </store>
    </match>
    <match docker.**>
      @type copy
      <store>
        @type elasticsearch
        host host.example.com
        port 9200
        time_key_format   %Y%m%dT%H%M%S.%N%z
        logstash_format true
        time_key time
        request_timeout 30s
        logstash_prefix docker
        <buffer>
          @type file
          flush_at_shutdown false
          flush_interval 45s
          chunk_limit_size 8M
          queued_chunks_limit_size 256
          path /fluentd/log/backlog/docker.log
        </buffer>
      </store>
    </match>
    <match mesos.master>
      @type copy
      <store>
        @type elasticsearch
        host host.example.com
        port 9200
        time_key_format   %Y%m%dT%H%M%S.%N%z
        logstash_format true
        time_key time
        request_timeout 30s
        logstash_prefix mesos_master
        <buffer>
          @type file
          flush_at_shutdown false
          flush_interval 45s
          chunk_limit_size 8M
          queued_chunks_limit_size 256
          path /fluentd/log/backlog/mesos-master.log
        </buffer>
      </store>
    </match>
    <match marathon.service>
      @type copy
      <store>
        @type elasticsearch
        host host.example.com
        port 9200
        time_key_format   %Y%m%dT%H%M%S.%N%z
        logstash_format true
        time_key time
        request_timeout 30s
        logstash_prefix marathon
        <buffer>
          @type file
          flush_at_shutdown false
          flush_interval 45s
          chunk_limit_size 8M
          queued_chunks_limit_size 256
          path /fluentd/log/backlog/marathon.log
        </buffer>
      </store>
    </match>
    <match **>
      @type elasticsearch
      host host.example.com
      port 9200
      time_key_format   %Y%m%dT%H%M%S.%N%z
      logstash_format true
      time_key time
      request_timeout 30s
      logstash_prefix catchall
      <buffer>
        @type file
        flush_at_shutdown false
        flush_interval 45s
        chunk_limit_size 8M
        queued_chunks_limit_size 256
        path /fluentd/log/backlog/catchall
      </buffer>
    </match>
</label>

Some of the buffered logs are from around the same time, and during that time neither ES nor fluentd was stopped:

fluentd-aggregator-active/backlog# for i in * ; do echo $i ; ls -l $i | awk -F' ' '{print $8}' | sort | uniq -c ; done
auth.log
      1
      2 13:18
catchall
      1
docker.log
      1
      2 13:18
marathon.log
      1
      2 12:02
      2 12:12
      2 12:21
      2 12:52
      2 13:18
     44 15:31
     46 15:35
     46 15:37
     23 15:39
     23 15:40
     42 15:41
     54 15:44
      2 15:49
     32 17:07
      7 17:12
     29 17:13
     36 17:40
     10 17:42
      4 17:46
     58 17:47
     20 17:51
     62 17:58
     22 18:03
      4 18:08
     11 18:17
     45 18:18
     16 18:21
     56 18:28
     30 18:48
     36 18:49
     58 18:51
      7 18:54
     51 18:55
mesos-master.log
      1
      2 12:02
      2 12:12
      2 12:21
      2 12:52
      2 13:18
     26 15:33
     32 15:34
     22 15:38
     38 15:39
     13 15:40
     43 15:41
     62 15:42
     36 15:43
     54 15:49
      2 15:51
     14 15:52
      5 15:54
     55 15:55
     44 15:59
     35 16:01
     20 16:02
     23 16:03
     28 16:07
     24 16:10
     58 16:12
     22 16:15
     36 16:16
      7 16:18
     22 16:19
     33 16:20
     32 16:24
     38 16:29
     16 16:31
     24 16:34
     42 16:45
     24 16:49
     12 17:00
     18 17:01
     19 17:03
      7 17:04
     36 17:05
     34 17:11
     12 17:37
     16 17:38
     35 17:43
     25 17:44
     20 17:56
     28 17:57
      6 17:59
     54 18:00
     15 18:11
      7 18:12
      8 18:20
      1 18:25
     29 18:26
      1 18:28
     15 18:29
      6 18:52
     32 18:53
     16 18:56
     46 18:57
     17 19:01
     45 19:02
     14 19:28
     20 19:29
     42 19:30
      3 19:31
     49 19:32
     62 19:34
     15 19:38
     45 19:39
     12 19:40
     50 19:41
     60 19:43
     60 19:45
     15 19:46
     45 19:47
     60 19:49
     58 19:50
     58 19:52
     11 19:53
     47 19:54
     60 19:55
     60 19:57
     17 19:58
     45 19:59
mesos-slave.log
      1
      2 12:12
      2 12:52
      2 13:18
     14 15:30
     20 15:34
      2 15:39
      2 15:43
      2 15:44
     20 15:53
     22 15:56
     24 15:58
      6 16:17
     22 16:22
     38 16:36
      9 16:41
     33 16:42
     36 16:47
     17 16:53
     24 16:54
      1 16:55
      2 17:48
      3 18:15
      3 18:16
      5 18:24
     25 18:25
     22 18:27
      2 19:59
syslog
      1
      2 12:03
      2 12:22
      2 12:52
      2 13:18
      8 15:27
     22 15:28
      2 15:35
      2 15:39
      2 15:43
      2 15:44
      2 15:50
      7 16:04
     27 16:05
      6 16:06
      2 16:22
      8 16:25
     26 16:26
     20 16:30
     22 16:31
      8 16:33
     15 16:38
     35 16:39
     26 16:43
     28 16:50
      4 16:51
      2 17:09
      2 17:48
      2 19:02
      2 19:59
chicofranchico commented 6 years ago

In my case, restarting fluentd doesn't do anything with these already-buffered files.

dmnfortytwo commented 6 years ago

@chicofranchico try this: https://github.com/fluent/fluentd/issues/2131#issuecomment-422876949

chicofranchico commented 6 years ago

What's the difference between buffer.b* and buffer.q* files?

dmnfortytwo commented 6 years ago

I'm not good at reading Ruby code, but as I understand it, b chunks are a kind of input (staged) buffer and q chunks are ready-to-ship (queued) chunks.

UPD: it looks like stalled b* chunks appear when the fluentd process is killed for some reason and the regular data flow is interrupted.
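
For reference, the workaround linked above essentially renames the stalled staged chunks (buffer.b*) to queued ones (buffer.q*) while td-agent is stopped, so they get flushed on the next start; a rough sketch (the buffer path is illustrative):

# run with td-agent stopped
cd /var/lib/td-agent/buffers/output_elasticsearch-1
# rename each staged chunk file (and its .meta file) so it is treated as queued
for f in buffer.b*; do
  mv "$f" "${f/buffer.b/buffer.q}"
done
# start td-agent again and watch whether the renamed chunks drain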

hostalp commented 2 years ago

This keeps happening to us quite often. Renaming buffer b files to q doesn't help; we basically have to delete the buffer files every time this happens.