dmnfortytwo opened this issue 6 years ago
More info: children 8-15 were added after ES maintenance. /var/lib/td-agent/buffers/output_elasticsearch is used for the child without the in_forward plugin (for local logs).
Please fill in the issue template. The ES plugin has two maintenance lines -- one for Fluentd v0.12 and one for v1.
For v1, you shouldn't use fluent-plugin-multiprocess. Instead, use the workers parameter like this:
<system>
workers 3
</system>
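As a fuller illustration, a minimal multi-worker setup might look like this (a sketch only; the port is illustrative, and it assumes a Fluentd v1 where in_forward is multi-worker ready):

<system>
  workers 3
</system>
# with multi-process workers, in_forward shares its listening socket
# across workers, so a single port serves all of them
<source>
  @type forward
  port 24224
</source>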
Sorry.
2018-04-29 10:11:02 +0000 [info]: starting fluentd-1.0.2 pid=16683 ruby="2.4.2"
2018-04-29 10:11:02 +0000 [info]: spawn command to main: cmdline=["/opt/td-agent/embedded/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/sbin/td-agent",
"--log", "/var/log/td-agent/td-agent.log", "--under-supervisor"]
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '2.4.0'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-flowcounter' version '1.3.0'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-kafka' version '0.6.5'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-multiprocess' version '0.2.2'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-prometheus' version '1.0.1'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.0.1'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-s3' version '1.1.0'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-td' version '1.0.0'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-td-monitoring' version '0.2.3'
2018-04-29 10:11:02 +0000 [info]: gem 'fluent-plugin-webhdfs' version '1.2.2'
2018-04-29 10:11:02 +0000 [info]: gem 'fluentd' version '1.0.2'
I've had some issues with workers, but I don't remember the details exactly. If I'm not mistaken, it did not work with the in_forward plugin.
After Fluentd complains with 2018-04-29 08:59:33 +0000 [error]: #0 unexpected error on reading data host="haproxy-gce-be-2" port=36334 error_class=NoMethodError error="undefined method `[]=' for nil:NilClass", the ES plugin cannot continue to process events.
Why? Could you explain this error?
This error is generated by Fluentd core, which handles buffers.
Is it expected behavior? What shall I do here?
I have no idea, but it seems to be an unknown issue in Fluentd core.
Shall I create an issue in the core repo? @repeatedly, could you kindly take a look?
@cosmo0920 Might this be related to the recently merged changes where this plugin was unable to flush buffers if deserialization from msgpack failed? We fixed that case by introducing an error handler and pushing off those messages.
Hmm, I see. I've uplifted @jcantrill's patch in #414. It might reduce this issue.
I was actually referring to #402, but #414 will help this plugin during retries.
Any updates here? I have a few GBs of logs stuck in buffers, and this makes me cry.
I've just realized that fluentd pushes part of the buffer after a restart. I've written a script which restarts fluentd in an endless loop, and it slowly pushes the buffer into Elasticsearch (a rough sketch of the script follows the output below):
root@harvester-gce-be-1:~# ./r.sh 2>&1 | tee r.log
312680 /var/lib/td-agent/buffers/output_elasticsearch-1
td-agent stop/waiting
td-agent start/running, process 7355
304848 /var/lib/td-agent/buffers/output_elasticsearch-1
td-agent stop/waiting
td-agent start/running, process 7948
300888 /var/lib/td-agent/buffers/output_elasticsearch-1
td-agent stop/waiting
td-agent start/running, process 8473
296916 /var/lib/td-agent/buffers/output_elasticsearch-1
td-agent stop/waiting
td-agent start/running, process 9048
293008 /var/lib/td-agent/buffers/output_elasticsearch-1
td-agent stop/waiting
td-agent start/running, process 9698
289108 /var/lib/td-agent/buffers/output_elasticsearch-1
td-agent stop/waiting
td-agent start/running, process 10242
285132 /var/lib/td-agent/buffers/output_elasticsearch-1
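The script itself wasn't posted; here is a minimal sketch of what r.sh might look like, assuming an Upstart-managed td-agent and the buffer path from the output above:

#!/bin/sh
# print the buffer directory size (in KB) and its path, then restart
# td-agent, forever; each restart flushes a few more chunks
while true; do
  du -s /var/lib/td-agent/buffers/output_elasticsearch-1
  service td-agent restart
  sleep 60   # give the new process time to push part of the buffer
done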
starting fluentd-1.2.2 pid=50 ruby="2.4.4"
spawn command to main: cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-c", "/home/fluentd/fluent.conf", "-p", "/home/fluentd/plugins", "--under-supervisor"]
gem 'fluent-plugin-cloudwatch-ingest' version '1.7.0.rc4'
gem 'fluent-plugin-elasticsearch' version '2.11.1'
gem 'fluent-plugin-flowcounter' version '1.3.0'
gem 'fluent-plugin-multiprocess' version '0.2.2'
gem 'fluent-plugin-prometheus' version '1.0.1'
gem 'fluent-plugin-remote_cef' version '1.0.0'
gem 'fluentd' version '1.2.2'
gem 'fluentd' version '0.14.25'
I have the same issue when using fluentd as an aggregator before sending to ES:
# plugin for monitoring fluent-agent itself with prometheus
<source>
@type prometheus
port 24220
</source>
# input plugin that collects metrics from MonitorAgent
<source>
@type prometheus_monitor
</source>
# input plugin that collects metrics for output plugin
<source>
@type prometheus_output_monitor
</source>
# input plugin that collects metrics for in_tail plugin
<source>
@type prometheus_tail_monitor
</source>
<source>
@type forward
@id forward
@label @mainstream
port 24224
</source>
<source>
@type http
@id http
@label @mainstream
port 24225
</source>
<match **>
@type stdout
</match>
<label @mainstream>
<match mesos.slave>
@type copy
<store>
@type elasticsearch
host host.example.com
port 9200
time_key_format %Y%m%dT%H%M%S.%N%z
logstash_format true
time_key time
request_timeout 30s
logstash_prefix mesos_slave
<buffer>
@type file
flush_at_shutdown false
flush_interval 45s
chunk_limit_size 8M
queued_chunks_limit_size 256
path /fluentd/log/backlog/mesos-slave.log
</buffer>
</store>
</match>
<match system.auth>
@type copy
<store>
@type elasticsearch
host host.example.com
port 9200
time_key_format %Y%m%dT%H%M%S.%N%z
logstash_format true
time_key time
request_timeout 30s
logstash_prefix auth
<buffer>
@type file
flush_at_shutdown false
flush_interval 45s
chunk_limit_size 8M
queued_chunks_limit_size 256
path /fluentd/log/backlog/auth.log
</buffer>
</store>
</match>
<match system.syslog>
@type copy
<store>
@type elasticsearch
host host.example.com
port 9200
time_key_format %Y%m%dT%H%M%S.%N%z
logstash_format true
time_key time
request_timeout 30s
logstash_prefix syslog
<buffer>
@type file
flush_at_shutdown false
flush_interval 45s
chunk_limit_size 8M
queued_chunks_limit_size 256
path /fluentd/log/backlog/syslog
</buffer>
</store>
</match>
<match docker.**>
@type copy
<store>
@type elasticsearch
host host.example.com
port 9200
time_key_format %Y%m%dT%H%M%S.%N%z
logstash_format true
time_key time
request_timeout 30s
logstash_prefix docker
<buffer>
@type file
flush_at_shutdown false
flush_interval 45s
chunk_limit_size 8M
queued_chunks_limit_size 256
path /fluentd/log/backlog/docker.log
</buffer>
</store>
</match>
<match mesos.master>
@type copy
<store>
@type elasticsearch
host host.example.com
port 9200
time_key_format %Y%m%dT%H%M%S.%N%z
logstash_format true
time_key time
request_timeout 30s
logstash_prefix mesos_master
<buffer>
@type file
flush_at_shutdown false
flush_interval 45s
chunk_limit_size 8M
queued_chunks_limit_size 256
path /fluentd/log/backlog/mesos-master.log
</buffer>
</store>
</match>
<match marathon.service>
@type copy
<store>
@type elasticsearch
host host.example.com
port 9200
time_key_format %Y%m%dT%H%M%S.%N%z
logstash_format true
time_key time
request_timeout 30s
logstash_prefix marathon
<buffer>
@type file
flush_at_shutdown false
flush_interval 45s
chunk_limit_size 8M
queued_chunks_limit_size 256
path /fluentd/log/backlog/marathon.log
</buffer>
</store>
</match>
<match **>
@type elasticsearch
host host.example.com
port 9200
time_key_format %Y%m%dT%H%M%S.%N%z
logstash_format true
time_key time
request_timeout 30s
logstash_prefix catchall
<buffer>
@type file
flush_at_shutdown false
flush_interval 45s
chunk_limit_size 8M
queued_chunks_limit_size 256
path /fluentd/log/backlog/catchall
</buffer>
</match>
</label>
Some of the buffered logs cluster around the same times, and during those times neither ES nor fluentd was stopped:
fluentd-aggregator-active/backlog# for i in * ; do echo $i ; ls -l $i | awk -F' ' '{print $8}' | sort | uniq -c ; done
auth.log
1
2 13:18
catchall
1
docker.log
1
2 13:18
marathon.log
1
2 12:02
2 12:12
2 12:21
2 12:52
2 13:18
44 15:31
46 15:35
46 15:37
23 15:39
23 15:40
42 15:41
54 15:44
2 15:49
32 17:07
7 17:12
29 17:13
36 17:40
10 17:42
4 17:46
58 17:47
20 17:51
62 17:58
22 18:03
4 18:08
11 18:17
45 18:18
16 18:21
56 18:28
30 18:48
36 18:49
58 18:51
7 18:54
51 18:55
mesos-master.log
1
2 12:02
2 12:12
2 12:21
2 12:52
2 13:18
26 15:33
32 15:34
22 15:38
38 15:39
13 15:40
43 15:41
62 15:42
36 15:43
54 15:49
2 15:51
14 15:52
5 15:54
55 15:55
44 15:59
35 16:01
20 16:02
23 16:03
28 16:07
24 16:10
58 16:12
22 16:15
36 16:16
7 16:18
22 16:19
33 16:20
32 16:24
38 16:29
16 16:31
24 16:34
42 16:45
24 16:49
12 17:00
18 17:01
19 17:03
7 17:04
36 17:05
34 17:11
12 17:37
16 17:38
35 17:43
25 17:44
20 17:56
28 17:57
6 17:59
54 18:00
15 18:11
7 18:12
8 18:20
1 18:25
29 18:26
1 18:28
15 18:29
6 18:52
32 18:53
16 18:56
46 18:57
17 19:01
45 19:02
14 19:28
20 19:29
42 19:30
3 19:31
49 19:32
62 19:34
15 19:38
45 19:39
12 19:40
50 19:41
60 19:43
60 19:45
15 19:46
45 19:47
60 19:49
58 19:50
58 19:52
11 19:53
47 19:54
60 19:55
60 19:57
17 19:58
45 19:59
mesos-slave.log
1
2 12:12
2 12:52
2 13:18
14 15:30
20 15:34
2 15:39
2 15:43
2 15:44
20 15:53
22 15:56
24 15:58
6 16:17
22 16:22
38 16:36
9 16:41
33 16:42
36 16:47
17 16:53
24 16:54
1 16:55
2 17:48
3 18:15
3 18:16
5 18:24
25 18:25
22 18:27
2 19:59
syslog
1
2 12:03
2 12:22
2 12:52
2 13:18
8 15:27
22 15:28
2 15:35
2 15:39
2 15:43
2 15:44
2 15:50
7 16:04
27 16:05
6 16:06
2 16:22
8 16:25
26 16:26
20 16:30
22 16:31
8 16:33
15 16:38
35 16:39
26 16:43
28 16:50
4 16:51
2 17:09
2 17:48
2 19:02
2 19:59
In my case, restarting fluentd doesn't do anything with these already-buffered files.
@chicofranchico try this: https://github.com/fluent/fluentd/issues/2131#issuecomment-422876949
What's the difference between buffer.b* and buffer.q* files?
I'm not good at reading Ruby code, but as I understand it, b* chunks are a kind of input (staged) buffer and q* chunks are ready-to-ship (queued) chunks.
UPD: it looks like stalled b* chunks appear when the fluentd process is killed for some reason and the regular data flow is interrupted.
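For reference, the workaround linked above boils down to renaming staged chunk files so fluentd enqueues them on the next start. A minimal sketch, assuming the buffer path used earlier in this thread; stop td-agent before touching the files:

#!/bin/bash
# rename staged chunks (buffer.b*) to queued chunks (buffer.q*),
# including their .meta companions, so they are flushed on restart
for f in /var/lib/td-agent/buffers/output_elasticsearch/buffer.b*; do
  mv "$f" "${f/buffer.b/buffer.q}"
done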
This keeps happening to us quite often. Renaming buffer b* files to q* doesn't help; we basically have to delete the buffer files every time this happens.
I have a configuration with multiple workers:
The children are similar, like this:
After ES maintenance there was some data in the buffers, and after a while some workers pushed it to ES and others did not:
There is nothing suspicious in the logs except two things:
The warning about chunk size was because I played with chunk_limit_size, but I've set it back to the previous value and restarted td-agent. Nothing changed, though: there are no error messages at the moment and the buffer is still not empty.
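One way to see which workers actually drained (a sketch; the per-child directory names follow the output_elasticsearch* pattern mentioned at the top of the thread):

# list the on-disk size of every per-worker buffer directory;
# workers that pushed their data should trend toward zero
du -s /var/lib/td-agent/buffers/output_elasticsearch*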