Closed stanhu closed 8 years ago
If I just look for connection messages, it seems that either td-agent or the plugin isn't even attempting to reconnect after the first connection:
$ /var/log/td-agent sudo zcat td-agent.log.2.gz | grep conn
2015-06-22 19:08:12 +0000 [warn]: disconnected from server.redacted.com
2015-06-22 19:08:15 +0000 [warn]: dead connection found: server.redacted.com, reconnecting...
2015-06-22 19:08:20 +0000 [warn]: recovered connection to dead node: server.redacted.com
2015-06-22 19:09:11 +0000 [warn]: disconnected from server.redacted.com
2015-06-22 19:09:15 +0000 [warn]: dead connection found: server.redacted.com, reconnecting...
2015-06-22 19:09:20 +0000 [warn]: recovered connection to dead node: server.redacted.com
2015-06-22 19:23:55 +0000 [warn]: disconnected from server2.redacted.com
2015-06-22 19:24:09 +0000 [warn]: disconnected from server.redacted.com
Can you see debug/trace logs?
These can be turned on by using -vv
option of Fluentd, specified by TD_AGENT_ARGS
or DAEMON_ARGS
for td-agent.
I'll close this at once. Please open another issue for further problems.
I'm seeing this issue again. I have two fluentd aggregators, each receiving 50% of the traffic. When I restart one of them, all my clients get into this state and stop sending data. I disabled keepalive
because this was causing an inordinate amount of connections/reconnections and causing lots of sockets on the server to get into CLOSE_WAIT.
Here is the log messages from one of the clients that get stuck:
2015-12-19 17:29:15 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 17:28:40 +0000 error_class="Errno::ECONNRESET" error="Connection reset by peer" plugin_id="object:3f96dc42eafc"
2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/2.1.0/openssl/buffering.rb:326:in `syswrite'
2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/2.1.0/openssl/buffering.rb:326:in `do_write'
2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/2.1.0/openssl/buffering.rb:344:in `write'
2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/buf_memory.rb:55:in `write_to'
2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-secure-forward-0.3.2/lib/fluent/plugin/out_secure_forward.rb:274:in `send_data'
2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-secure-forward-0.3.2/lib/fluent/plugin/out_secure_forward.rb:238:in `write_objects'
2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:463:in `write'
2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/buffer.rb:325:in `write_chunk'
2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/buffer.rb:304:in `pop'
2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:321:in `try_flush'
2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:140:in `run'
2015-12-19 17:29:15 +0000 [warn]: retry succeeded. plugin_id="object:3f96dc42eafc"
2015-12-19 17:29:15 +0000 [warn]: disconnected from x.x.x.x
2015-12-19 21:55:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 21:52:20 +0000 error_class="Errno::ECONNRESET" error="Connection reset by peer" plugin_id="object:3f96dc42eafc"
2015-12-19 21:55:40 +0000 [warn]: suppressed same stacktrace
2015-12-19 21:55:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 21:52:40 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:3f96dc42eafc"
2015-12-19 21:55:40 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-secure-forward-0.3.2/lib/fluent/plugin/out_secure_forward.rb:233:in `write_objects'
2015-12-19 21:55:40 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:463:in `write'
2015-12-19 21:55:40 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/buffer.rb:325:in `write_chunk'
2015-12-19 21:55:40 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/buffer.rb:304:in `pop'
2015-12-19 21:55:40 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:321:in `try_flush'
2015-12-19 21:55:40 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:140:in `run'
2015-12-19 21:55:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 21:53:18 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:3f96dc42eafc"
2015-12-19 21:55:40 +0000 [warn]: suppressed same stacktrace
2015-12-19 21:55:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 21:54:33 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:3f96dc42eafc"
2015-12-19 21:55:40 +0000 [warn]: suppressed same stacktrace
2015-12-19 21:55:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 21:57:31 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:3f96dc42eafc"
2015-12-19 21:55:40 +0000 [warn]: suppressed same stacktrace
2015-12-19 21:55:40 +0000 [warn]: disconnected from x2.x.x.x
2015-12-19 21:57:31 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 22:02:22 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:3f96dc42eafc"
2015-12-19 21:57:31 +0000 [warn]: suppressed same stacktrace
Isn't it possible that this while
loop never ends? If either SystemCallError
or EOFError
hit, shouldn't @writing
be set to false so that the socket will be shutdown?
Using td-agent v2.2.0 and secure-forward v0.3.2, for some reason our server got disconnected and secure-forward never recovered. I kept seeing "no one nodes with valid ssl session" error messages. This looks exactly like issue #7 again: