tagomoris / fluent-plugin-secure-forward

Other
140 stars 30 forks source link

write_objects fails with `no one nodes with valid SSL session` #21

Closed stanhu closed 8 years ago

stanhu commented 9 years ago

Using td-agent v2.2.0 and secure-forward v0.3.2, for some reason our server got disconnected and secure-forward never recovered. I kept seeing "no one nodes with valid ssl session" error messages. This looks exactly like issue #7 again:

2015-06-22 19:08:12 +0000 [warn]: disconnected from server.redacted.com
2015-06-22 19:08:15 +0000 [warn]: dead connection found: server.redacted.com, reconnecting...
2015-06-22 19:08:20 +0000 [warn]: recovered connection to dead node: server.redacted.com
2015-06-22 19:09:11 +0000 [warn]: disconnected from server.redacted.com
2015-06-22 19:09:15 +0000 [warn]: dead connection found: server.redacted.com, reconnecting...
2015-06-22 19:09:20 +0000 [warn]: recovered connection to dead node: server.redacted.com
2015-06-22 19:23:55 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-06-22 19:09:16 +0000 error_class="Errno::ETIMEDOUT" error="Connection timed out" plugin_id="object:3ff55d39b574"
  2015-06-22 19:23:55 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/2.1.0/openssl/buffering.rb:326:in `syswrite'
  2015-06-22 19:23:55 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/2.1.0/openssl/buffering.rb:326:in `do_write'
  2015-06-22 19:23:55 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/2.1.0/openssl/buffering.rb:344:in `write'
  2015-06-22 19:23:55 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/plugin/buf_memory.rb:55:in `write_to'
2015-06-22 19:23:55 +0000 [warn]: disconnected from server2.redacted.com
  2015-06-22 19:23:55 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-secure-forward-0.3.2/lib/fluent/plugin/out_secure_forward.rb:274:in `send_data'
  2015-06-22 19:23:55 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-secure-forward-0.3.2/lib/fluent/plugin/out_secure_forward.rb:238:in `write_objects'
  2015-06-22 19:23:55 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:463:in `write'
  2015-06-22 19:23:55 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/buffer.rb:325:in `write_chunk'
  2015-06-22 19:23:55 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/buffer.rb:304:in `pop'
  2015-06-22 19:23:55 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:321:in `try_flush'
  2015-06-22 19:23:55 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:140:in `run'
2015-06-22 19:23:55 +0000 [warn]: retry succeeded. plugin_id="object:3ff55d39b574"
2015-06-22 19:24:09 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-06-22 19:24:10 +0000 error_class="Errno::ECONNRESET" error="Connection reset by peer" plugin_id="object:3ff55d39b574"
  2015-06-22 19:24:09 +0000 [warn]: suppressed same stacktrace
2015-06-22 19:24:09 +0000 [warn]: disconnected from server.redacted.com
2015-06-22 19:24:11 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-06-22 19:24:12 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:3ff55d39b574"
  2015-06-22 19:24:11 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-secure-forward-0.3.2/lib/fluent/plugin/out_secure_forward.rb:233:in `write_objects'
  2015-06-22 19:24:11 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:463:in `write'
  2015-06-22 19:24:11 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/buffer.rb:325:in `write_chunk'
  2015-06-22 19:24:11 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/buffer.rb:304:in `pop'
  2015-06-22 19:24:11 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:321:in `try_flush'
  2015-06-22 19:24:11 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:140:in `run'
2015-06-22 19:24:13 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-06-22 19:24:15 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:3ff55d39b574"
stanhu commented 9 years ago

If I just look for connection messages, it seems that either td-agent or the plugin isn't even attempting to reconnect after the first connection:

$ /var/log/td-agent sudo zcat td-agent.log.2.gz | grep conn
2015-06-22 19:08:12 +0000 [warn]: disconnected from server.redacted.com
2015-06-22 19:08:15 +0000 [warn]: dead connection found: server.redacted.com, reconnecting...
2015-06-22 19:08:20 +0000 [warn]: recovered connection to dead node: server.redacted.com
2015-06-22 19:09:11 +0000 [warn]: disconnected from server.redacted.com
2015-06-22 19:09:15 +0000 [warn]: dead connection found: server.redacted.com, reconnecting...
2015-06-22 19:09:20 +0000 [warn]: recovered connection to dead node: server.redacted.com
2015-06-22 19:23:55 +0000 [warn]: disconnected from server2.redacted.com
2015-06-22 19:24:09 +0000 [warn]: disconnected from server.redacted.com
tagomoris commented 9 years ago

Can you see debug/trace logs? These can be turned on by using -vv option of Fluentd, specified by TD_AGENT_ARGS or DAEMON_ARGS for td-agent.

tagomoris commented 8 years ago

I'll close this at once. Please open another issue for further problems.

stanhu commented 8 years ago

I'm seeing this issue again. I have two fluentd aggregators, each receiving 50% of the traffic. When I restart one of them, all my clients get into this state and stop sending data. I disabled keepalive because this was causing an inordinate amount of connections/reconnections and causing lots of sockets on the server to get into CLOSE_WAIT.

Here is the log messages from one of the clients that get stuck:

2015-12-19 17:29:15 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 17:28:40 +0000 error_class="Errno::ECONNRESET" error="Connection reset by peer" plugin_id="object:3f96dc42eafc"
  2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/2.1.0/openssl/buffering.rb:326:in `syswrite'
  2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/2.1.0/openssl/buffering.rb:326:in `do_write'
  2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/2.1.0/openssl/buffering.rb:344:in `write'
  2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/buf_memory.rb:55:in `write_to'
  2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-secure-forward-0.3.2/lib/fluent/plugin/out_secure_forward.rb:274:in `send_data'
  2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-secure-forward-0.3.2/lib/fluent/plugin/out_secure_forward.rb:238:in `write_objects'
  2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:463:in `write'
  2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/buffer.rb:325:in `write_chunk'
  2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/buffer.rb:304:in `pop'
  2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:321:in `try_flush'
  2015-12-19 17:29:15 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:140:in `run'
2015-12-19 17:29:15 +0000 [warn]: retry succeeded. plugin_id="object:3f96dc42eafc"
2015-12-19 17:29:15 +0000 [warn]: disconnected from x.x.x.x
2015-12-19 21:55:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 21:52:20 +0000 error_class="Errno::ECONNRESET" error="Connection reset by peer" plugin_id="object:3f96dc42eafc"
  2015-12-19 21:55:40 +0000 [warn]: suppressed same stacktrace
2015-12-19 21:55:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 21:52:40 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:3f96dc42eafc"
  2015-12-19 21:55:40 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-secure-forward-0.3.2/lib/fluent/plugin/out_secure_forward.rb:233:in `write_objects'
  2015-12-19 21:55:40 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:463:in `write'
  2015-12-19 21:55:40 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/buffer.rb:325:in `write_chunk'
  2015-12-19 21:55:40 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/buffer.rb:304:in `pop'
  2015-12-19 21:55:40 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:321:in `try_flush'
  2015-12-19 21:55:40 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:140:in `run'
2015-12-19 21:55:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 21:53:18 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:3f96dc42eafc"
  2015-12-19 21:55:40 +0000 [warn]: suppressed same stacktrace
2015-12-19 21:55:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 21:54:33 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:3f96dc42eafc"
  2015-12-19 21:55:40 +0000 [warn]: suppressed same stacktrace
2015-12-19 21:55:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 21:57:31 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:3f96dc42eafc"
  2015-12-19 21:55:40 +0000 [warn]: suppressed same stacktrace
2015-12-19 21:55:40 +0000 [warn]: disconnected from x2.x.x.x
2015-12-19 21:57:31 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-12-19 22:02:22 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:3f96dc42eafc"
  2015-12-19 21:57:31 +0000 [warn]: suppressed same stacktrace
stanhu commented 8 years ago

Isn't it possible that this while loop never ends? If either SystemCallError or EOFError hit, shouldn't @writing be set to false so that the socket will be shutdown?

https://github.com/tagomoris/fluent-plugin-secure-forward/blob/master/lib/fluent/plugin/output_node.rb#L343