out_secure_forward blocks when the receiver node restarts

kiyoto commented 10 years ago

Consider the following setup with td-agent 2.1 + secure_forward v0.2.0

 -------------------------      ----------------------------
|(A) secure forward sender| -->|(B) secure forward receiver |
 -------------------------      ----------------------------

(A)'s td-agent.conf looks like this:

<source>
  type forward
</source>
<match secure.**>
  type secure_forward
  shared_key SHARED_KEY
  self_hostname ${hostname}
  <server>
    host HOST_IP
    port 24284
  </server>
</match>

and (B)'s td-agent.conf looks like this:

<source>
  type secure_forward
  shared_key 13ihudfwuh123ji
  self_hostname server.fqdn.local
  cert_auto_generate yes
</source>
<match secure.**>
  type stdout
</match>

Suppose both (A) and (B) are running. When (B) is restarted, (A) gets stuck trying to reconnect to (B).

See here for (A)'s td-agent.log as well as sigdump output.

I could reproduce this issue with CentOS 6 for (A) and Ubuntu 12.04 for (B), but the same problem has been reported with CentOS -> CentOS as well. See this thread on the mailing list.

tagomoris commented 10 years ago

Does this issue occur with Ruby 2.1 + Fluentd v0.10.x or not?

eredding-rmn commented 10 years ago

This is basically the same behavior as this issue: https://github.com/tagomoris/fluent-plugin-secure-forward/issues/7

I will say that in low-throughput clients, we see recovery after 100+ retries (I have retry count set to 1k right now because of these issues), but higher throughput clients never seem to reconnect.

kiyoto commented 10 years ago

@tagomoris It's with Ruby 2.1.2. See https://github.com/treasure-data/omnibus-td-agent/blob/7c9a73e1f502e264ac0b3f7b45b4e8dd662f47e1/config/software/ruby.rb#L18

@eredding-rmn Good to know. I am trying to see if this problem occurs with Ruby 2.0 and earlier + Fluentd 0.10.x

tagomoris commented 10 years ago

@kiyoto @eredding-rmn I've just fixed some bugs of reconnecting to failed nodes, and released v0.2.1. Can you try that version in your environment?

kiyoto commented 10 years ago

@tagomoris. I checked and it works! Thanks for the fix.

by the way, what do you think about changing https://github.com/tagomoris/fluent-plugin-secure-forward/blob/master/lib/fluent/plugin/out_secure_forward.rb#L169 to INFO and not DEBUG? Right now, if you are running secure_forward in a non-debug environment, the log does not show if the reconnect took place successfully.

tagomoris commented 10 years ago

@kiyoto I understood that situation, but simple log level change is bad idea. That message will appear each time of re-connection, even if it is from keep-alive expiration. So a user ,who specify 30sec for keep-alive, will see that log message every 30 seconds. It is not good for many users. I'll fix logs in a bit different way later.

kiyoto commented 10 years ago

@tagomoris Got it. Anything that shows "it was disconnected...but it got reconnected now" would be awesome =)

tagomoris commented 10 years ago

I've change log messages for connection/disconnection, with warn level on c4536f9. (Log level warn for connection/disconnection is same with out_forward.) How about this?

2014-10-22 11:40:39 +0900 [warn]: disconnected from localhost
2014-10-22 11:40:41 +0900 [warn]: dead connection found: localhost, reconnecting...
2014-10-22 11:40:41 +0900 [warn]: failed to connect for secure-forward error_class=Errno::ECONNREFUSED error=#<Errno::ECONNREFUSED: Connection refused - connect(2) for "127.0.0.1" port 24284> host="localhost" address="127.0.0.1" port=24284
2014-10-22 11:40:56 +0900 [warn]: dead connection found: localhost, reconnecting...
2014-10-22 11:41:01 +0900 [warn]: recovered connection to dead node: localhost

eredding-rmn commented 10 years ago

@tagomoris the fix looks good; much more stable.

kiyoto commented 10 years ago

@tagomoris LGTM! Thanks a lot.

tagomoris commented 10 years ago

@eredding-rmn @kiyoto Thanks! v0.2.2 released with this log formats/levels.

tagomoris / fluent-plugin-secure-forward

out_secure_forward blocks when the receiver node restarts #9