kiyoto closed this issue 10 years ago
Does this issue occur with Ruby 2.1 + Fluentd v0.10.x or not?
This is basically the same behavior as this issue: https://github.com/tagomoris/fluent-plugin-secure-forward/issues/7
I will say that in low-throughput clients, we see recovery after 100+ retries (I have retry count set to 1k right now because of these issues), but higher throughput clients never seem to reconnect.
@tagomoris It's with Ruby 2.1.2. See https://github.com/treasure-data/omnibus-td-agent/blob/7c9a73e1f502e264ac0b3f7b45b4e8dd662f47e1/config/software/ruby.rb#L18
@eredding-rmn Good to know. I am trying to see if this problem occurs with Ruby 2.0 and earlier + Fluentd 0.10.x
@kiyoto @eredding-rmn I've just fixed some bugs in reconnecting to failed nodes, and released v0.2.1. Can you try that version in your environment?
@tagomoris I checked and it works! Thanks for the fix.
By the way, what do you think about changing the log level at https://github.com/tagomoris/fluent-plugin-secure-forward/blob/master/lib/fluent/plugin/out_secure_forward.rb#L169 from DEBUG to INFO? Right now, if you are running secure_forward in a non-debug environment, the log does not show whether the reconnect succeeded.
@kiyoto I understand the situation, but simply changing the log level is a bad idea. That message appears on every reconnection, even one caused by keep-alive expiration. So a user who specifies 30 seconds for keep-alive will see that log message every 30 seconds, which is not good for many users. I'll fix the logs in a slightly different way later.
@tagomoris Got it. Anything that shows "it was disconnected...but it got reconnected now" would be awesome =)
I've changed the log messages for connection/disconnection to warn level in c4536f9.
(The warn log level for connection/disconnection is the same as in out_forward.)
How about this?
2014-10-22 11:40:39 +0900 [warn]: disconnected from localhost
2014-10-22 11:40:41 +0900 [warn]: dead connection found: localhost, reconnecting...
2014-10-22 11:40:41 +0900 [warn]: failed to connect for secure-forward error_class=Errno::ECONNREFUSED error=#<Errno::ECONNREFUSED: Connection refused - connect(2) for "127.0.0.1" port 24284> host="localhost" address="127.0.0.1" port=24284
2014-10-22 11:40:56 +0900 [warn]: dead connection found: localhost, reconnecting...
2014-10-22 11:41:01 +0900 [warn]: recovered connection to dead node: localhost
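The distinction discussed in this thread (routine keep-alive reconnects stay quiet, while recovery of a dead node is logged at warn) can be sketched roughly as below. This is only an illustration, not the plugin's actual code: `NodeSketch`, `mark_dead`, `connected`, and `demo_log_lines` are hypothetical names, and Ruby's standard `Logger` stands in for Fluentd's logger.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for a forwarding node; NOT the plugin's real class.
class NodeSketch
  def initialize(host, log)
    @host = host
    @log = log
    @dead = false
  end

  # Called when the connection is lost unexpectedly (not by keep-alive expiry).
  def mark_dead
    @dead = true
    @log.warn("disconnected from #{@host}")
  end

  # Called after every successful (re)connection.
  def connected(reason)
    if @dead
      # Only a recovery from a dead state is worth a warn-level message.
      @dead = false
      @log.warn("recovered connection to dead node: #{@host}")
    else
      # Routine reconnects (e.g. keep-alive expiry) stay at debug level.
      @log.debug("reconnected to #{@host} (#{reason})")
    end
  end
end

# Runs a small scenario and returns the lines visible at INFO level.
def demo_log_lines
  io = StringIO.new
  log = Logger.new(io)
  log.level = Logger::INFO
  log.formatter = proc { |severity, _time, _prog, msg| "[#{severity.downcase}]: #{msg}\n" }

  node = NodeSketch.new('localhost', log)
  node.connected(:keepalive) # routine reconnect, suppressed at INFO
  node.mark_dead             # unexpected disconnect
  node.connected(:retry)     # recovery, logged at warn
  io.string.lines.map(&:chomp)
end
```

With this split, a user with a 30-second keep-alive sees nothing in a non-debug environment until a node actually dies and comes back.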
@tagomoris the fix looks good; much more stable.
@tagomoris LGTM! Thanks a lot.
@eredding-rmn @kiyoto Thanks! v0.2.2 has been released with these log formats/levels.
Consider the following setup with td-agent 2.1 + secure_forward v0.2.0
(A)'s td-agent.conf looks like this:
and (B)'s td-agent.conf looks like this:
Suppose both (A) and (B) are running. When (B) is restarted, (A) gets stuck trying to reconnect to (B).
See here for (A)'s td-agent.log as well as sigdump output.
I could reproduce this issue with CentOS 6 for (A) and Ubuntu 12.04 for (B), but the same problem has been reported with CentOS -> CentOS as well. See this thread on the mailing list.