tagomoris / fluent-plugin-secure-forward

Other
140 stars 30 forks source link

Getting "no one nodes with valid ssl session" error #7

Closed amagnus closed 10 years ago

amagnus commented 10 years ago

Hello,

I'm using version 0.1.6 of the plugin to collect logs. From time to time the node using SecureForwardOutput would stop sending logs and get this error:

2014-05-09 23:34:06 -0700 [warn]: temporarily failed to flush the buffer. next_retry=2014-05-09 23:34:21 -0700 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69908390552060
  2014-05-09 23:34:06 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/out_secure_forward.rb:168:in `write_objects'
  2014-05-09 23:34:06 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/output.rb:449:in `write'
  2014-05-09 23:34:06 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/buffer.rb:296:in `write_chunk'
  2014-05-09 23:34:06 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/buffer.rb:276:in `pop'
  2014-05-09 23:34:06 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/output.rb:310:in `try_flush'
  2014-05-09 23:34:06 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/output.rb:132:in `run'
2014-05-09 23:34:10 -0700 [info]: dead connection found: kibana.clearslide.net, reconnecting...

Restarting td-agent on the node using SecureForwardInput fixes the problem.

What do you recommend?

tagomoris commented 10 years ago

Can you find any logs of SecureForwardInput on input node? Or, can you set log level into TRACE or DEBUG? (-vv or -v option for fluentd)

amagnus commented 10 years ago

Also seeing this while restarting td-agent on SecureForwardInput node:

2014-06-02 15:17:47 -0700 [warn]: unexpected error while shutting down error_class=ThreadError error=#<ThreadError: can't create Thread (11)>
  2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/input_session.rb:22:in `initialize'
  2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/input_session.rb:22:in `new'
  2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/input_session.rb:22:in `initialize'
  2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/in_secure_forward.rb:184:in `new'
  2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/in_secure_forward.rb:184:in `block in run'
  2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/in_secure_forward.rb:181:in `loop'
  2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/in_secure_forward.rb:181:in `run'
amagnus commented 10 years ago

How can we set global log level to debug when using td-agent?

Cannot use per plugin log leveling as I'm on fluentd 0.10.41.

tagomoris commented 10 years ago

td-agent accepts command line options by sysconfig (for RHEL/CentOS)

# /etc/sysconfig/td-agent
TD_AGENT_ARGS="/usr/sbin/td-agent --group td-agent --log /var/log/td-agent/td-agent.log -vv"

On debian, it does not seems permitted...

tagomoris commented 10 years ago

I'm sorry that I missed about DAEMON_ARGS for debian td-agent.

# /etc/default/td-agent
DAEMON_ARGS="-vv"
tagomoris commented 10 years ago

This problem seems from memory troubles. Please check memory usage of the server with SecureForwardInput when it is in trouble.

amagnus commented 10 years ago

You think td-agent user would have a memory usage limit? Using centos.

tagomoris commented 10 years ago

td-agent (and ruby) doesn't have such options. Fluentd with heavy traffics are likely to use more memory by fragmentations (td-agent also have jemalloc to reduce fragmentations, but it's not perfect).

spsusbilla commented 10 years ago

Try lowering the keepalive value, for us, 2 mins of keepalive is the sweet spot.

amagnus commented 10 years ago

Going to try the 2min keepalive for a few days.

tagomoris commented 10 years ago

@spsusbilla Do you mean that short keepalive makes fluentd's memory usage w/ secure-forward lower? Can i ask versions of your fluentd and ruby?

spsusbilla commented 10 years ago

@tagomoris, sorry for not explaining before but lowering the keepalive to 2 mins eliminates the "no one nodes with valid ssl session" errors for us, thus ensuring that the server receives all logs/events sent. I'm not sure as to the memory usage of fluentd with the secure forward, but for us, the additional load the short keepalive puts into the client machines were acceptable to ensure data integrity.

By the way, we are using ruby version 1.9 and fluentd version 0.10.41.

tagomoris commented 10 years ago

@spsusbilla Thank you for explanation! I see.

spsusbilla commented 10 years ago

@tagomoris, not a problem. I hope it was of help, as we are forever grateful for the amazing plugin.

amagnus commented 10 years ago

Update: Still getting this error with the 2min keepalive, though it does not get stuck:

temporarily failed to flush the buffer. next_retry=2014-05-09 23:34:21 -0700 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69908390552060

Logged last night's occurrences on one SecureForwardOutput node with high traffic:

11:28pm
11:30pm
12:20am
1:08am
1:10am
2:00am
2:48am
3:38am
4:28am
5:18am
6:06am
6:56am
7:46am
7:48am
8:36am
8:38am
8:45am

We notice 50min and 2min intervals. 2min seems to be keepalive, but what could 50min be?

amagnus commented 10 years ago

49 SSL connections are established on the collector.

lsof -i | grep td-agent | wc -l

tagomoris commented 10 years ago

Hmm... Could you check the buffer instance id? (ex: instace=.... of error log) Are those instance ids are same or different? And, how many servers are configured? I wanna check your configuration...

I think that memory usage troubles may occur by buffer flooding from no nodes available. But it seems that some part of error handling in secure-forward does not work well.

amagnus commented 10 years ago

instance id is the same, until I restart td-agent then I see a different instance id.

There is about 60 fluentd nodes sending logs to a single fluentd server.

Let me know if you need the config of the node or server.

tagomoris commented 10 years ago

I see. I'll check code. Please wait for few days, but maybe config of input/output are very helpful for investigation.

amagnus commented 10 years ago

Any idea of what else we can check?

tagomoris commented 10 years ago

Sorry, I'm now busy and don't touch this problem yet.

tagomoris commented 10 years ago

@amagnus Can you enable debug logs on SecureForwardOutput nodes & SecureForwardInput nodes? And please let me see your configuration sections of secure-input and output.

amagnus commented 10 years ago

secure-output:

<match log.jvm>
  type copy
  <store>
    type secure_forward
    shared_key    ************
    self_hostname    ${hostname}
    keepalive 120

    <server>
      host ***************
    </server>

    buffer_type file
    buffer_path /var/log/td-agent/myapp-forward
    retry_limit 50
    flush_interval 20s
  </store>
</match>

secure-input:

<source>
  type secure_forward
  shared_key         ****************
  self_hostname      **************
  cert_auto_generate yes
</source>

Also enabled debug log.

tagomoris commented 10 years ago

I'm checking code, but cannot find any obvious bugs or problems. It may be better to try to remove keepalive, and I'm now writing that patch.

tagomoris commented 10 years ago

I've just released v0.1.9.pre.rc1 with bug fixes for keepalive connection handling. With this fix, there are less troubles on reconnecting of keepalive connections.

@amagnus Cloud you test this version? This prerelease gem can be installed by gem install --prerelease fluent-plugin-secure-forward. NOTE This gem depends on Fluentd v0.10.46 or later.

tagomoris commented 10 years ago

@amagnus ping?

kytu800 commented 10 years ago

Same problem here! I use configuration like below:

Client

<match p.**>
   type secure_forward
   self_hostname ${hostname}
   keepalive 120
   shared_key somesecretkey
   <server>
     host fluentd.mydomain.com
     port 30001
   </server>
</match>

I can send log most of the time, but when I restart td-agent service on "fluentd.mydomain.com" the client td-agent show

2014-09-04 01:58:12 +0000 [warn]: disconnected from fluentd.mydomain.com
2014-09-04 01:58:13 +0000 [info]: dead connection found: fluentd.mydomain.com, reconnecting...

but never get reconnected logs never sent out and keep getting below message what can i do?

2014-09-04 01:59:22 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2014-09-04 01:59:22 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69869851371920
  2014-09-04 01:59:22 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.2.0/lib/fluent/plugin/out_secure_forward.rb:212:in `write_objects'
  2014-09-04 01:59:22 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:449:in `write'
  2014-09-04 01:59:22 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:296:in `write_chunk'
  2014-09-04 01:59:22 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:276:in `pop'
  2014-09-04 01:59:22 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:310:in `try_flush'
  2014-09-04 01:59:22 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:132:in `run'
2014-09-04 01:59:23 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2014-09-04 01:59:25 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69869851371920
  2014-09-04 01:59:23 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.2.0/lib/fluent/plugin/out_secure_forward.rb:212:in `write_objects'
  2014-09-04 01:59:23 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:449:in `write'
  2014-09-04 01:59:23 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:296:in `write_chunk'
  2014-09-04 01:59:23 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:276:in `pop'
  2014-09-04 01:59:23 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:310:in `try_flush'
  2014-09-04 01:59:23 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:132:in `run'
2014-09-04 01:59:26 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2014-09-04 01:59:29 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69869851371920
  2014-09-04 01:59:26 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.2.0/lib/fluent/plugin/out_secure_forward.rb:212:in `write_objects'
  2014-09-04 01:59:26 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:449:in `write'
  2014-09-04 01:59:26 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:296:in `write_chunk'
  2014-09-04 01:59:26 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:276:in `pop'
  2014-09-04 01:59:26 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:310:in `try_flush'
  2014-09-04 01:59:26 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:132:in `run'
2014-09-04 01:59:30 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2014-09-04 01:59:36 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69869851371920
  2014-09-04 01:59:30 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.2.0/lib/fluent/plugin/out_secure_forward.rb:212:in `write_objects'
  2014-09-04 01:59:30 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:449:in `write'
  2014-09-04 01:59:30 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:296:in `write_chunk'
  2014-09-04 01:59:30 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:276:in `pop'
  2014-09-04 01:59:30 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:310:in `try_flush'
  2014-09-04 01:59:30 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:132:in `run'
2014-09-04 01:59:37 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2014-09-04 01:59:51 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69869851371920
  2014-09-04 01:59:37 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.2.0/lib/fluent/plugin/out_secure_forward.rb:212:in `write_objects'
  2014-09-04 01:59:37 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:449:in `write'
  2014-09-04 01:59:37 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:296:in `write_chunk'
  2014-09-04 01:59:37 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:276:in `pop'
  2014-09-04 01:59:37 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:310:in `try_flush'
  2014-09-04 01:59:37 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:132:in `run'
2014-09-04 01:59:52 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2014-09-04 02:00:19 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69869851371920
  2014-09-04 01:59:52 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.2.0/lib/fluent/plugin/out_secure_forward.rb:212:in `write_objects'
  2014-09-04 01:59:52 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:449:in `write'
  2014-09-04 01:59:52 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:296:in `write_chunk'
  2014-09-04 01:59:52 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:276:in `pop'
  2014-09-04 01:59:52 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:310:in `try_flush'
  2014-09-04 01:59:52 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:132:in `run'
eredding-rmn commented 10 years ago

I've had a lot of problems with this in very high-throughput, insecure environments. I can repeatably create a scenario in which td-agent never recovers with this simple test:

I've observed there's always a set of buffer files that never seem to be parsed - it's as if the daemon can't keep up. Also, the connection that shows in netstat goes away, while there are still messages being forwarded - I'd say 1/2 of the messages are being sent, and the other half, buffered. The retry logic seems to be the issue here. I've also noticed some messages end up being broken in strange places once stack traces start being forwarded.

If you restart td-agent, the buffer files are read off really fast and service resumes.

tagomoris commented 10 years ago

Thank you for great steps to reproduce! I'll investigate later. I'm sorry that I have no time for this issue now.

tagomoris commented 10 years ago

@amagnus @eredding-rmn @kytu800 @spsusbilla Hi all, After some reports and investigation, I found some bugs in re-connecting code to failed node, fixed it and released v0.2.1. Can you try that new version in your environment?

eredding-rmn commented 10 years ago

@tagomoris, as mentioned in the other bug report, the fix pushed out looks good.

tagomoris commented 10 years ago

OK, I'll close this issue. Please re-open for another reproduced cases. Thank you for many reports and tests.