Closed amagnus closed 10 years ago
Can you find any logs of SecureForwardInput on input node? Or, can you set log level into TRACE or DEBUG? (-vv or -v option for fluentd)
Also seeing this while restarting td-agent on SecureForwardInput node:
2014-06-02 15:17:47 -0700 [warn]: unexpected error while shutting down error_class=ThreadError error=#<ThreadError: can't create Thread (11)>
2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/input_session.rb:22:in `initialize'
2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/input_session.rb:22:in `new'
2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/input_session.rb:22:in `initialize'
2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/in_secure_forward.rb:184:in `new'
2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/in_secure_forward.rb:184:in `block in run'
2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/in_secure_forward.rb:181:in `loop'
2014-06-02 15:17:47 -0700 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.1.6/lib/fluent/plugin/in_secure_forward.rb:181:in `run'
How can we set global log level to debug when using td-agent?
Cannot use per plugin log leveling as I'm on fluentd 0.10.41.
td-agent accepts command line options by sysconfig (for RHEL/CentOS)
# /etc/sysconfig/td-agent
TD_AGENT_ARGS="/usr/sbin/td-agent --group td-agent --log /var/log/td-agent/td-agent.log -vv"
On debian, it does not seems permitted...
I'm sorry that I missed about DAEMON_ARGS for debian td-agent.
# /etc/default/td-agent
DAEMON_ARGS="-vv"
This problem seems from memory troubles. Please check memory usage of the server with SecureForwardInput when it is in trouble.
You think td-agent user would have a memory usage limit? Using centos.
td-agent (and ruby) doesn't have such options. Fluentd with heavy traffics are likely to use more memory by fragmentations (td-agent also have jemalloc to reduce fragmentations, but it's not perfect).
Try lowering the keepalive value, for us, 2 mins of keepalive is the sweet spot.
Going to try the 2min keepalive for a few days.
@spsusbilla Do you mean that short keepalive makes fluentd's memory usage w/ secure-forward lower? Can i ask versions of your fluentd and ruby?
@tagomoris, sorry for not explaining before but lowering the keepalive to 2 mins eliminates the "no one nodes with valid ssl session" errors for us, thus ensuring that the server receives all logs/events sent. I'm not sure as to the memory usage of fluentd with the secure forward, but for us, the additional load the short keepalive puts into the client machines were acceptable to ensure data integrity.
By the way, we are using ruby version 1.9 and fluentd version 0.10.41.
@spsusbilla Thank you for explanation! I see.
@tagomoris, not a problem. I hope it was of help, as we are forever grateful for the amazing plugin.
Update: Still getting this error with the 2min keepalive, though it does not get stuck:
temporarily failed to flush the buffer. next_retry=2014-05-09 23:34:21 -0700 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69908390552060
Logged last night's occurrences on one SecureForwardOutput node with high traffic:
11:28pm
11:30pm
12:20am
1:08am
1:10am
2:00am
2:48am
3:38am
4:28am
5:18am
6:06am
6:56am
7:46am
7:48am
8:36am
8:38am
8:45am
We notice 50min and 2min intervals. 2min seems to be keepalive, but what could 50min be?
49 SSL connections are established on the collector.
lsof -i | grep td-agent | wc -l
Hmm...
Could you check the buffer instance id? (ex: instace=....
of error log) Are those instance ids are same or different?
And, how many servers are configured? I wanna check your configuration...
I think that memory usage troubles may occur by buffer flooding from no nodes available
.
But it seems that some part of error handling in secure-forward does not work well.
instance id is the same, until I restart td-agent then I see a different instance id.
There is about 60 fluentd nodes sending logs to a single fluentd server.
Let me know if you need the config of the node or server.
I see. I'll check code. Please wait for few days, but maybe config of input/output are very helpful for investigation.
Any idea of what else we can check?
Sorry, I'm now busy and don't touch this problem yet.
@amagnus Can you enable debug logs on SecureForwardOutput nodes & SecureForwardInput nodes? And please let me see your configuration sections of secure-input and output.
secure-output:
<match log.jvm>
type copy
<store>
type secure_forward
shared_key ************
self_hostname ${hostname}
keepalive 120
<server>
host ***************
</server>
buffer_type file
buffer_path /var/log/td-agent/myapp-forward
retry_limit 50
flush_interval 20s
</store>
</match>
secure-input:
<source>
type secure_forward
shared_key ****************
self_hostname **************
cert_auto_generate yes
</source>
Also enabled debug log.
I'm checking code, but cannot find any obvious bugs or problems. It may be better to try to remove keepalive, and I'm now writing that patch.
I've just released v0.1.9.pre.rc1
with bug fixes for keepalive connection handling. With this fix, there are less troubles on reconnecting of keepalive connections.
@amagnus Cloud you test this version? This prerelease gem can be installed by gem install --prerelease fluent-plugin-secure-forward
. NOTE This gem depends on Fluentd v0.10.46 or later.
@amagnus ping?
Same problem here! I use configuration like below:
<match p.**> type secure_forward self_hostname ${hostname} keepalive 120 shared_key somesecretkey <server> host fluentd.mydomain.com port 30001 </server> </match>
I can send log most of the time, but when I restart td-agent service on "fluentd.mydomain.com" the client td-agent show
2014-09-04 01:58:12 +0000 [warn]: disconnected from fluentd.mydomain.com 2014-09-04 01:58:13 +0000 [info]: dead connection found: fluentd.mydomain.com, reconnecting...
but never get reconnected logs never sent out and keep getting below message what can i do?
2014-09-04 01:59:22 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2014-09-04 01:59:22 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69869851371920 2014-09-04 01:59:22 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.2.0/lib/fluent/plugin/out_secure_forward.rb:212:in `write_objects' 2014-09-04 01:59:22 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:449:in `write' 2014-09-04 01:59:22 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:296:in `write_chunk' 2014-09-04 01:59:22 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:276:in `pop' 2014-09-04 01:59:22 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:310:in `try_flush' 2014-09-04 01:59:22 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:132:in `run' 2014-09-04 01:59:23 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2014-09-04 01:59:25 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69869851371920 2014-09-04 01:59:23 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.2.0/lib/fluent/plugin/out_secure_forward.rb:212:in `write_objects' 2014-09-04 01:59:23 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:449:in `write' 2014-09-04 01:59:23 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:296:in `write_chunk' 2014-09-04 01:59:23 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:276:in `pop' 2014-09-04 01:59:23 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:310:in `try_flush' 2014-09-04 01:59:23 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:132:in `run' 2014-09-04 01:59:26 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2014-09-04 01:59:29 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69869851371920 2014-09-04 01:59:26 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.2.0/lib/fluent/plugin/out_secure_forward.rb:212:in `write_objects' 2014-09-04 01:59:26 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:449:in `write' 2014-09-04 01:59:26 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:296:in `write_chunk' 2014-09-04 01:59:26 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:276:in `pop' 2014-09-04 01:59:26 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:310:in `try_flush' 2014-09-04 01:59:26 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:132:in `run' 2014-09-04 01:59:30 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2014-09-04 01:59:36 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69869851371920 2014-09-04 01:59:30 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.2.0/lib/fluent/plugin/out_secure_forward.rb:212:in `write_objects' 2014-09-04 01:59:30 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:449:in `write' 2014-09-04 01:59:30 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:296:in `write_chunk' 2014-09-04 01:59:30 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:276:in `pop' 2014-09-04 01:59:30 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:310:in `try_flush' 2014-09-04 01:59:30 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:132:in `run' 2014-09-04 01:59:37 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2014-09-04 01:59:51 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69869851371920 2014-09-04 01:59:37 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.2.0/lib/fluent/plugin/out_secure_forward.rb:212:in `write_objects' 2014-09-04 01:59:37 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:449:in `write' 2014-09-04 01:59:37 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:296:in `write_chunk' 2014-09-04 01:59:37 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:276:in `pop' 2014-09-04 01:59:37 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:310:in `try_flush' 2014-09-04 01:59:37 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:132:in `run' 2014-09-04 01:59:52 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2014-09-04 02:00:19 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" instance=69869851371920 2014-09-04 01:59:52 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-secure-forward-0.2.0/lib/fluent/plugin/out_secure_forward.rb:212:in `write_objects' 2014-09-04 01:59:52 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:449:in `write' 2014-09-04 01:59:52 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:296:in `write_chunk' 2014-09-04 01:59:52 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/buffer.rb:276:in `pop' 2014-09-04 01:59:52 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:310:in `try_flush' 2014-09-04 01:59:52 +0000 [warn]: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.50/lib/fluent/output.rb:132:in `run'
I've had a lot of problems with this in very high-throughput, insecure environments. I can repeatably create a scenario in which td-agent never recovers with this simple test:
I've observed there's always a set of buffer files that never seem to be parsed - it's as if the daemon can't keep up. Also, the connection that shows in netstat goes away, while there are still messages being forwarded - I'd say 1/2 of the messages are being sent, and the other half, buffered. The retry logic seems to be the issue here. I've also noticed some messages end up being broken in strange places once stack traces start being forwarded.
If you restart td-agent, the buffer files are read off really fast and service resumes.
Thank you for great steps to reproduce! I'll investigate later. I'm sorry that I have no time for this issue now.
@amagnus @eredding-rmn @kytu800 @spsusbilla Hi all, After some reports and investigation, I found some bugs in re-connecting code to failed node, fixed it and released v0.2.1. Can you try that new version in your environment?
@tagomoris, as mentioned in the other bug report, the fix pushed out looks good.
OK, I'll close this issue. Please re-open for another reproduced cases. Thank you for many reports and tests.
Hello,
I'm using version 0.1.6 of the plugin to collect logs. From time to time the node using SecureForwardOutput would stop sending logs and get this error:
Restarting td-agent on the node using SecureForwardInput fixes the problem.
What do you recommend?