Open ThomKoomen opened 7 years ago
Hi, I have exactly the same problem, except that there is no network problem and restarting the td-agent service does not help at all. The stack trace fills the logs right after startup. As this continues, td-agent consumes all available memory, and then the oom_killer comes in and starts shooting.
My environment:
fluent-plugin-beats 0.1.3
td-agent 1.0.2
OS: Red Hat Enterprise Linux Server release 7.4
Hi @repeatedly ,
Did you have time to look into this? We are still experiencing this issue with fluentd and fluent-plugin-beats v1.0.0. It does not matter how many nodes send metricbeat data towards fluentd.
We've got multiple environments that each have the same fluentd setup, but only one of them (an environment of mostly stacked VMs on vCenter) is where we're experiencing the most problems.
It has multiple sites that can contain between 10 and 300 nodes, and any of those sites shuts down randomly, all with the exact same issues as described by Thom Koomen.
We're using multiple threads and writing the buffer to files, but nothing seems to help.
Do you have any suggestions as to what we can look at to try to resolve the issue?
Kind regards,
Teun Roefs
@TeunRoefs Which problem happens in your environment? Is shutdown blocked, or is connection handling not recovered after an error? fluentd v1.x and the beats plugin v1.x do not seem to have the former problem, so it is the latter error, right?
@repeatedly It seems to be the latter error; we've tried multiple versions but the errors seem to persist. We're also looking into another plugin that might be the issue: https://github.com/htgc/fluent-plugin-azureeventhubs
I'll come back to you once I have more information.
@TeunRoefs I see. To debug the problem, I want to reproduce this bad fd issue, but it is hard to do on my laptop, so I want to know how to reproduce it.
If closing the bad socket resolves the problem, adding Errno::EBADF to the error list may be a workaround.
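For illustration, a minimal standalone Ruby sketch (not the plugin's actual code; the method name and error set here are made up) of what "adding Errno::EBADF to the error list" could look like: treat the bad descriptor like other recoverable socket errors, close it, and let the caller move on instead of letting the exception escape and wedge the handler.

```ruby
require 'socket'

# Hypothetical connection handler: rescue Errno::EBADF alongside the usual
# connection errors so a bad fd is closed and dropped rather than re-raised.
def read_from_client(sock)
  sock.read_nonblock(4096)
rescue Errno::EBADF, Errno::ECONNRESET, Errno::EPIPE, IOError => e
  warn "closing bad connection: #{e.class}: #{e.message}"
  # The fd is no longer usable; close it (ignoring secondary errors) so the
  # server keeps accepting new connections.
  sock.close rescue nil
  nil
end
```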
@repeatedly To reproduce this issue, we created a fluentd-data-collector VM that uses fluent-plugin-beats and disabled its connection (cleared resolv.conf) towards the Azure Event Hub that we're using. This causes Fluentd to 'hang', and metricbeat can no longer send its data to Fluentd. The funny thing is that even metricbeat on localhost cannot send its data to Fluentd.
We then re-enable the network connection (adding the values back to resolv.conf), which should allow traffic to pass through again; however, it does not.
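For context, here is a rough standalone Ruby sketch of what emptying resolv.conf does to the outgoing side (the host, port, and error handling are placeholders, not the actual plugin code): name resolution fails, the connect raises, and any flush that depends on it keeps failing until DNS works again.

```ruby
require 'socket'

# Simulated "flush towards the Event Hub": with no working resolver,
# Socket.tcp fails in getaddrinfo and raises SocketError, so the caller
# sees repeated flush failures for as long as DNS stays broken.
def try_flush(host, port)
  Socket.tcp(host, port, connect_timeout: 5) { |sock| sock.write("payload") }
  true
rescue SocketError, Errno::ETIMEDOUT, Errno::ECONNREFUSED => e
  warn "flush failed: #{e.class}: #{e.message}"
  false
end

try_flush("example.servicebus.windows.net", 5671)
```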
When network issues occur while using td-agent with the fluent-plugin-beats plugin, it stops sending data, throwing the exception(s) and stack trace below several times. This is, of course, expected behaviour, and in fact quite helpful. However, after these network issues (automatically) recover, the plugin does not resume sending data. When trying to restart td-agent (`service td-agent restart`), it will hang indefinitely, requiring the td-agent processes to be killed. After eventually managing to get td-agent back up again, it runs without problems again.

td-agent log:
metricbeat log:
metricbeat config:
td-agent config:
After looking at the code, in an effort to fix it myself, I feel like the error handling may need a `retry`, and I suspect the `@thread.join` is preventing td-agent from properly restarting. However, I'm not that experienced with Ruby, so I may be completely wrong. While waiting for a response I'll keep trying to fix it and will try to keep this ticket updated.
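To illustrate the suspicion about `@thread.join`, here is a hypothetical standalone sketch (not the plugin's real code): if the worker loop never checks a stop flag, an unconditional join during shutdown waits forever, which is what an indefinitely hanging `service td-agent restart` looks like; signalling the loop and joining with a timeout lets shutdown proceed.

```ruby
# Hypothetical worker: @running and the sleep are stand-ins for the real
# accept/handle loop in the input plugin.
class Worker
  def start
    @running = true
    @thread = Thread.new do
      sleep 0.1 while @running
    end
  end

  def shutdown
    @running = false     # ask the loop to exit...
    @thread.join(5) or   # ...and wait a bounded time for it to finish
      @thread.kill       # last resort so a restart does not hang forever
  end
end
```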