repeatedly / fluent-plugin-beats

Fluentd plugin for Elastic beats
Apache License 2.0

Plugin does not recover after network issues #6

Open ThomKoomen opened 6 years ago

ThomKoomen commented 6 years ago

When network issues occur while using td-agent with the fluent-plugin-beats plugin, it stops sending data and throws the exception(s) and stack trace below several times. That part is, of course, expected behaviour, and it is actually quite helpful. However, after these network issues recover (automatically), the plugin does not resume sending data. Trying to restart td-agent (service td-agent restart) then hangs indefinitely, and the td-agent processes have to be killed. After eventually getting td-agent back up again, it runs without problems.

td-agent log:

2017-10-24 00:34:16 +0000 [error]: unexpected error error="Resource temporarily unavailable"
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:342:in `sysread'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:342:in `read_socket'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:319:in `run'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/fluent/plugin/in_beats.rb:97:in `block in run'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:331:in `call'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:331:in `run_task'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:320:in `block (3 levels) in create_worker'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:303:in `loop'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:303:in `block (2 levels) in create_worker'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:302:in `catch'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:302:in `block in create_worker'
2017-10-24 00:34:16 +0000 [error]: unexpected error error="Bad file descriptor"
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:342:in `sysread'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:342:in `read_socket'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:319:in `run'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/fluent/plugin/in_beats.rb:97:in `block in run'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:331:in `call'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:331:in `run_task'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:320:in `block (3 levels) in create_worker'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:303:in `loop'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:303:in `block (2 levels) in create_worker'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:302:in `catch'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:302:in `block in create_worker'

metricbeat log:

2017-10-24T02:34:35+02:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-diskio.events=6 fetches.system-diskio.success=1 fetches.system-filesystem.events=2 fetches.system-filesystem.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1276 libbeat.publisher.messages_in_worker_queues=10 libbeat.publisher.published_events=10
2017-10-24T02:34:36+02:00 ERR Failed to publish events caused by: read tcp 161.86.62.204:36038->161.86.62.206:5044: i/o timeout
2017-10-24T02:34:36+02:00 INFO Error publishing events (retrying): read tcp 161.86.62.204:36038->161.86.62.206:5044: i/o timeout
2017-10-24T02:35:05+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=801 libbeat.logstash.published_but_not_acked_events=10
2017-10-24T02:35:07+02:00 ERR Failed to publish events caused by: read tcp 161.86.62.204:54702->161.86.62.206:5044: i/o timeout
2017-10-24T02:35:07+02:00 INFO Error publishing events (retrying): read tcp 161.86.62.204:54702->161.86.62.206:5044: i/o timeout
2017-10-24T02:35:13+02:00 ERR Failed to publish events caused by: read tcp 161.86.62.204:54922->161.86.62.206:5044: read: connection reset by peer
2017-10-24T02:35:13+02:00 INFO Error publishing events (retrying): read tcp 161.86.62.204:54922->161.86.62.206:5044: read: connection reset by peer
2017-10-24T02:35:23+02:00 ERR Failed to publish events caused by: read tcp 161.86.62.204:54992->161.86.62.206:5044: read: connection reset by peer
2017-10-24T02:35:23+02:00 INFO Error publishing events (retrying): read tcp 161.86.62.204:54992->161.86.62.206:5044: read: connection reset by peer
2017-10-24T02:35:35+02:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-diskio.events=6 fetches.system-diskio.success=1 fetches.system-filesystem.events=2 fetches.system-filesystem.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 libbeat.logstash.call_count.PublishEvents=3 libbeat.logstash.publish.read_bytes=24 libbeat.logstash.publish.read_errors=3 libbeat.logstash.publish.write_bytes=2700 libbeat.logstash.published_and_acked_events=10 libbeat.logstash.published_but_not_acked_events=30 libbeat.outputs.messages_dropped=1 libbeat.publisher.messages_in_worker_queues=10 libbeat.publisher.published_events=10
2017-10-24T02:36:05+02:00 INFO No non-zero metrics in the last 30s
2017-10-24T02:36:35+02:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-diskio.events=6 fetches.system-diskio.success=1 fetches.system-filesystem.events=2 fetches.system-filesystem.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=60 libbeat.logstash.publish.write_bytes=1566 libbeat.logstash.published_and_acked_events=10 libbeat.publisher.messages_in_worker_queues=10 libbeat.publisher.published_events=10

metricbeat config:

###################### Metricbeat Configuration #######################

#==========================  Modules configuration ============================
metricbeat.modules:

#------------------------------- System Module -------------------------------
- module: system
  metricsets:
    - cpu
    - diskio
    - filesystem
    - network
    - memory
  enabled: true
  period: 60s
  filters:
    - drop_event.when.regexp.mount_point: '^/(sys|cgroup|proc|dev|etc|host|run|mnt/home|var/lib/lxcfs)($|/)'
  fields:
    uuid: xxxxx

#================================ Processors ==================================
processors:
- drop_event:
    when:
      regexp:
        system.diskio.name: "fd0|sr0"

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ['xx:xx:xx:xx:xxxx']

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ['']

  # Certificate for SSL client authentication
  #ssl.certificate: ''

  # Client Certificate Key
  #ssl.key: ''

#================================ Logging =====================================

# Sets log level. The default log level is info.
# Available log levels are: critical, error, warning, info, debug
#logging.level: debug

# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publish", "service".
#logging.selectors: ["*"]

td-agent config:

<source>
        @type beats
        metadata_as_tag
</source>
<filter *beat>
        @type record_transformer
        <record>
            tag ${tag}
        </record>
</filter>
<match metricbeat>
        ...
</match>

After looking at the code in an effort to fix it myself, I feel the error handling may need a retry, and I suspect the @thread.join is preventing td-agent from restarting properly. However, I'm not that experienced with Ruby, so I may be completely wrong. While waiting for a response I'll keep trying to fix it and will keep this ticket updated.
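To make that concrete, here is a rough Ruby sketch of the two changes I have in mind (illustrative only: handle_connection, process and log are placeholders, not the actual server.rb / in_beats.rb code). The idea is to close and drop a connection whose socket has gone bad instead of letting the worker thread die, and to use a bounded join on shutdown so a stuck thread cannot block a restart.

# Illustrative sketch only, not the real plugin code.
# `process` and `log` stand in for whatever the plugin actually does with the data.
def handle_connection(socket)
  loop do
    begin
      data = socket.sysread(16_384)
    rescue IO::WaitReadable, Errno::EAGAIN
      IO.select([socket], nil, nil, 30) or raise IOError, "read timeout"
      retry                                  # transient condition: wait for data and retry
    end
    process(data)
  end
rescue EOFError, IOError, Errno::EBADF, Errno::ECONNRESET => e
  log.warn "closing broken beats connection: #{e}"
ensure
  socket.close rescue nil                    # always release the file descriptor
end

def shutdown
  @server.close
  @thread.join(10) || @thread.kill           # bounded join so a restart cannot hang forever
end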

martinhynar commented 6 years ago

Hi, I have exactly the same problem, except that there is no network problem and restarting the td-agent service does not help at all. The trace fills the logs right after start. As this continues, td-agent consumes all available memory until oom_killer comes in and starts shooting.

My environment: fluent-plugin-beats 0.1.3, td-agent 1.0.2, OS: Red Hat Enterprise Linux Server release 7.4

TeunRoefs commented 5 years ago

Hi @repeatedly ,

Did you have time to look into this? We are still experiencing this issue with fluentd and fluent-plugin-beats v1.0.0. It does not matter how many nodes send metricbeat data towards fluentd.

We've got multiple environments that each have the same fluentd setup, and we're experiencing the most problems in only one of them (an environment consisting mostly of stacked VMs on vCenter).

It has multiple sites that can contain anywhere from 10 to 300 nodes, and any of those sites can shut down randomly, all with exactly the same issues as described by Thom Koomen.

We're using multiple threads and writing the buffer to files, but nothing seems to help.

Do you have any suggestions as to what we can look at to try and resolve the issue?

Kind regards,

Teun Roefs

repeatedly commented 5 years ago

@TeunRoefs Which problem happens in your environment? Is shutdown blocked, or is connection handling not recovered after an error? fluentd v1.x and beats plugin v1.x don't seem to have the former problem, so it's the latter error, right?

TeunRoefs commented 5 years ago

@repeatedly It seems to be the latter error; we've tried multiple versions but the errors seem to persist. We're also looking into another plugin that might be the issue: https://github.com/htgc/fluent-plugin-azureeventhubs

I'll come back to you once I have more information.

repeatedly commented 5 years ago

@TeunRoefs I see. To debug the problem, I want to reproduce this bad fd issue, but it is hard to do on my laptop, so I would like to know how to reproduce it.

If closing the bad socket resolves the problem, adding Errno::EBADF to the error list may be a workaround:

https://github.com/repeatedly/fluent-plugin-beats/blob/6fabc421454de5c2428aeb457694b41629675882/lib/lumberjack/beats/server.rb#L326
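Something along these lines, i.e. treating the bad fd like the other fatal socket errors so the connection gets closed and the fd is released (a paraphrased sketch of that spot, not a verbatim patch; the exact rescue list in the real server.rb may differ):

# Paraphrased sketch, not a verbatim patch of server.rb.
begin
  read_socket(&block)
rescue EOFError, IOError, Errno::ECONNRESET,
       Errno::EBADF                          # <-- the addition being suggested
  # the socket is unusable; fall through and close it below
ensure
  close
end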

TeunRoefs commented 5 years ago

@repeatedly To reproduce this issue, we created a fluentd-data-collector VM that uses fluent-plugin-beats and disabled its connection towards the Azure Event Hub we're using (by clearing resolv.conf). This causes Fluentd to 'hang', and metricbeat can no longer send its data to Fluentd. Funny thing is, even metricbeat on localhost cannot send its data to Fluentd anymore.

We then re-enable the network connection (adding the values back to resolv.conf), which should allow traffic to pass through again; however, this is not the case.