uken / fluent-plugin-elasticsearch

Apache License 2.0
891 stars 310 forks source link

td-agent consuming high CPU, It's normal again after restart agent #751

Open qiujun623 opened 4 years ago

qiujun623 commented 4 years ago

Describe the bug td-agent sometimes has CPU overload, It can be solved again througth restart agent, I'm not sure if it's due to APP's bug or td-agent's configuration problem。And the problem is hard to reproduce. I think flush_thread_count=6, workers=6 mabe too high, Is it possible that flush_thread_count=6&retry_forever=true causing this problem?

To Reproduce This issue happens randomly, I'm not sure how to reproduce it.

Environment

Configuration

<system>
  log_level info
  flush_thread_count 6
  workers 6
</system>

<source>
  @type syslog
  port 5140
  bind 0.0.0.0
  tag system
  message_length_limit 2048
  <parse>
    @type syslog
  </parse>
</source>

<filter system.**>
  @type record_transformer
  enable_ruby true
  <record>
    iplist "#{(Socket.ip_address_list.find_all do |m|m.ipv4_private? end).collect do |n|n.ip_address end}"
    log_type ${tag_parts[1]}
    log_priority ${tag_parts[2]}
  </record>
</filter>

<filter system.**>
  @type elasticsearch_genid
  hash_id_key _hash    # storing generated hash id key (default is _hash)
</filter>

<match system.**>
  @type elasticsearch_dynamic
  hosts {{ fluentd_elastic_hosts }}
  user {{ fluentd_user }}
  password {{ fluentd_pwd }}
  logstash_format true
  logstash_prefix fluentd.system.${tag_parts[1]}
  id_key _hash
  remove_keys _hash
  request_timeout 300s
  reload_connections false
  reload_on_failure true
  reconnect_on_error true
  verify_es_version_at_startup false
  default_elasticsearch_version 6
  with_transporter_log false
  <buffer tag>
    @type file
    path "/var/log/td-agent/buffer-elasticsearch"
    total_limit_size 1G
    chunk_limit_size 10M
    flush_interval 60s
    queued_chunks_limit_size 30
    flush_thread_count 6
    retry_forever true
  </buffer>
</match>

Error Log

2020-04-25 11:27:34 +0800 [warn]: #1 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=21.247868629000095 slow_flush_log_threshold=20.0 plugin_id="object:3fd40c86a284"
2020-04-25 11:27:41 +0800 [warn]: #4 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=20.366897790000166 slow_flush_log_threshold=20.0 plugin_id="object:3fcaf23a3638"
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5263 td-agent  20   0   1749548  474704  5180 R 103.6 2.9   32:11.28 ruby
 5278 td-agent  20   0   1716780  424688  5180 S 103.6 2.6   32:09.17 ruby
 5273 td-agent  20   0   2038316  336980  5300 S 100.0 2.1  32:12.73 ruby
11:33:01.803528 futex(0x7f052d649010, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000021>
11:33:01.803598 futex(0x7f052d649044, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f052d649040, {FUTEX_OI 
11:33:01.803677 futex(0x7f052d649010, FUTEX_WAKE_PRIVATE, 1) = 1 <0.002795>
11:33:01.806532 futex(0x7f052d6274b4, FUTEX_WAIT_BITSET_PRIVATE, 13519, {2691, 539519237}, 1 
11:33:02.104163 futex(0x7f052d627528, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000036>
11:33:02.104255 futex(0x7f052d649044, FUTEX_WAIT_PRIVATE, 1403857, NULL) = 0 <0.007284> 
11:33:02.111586 futex(0x7f052d649010, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000020>
11:33:02.111652 futex(0x7f052d649044, FUTEX_WAIT_PRIVATE, 1403864, NULL) = 0 <0.003537> 
11:33:02.115234 futex(0x7f052d649010, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000020>
11:33:02.115298 futex(0x7f052d649044, FUTEX_WAIT_PRIVATE, 1403870, NULL) = 0 <0.009812> 
11:33:02.125162 futex(0x7f052d649010, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000021>
11:33:02.125235 futex(0x7f052d649044, FUTEX_WAIT_PRIVATE, 1403878, NULL) = 0 <0.011393> 
11:33:02.136681 futex(0x7f052d649010, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000021>
11:33:02.136753 futex(0x7f052d649044, FUTEX_WAIT_PRIVATE, 1403887, NULL) = 0 <0.067613>
11:33:02.204429 futpxf0x7f052d64907c. FIJTEX WAKE OP PRIVATE. 1. 1. 0x7f052d649078

Additional context

cosmo0920 commented 4 years ago

Could you collect dump with SIGCOND during high CPU usage? https://docs.fluentd.org/deployment/trouble-shooting#dump-fluentd-internal-information