treasure-data / serverengine

A framework to implement robust multiprocess servers like Unicorn
Apache License 2.0
759 stars 86 forks source link

Suspected resource leak in worker recovery feature #145

Open ashie opened 8 months ago

ashie commented 8 months ago

In fluent-package v5.0.2 for Windows, we observed that supervisor process finished unexpectedly after about 1 hour passed while repeating worker recovery. After digging this issue, we found handle leak in the supervisor process.

https://github.com/fluent/fluent-package-builder/pull/620#issuecomment-1958718794

While this situation, opened handles are continually increased, over than 8400 at last.

We got following backtrace on finishing supervisor process.

2024-02-21 15:34:23 +0900 [debug]: fluent/log.rb:341:debug: Got Win32 event "fluentd_7928_STOP_EVENT_THREAD"
Unexpected error undefined method `pid' for nil:NilClass
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:417:in `after_start'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:77:in `ensure in start_worker'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:77:in `start_worker'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:175:in `delayed_start_worker'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:159:in `restart_worker'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:125:in `block in keepalive_workers'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `each'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `each_with_index'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `keepalive_workers'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:58:in `run'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:50:in `run'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/server.rb:128:in `main'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/daemon.rb:119:in `main'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/daemon.rb:68:in `run'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:796:in `supervise'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:582:in `run_supervisor'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/command/fluentd.rb:352:in `<top (required)>'
  <internal:C:/opt/fluent/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
  <internal:C:/opt/fluent/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
  C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/bin/fluentd:15:in `<top (required)>'
  C:/opt/fluent/bin/fluentd:32:in `load'
  C:/opt/fluent/bin/fluentd:32:in `<main>'

In the above log, the root cause is squashed by ensure. So I fetched additional backtrace of the exception:

#<Errno::EMFILE: Too many open files - dup>
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/process_manager.rb:190:in `spawn'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:413:in `spawn'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:75:in `start_worker'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:175:in `delayed_start_worker'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:159:in `restart_worker'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:125:in `block in keepalive_workers'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `each'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `each_with_index'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `keepalive_workers'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:58:in `run'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:50:in `run'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/server.rb:128:in `main'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/daemon.rb:119:in `main'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/daemon.rb:68:in `run'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:796:in `supervise'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:582:in `run_supervisor'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/command/fluentd.rb:352:in `<top (required)>'
<internal:C:/opt/fluent/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
<internal:C:/opt/fluent/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/bin/fluentd:15:in `<top (required)>'
C:/opt/fluent/bin/fluentd:32:in `load'
C:/opt/fluent/bin/fluentd:32:in `<main>'

Although I'm not sure it's Fluentd's issue or ServerEngine's issue yet, we'll track this issue here until we clarify the root cause.