Open ashie opened 8 months ago
In fluent-package v5.0.2 for Windows, we observed that supervisor process finished unexpectedly after about 1 hour passed while repeating worker recovery. After digging this issue, we found handle leak in the supervisor process.
https://github.com/fluent/fluent-package-builder/pull/620#issuecomment-1958718794
While this situation, opened handles are continually increased, over than 8400 at last. We got following backtrace on finishing supervisor process. 2024-02-21 15:34:23 +0900 [debug]: fluent/log.rb:341:debug: Got Win32 event "fluentd_7928_STOP_EVENT_THREAD" Unexpected error undefined method `pid' for nil:NilClass C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:417:in `after_start' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:77:in `ensure in start_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:77:in `start_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:175:in `delayed_start_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:159:in `restart_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:125:in `block in keepalive_workers' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `each' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `each_with_index' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `keepalive_workers' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:58:in `run' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:50:in `run' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/server.rb:128:in `main' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/daemon.rb:119:in `main' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/daemon.rb:68:in `run' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:796:in `supervise' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:582:in `run_supervisor' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/command/fluentd.rb:352:in `<top (required)>' <internal:C:/opt/fluent/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require' <internal:C:/opt/fluent/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/bin/fluentd:15:in `<top (required)>' C:/opt/fluent/bin/fluentd:32:in `load' C:/opt/fluent/bin/fluentd:32:in `<main>' In the above log, the root cause is squashed by ensure. So I fetched additional backtrace of the exception: #<Errno::EMFILE: Too many open files - dup> C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/process_manager.rb:190:in `spawn' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:413:in `spawn' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:75:in `start_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:175:in `delayed_start_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:159:in `restart_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:125:in `block in keepalive_workers' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `each' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `each_with_index' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `keepalive_workers' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:58:in `run' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:50:in `run' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/server.rb:128:in `main' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/daemon.rb:119:in `main' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/daemon.rb:68:in `run' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:796:in `supervise' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:582:in `run_supervisor' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/command/fluentd.rb:352:in `<top (required)>' <internal:C:/opt/fluent/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require' <internal:C:/opt/fluent/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/bin/fluentd:15:in `<top (required)>' C:/opt/fluent/bin/fluentd:32:in `load' C:/opt/fluent/bin/fluentd:32:in `<main>'
While this situation, opened handles are continually increased, over than 8400 at last.
We got following backtrace on finishing supervisor process.
2024-02-21 15:34:23 +0900 [debug]: fluent/log.rb:341:debug: Got Win32 event "fluentd_7928_STOP_EVENT_THREAD" Unexpected error undefined method `pid' for nil:NilClass C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:417:in `after_start' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:77:in `ensure in start_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:77:in `start_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:175:in `delayed_start_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:159:in `restart_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:125:in `block in keepalive_workers' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `each' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `each_with_index' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `keepalive_workers' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:58:in `run' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:50:in `run' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/server.rb:128:in `main' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/daemon.rb:119:in `main' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/daemon.rb:68:in `run' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:796:in `supervise' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:582:in `run_supervisor' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/command/fluentd.rb:352:in `<top (required)>' <internal:C:/opt/fluent/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require' <internal:C:/opt/fluent/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/bin/fluentd:15:in `<top (required)>' C:/opt/fluent/bin/fluentd:32:in `load' C:/opt/fluent/bin/fluentd:32:in `<main>'
In the above log, the root cause is squashed by ensure. So I fetched additional backtrace of the exception:
ensure
#<Errno::EMFILE: Too many open files - dup> C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/process_manager.rb:190:in `spawn' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:413:in `spawn' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:75:in `start_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:175:in `delayed_start_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:159:in `restart_worker' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:125:in `block in keepalive_workers' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `each' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `each_with_index' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:102:in `keepalive_workers' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_worker_server.rb:58:in `run' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/multi_spawn_server.rb:50:in `run' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/server.rb:128:in `main' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/daemon.rb:119:in `main' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/serverengine-2.3.2/lib/serverengine/daemon.rb:68:in `run' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:796:in `supervise' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/supervisor.rb:582:in `run_supervisor' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/command/fluentd.rb:352:in `<top (required)>' <internal:C:/opt/fluent/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require' <internal:C:/opt/fluent/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require' C:/opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/bin/fluentd:15:in `<top (required)>' C:/opt/fluent/bin/fluentd:32:in `load' C:/opt/fluent/bin/fluentd:32:in `<main>'
Although I'm not sure it's Fluentd's issue or ServerEngine's issue yet, we'll track this issue here until we clarify the root cause.
In fluent-package v5.0.2 for Windows, we observed that supervisor process finished unexpectedly after about 1 hour passed while repeating worker recovery. After digging this issue, we found handle leak in the supervisor process.
https://github.com/fluent/fluent-package-builder/pull/620#issuecomment-1958718794
Although I'm not sure it's Fluentd's issue or ServerEngine's issue yet, we'll track this issue here until we clarify the root cause.