socketry / async

An awesome asynchronous event-driven reactor for Ruby.
MIT License
2.16k stars 90 forks source link

`system` hangs if file does not exist #351

Closed cretz closed 5 days ago

cretz commented 3 weeks ago

If you run system inside Async with a non-existent executable on Linux with Ruby 3.3 it hangs. This behavior is confirmed in 2.17.0 and 2.18.0 (I didn't check any earlier than that).

This synchronous code works:

puts "Running on: #{RbConfig::CONFIG['host_os']}"
res = system 'echo', 'hello'
puts "Result 1: #{res}"
res = system 'does-not-exist', 'hello'
puts "Result 2: #{res}"

And outputs:

Running on: linux
hello
Result 1: true
Result 2:

But this code does not:

require 'async'

Async do
  puts "Running on: #{RbConfig::CONFIG['host_os']}"
  res = system 'echo', 'hello'
  puts "Result 1: #{res}"
  res = system 'does-not-exist', 'hello'
  puts "Result 2: #{res}"
end

This hangs on the second system call. This was discovered after a deep dive of why Minitest assert_equal was hanging. It turns out, Minitest uses diff on the system when reporting an assert_equal failure, but first checks for gdiff which we don't have, ref: https://github.com/minitest/minitest/blob/e4417c5b13ab917a60c218700041152136bc07c8/lib/minitest/assertions.rb#L33. So we have to Minitest::Assertions.diff = 'diff' to get Async-based tests to not break here.

(sorry if this is already reported, I could not find an issue)

ioquatix commented 3 weeks ago

Nice find. I didn't know about this! haha.

image

I'll investigate the issue today.

ioquatix commented 3 weeks ago

I could not reproduce from the given example.

samuel@aiko ~/D/s/async (main)> ruby --version
ruby 3.3.4 (2024-07-09 revision be1089c8ec) [x86_64-linux]
samuel@aiko ~/D/s/async (main)> cat ./test.rb
#!/usr/bin/env ruby

require "async"

Async do
    puts "Running on: #{RbConfig::CONFIG['host_os']}"
    res = system 'echo', 'hello'
    puts "Result 1: #{res.inspect}"
    res = system 'does-not-exist', 'hello'
    puts "Result 2: #{res.inspect}"
end
samuel@aiko ~/D/s/async (main)> bundle exec ./test.rb
Running on: linux
hello
Result 1: true
Result 2: nil

Can you confirm I've got the correct reproduction script?

ioquatix commented 3 weeks ago

I tried to reproduce the original error but couldn't:

#!/usr/bin/env ruby

require "minitest/autorun"
require "async"

class Test < Minitest::Test
    BIG_TEXT_A = "The quick brown fox jumps over the lazy dog.\n" * 10
    BIG_TEXT_B = BIG_TEXT_A.dup.sub("quick", "slow")

    def test_diff
        Async do |task|
            assert_equal(BIG_TEXT_A, BIG_TEXT_B)
        end
    end
end

I added some instrumentation to the scheduler process_wait hook:

        # Wait for the specified process ID to exit.
        # @parameter pid [Integer] The process ID to wait for.
        # @parameter flags [Integer] A bit-mask of flags suitable for `Process::Status.wait`.
        # @returns [Process::Status] A process status instance.
        # @asynchronous May be non-blocking..
        def process_wait(pid, flags)
            return @selector.process_wait(Fiber.current, pid, flags).tap{|result|
                Fiber.blocking{$stderr.puts "process_wait(#{pid}, #{flags}) -> #{result.inspect}"}
            }
        end

It appears to work okay:

> bundle exec ./test.rb
Run options: --seed 20496

# Running:

process_wait(927854, 0) -> #<Process::Status: pid 927854 exit 127>
process_wait(927855, 0) -> #<Process::Status: pid 927855 exit 0>
process_wait(927856, 0) -> #<Process::Status: pid 927856 exit 1>
F

Finished in 0.002793s, 358.0683 runs/s, 358.0683 assertions/s.

  1) Failure:
Test#test_diff [test.rb:12]:
--- expected
+++ actual
@@ -1,4 +1,4 @@
-"The quick brown fox jumps over the lazy dog.
+"The slow brown fox jumps over the lazy dog.
 The quick brown fox jumps over the lazy dog.
 The quick brown fox jumps over the lazy dog.
 The quick brown fox jumps over the lazy dog.

1 runs, 1 assertions, 1 failures, 0 errors, 0 skips
ioquatix commented 3 weeks ago

Is it possible something in your test suite is monkey patching Kernel::system?

cretz commented 3 weeks ago

Interesting, this is breaking for me:

cretz@DESKTOP:~/work/temp$ cat ./test.rb
require 'async'

Async do
  puts "Running on: #{RbConfig::CONFIG['host_os']}"
  res = system 'echo', 'hello'
  puts "Result 1: #{res}"
  res = system 'does-not-exist', 'hello'
  puts "Result 2: #{res}"
end
cretz@DESKTOP:~/work/temp$ bundle exec ruby --version
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [x86_64-linux]
cretz@DESKTOP:~/work/temp$ bundle exec ruby ./test.rb
Running on: linux
hello
Result 1: true

It hangs right there. I am not doing any monkey patching. This was also hanging for me inside of a GitHub action for Ubuntu and Ruby 3.3 (but not 3.1 I don't think). I will do some more investigation, possibly including a simple repository that replicates the issue with a simple GitHub action, and get back. (may be a bit, I am traveling)

olleolleolle commented 3 weeks ago

I was also able to make it hang on that.

# issue-351.rb
require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'async'
end

Async do
  puts "Running on: #{RbConfig::CONFIG['host_os']}"
  res = system 'echo', 'hello'
  puts "Result 1: #{res}"
  res = system 'does-not-exist', 'hello'
  puts "Result 2: #{res}"
end

My output was: note where I hit Ctrl-C and it got output, and then the first word Result 2:

$ ruby issue-351.rb
Running on: linux
hello
Result 1: true
^CResult 2:
/home/olle/projects/slask/tmp/devbox/gems/gems/async-2.18.0/lib/async/scheduler.rb:317:in `select': Interrupt
        from /home/olle/projects/slask/tmp/devbox/gems/gems/async-2.18.0/lib/async/scheduler.rb:317:in `run_once!'
        from /home/olle/projects/slask/tmp/devbox/gems/gems/async-2.18.0/lib/async/scheduler.rb:353:in `run_once'
        from /home/olle/projects/slask/tmp/devbox/gems/gems/async-2.18.0/lib/async/scheduler.rb:408:in `block in run'
        from /home/olle/projects/slask/tmp/devbox/gems/gems/async-2.18.0/lib/async/scheduler.rb:386:in `block in run_loop'
        from /home/olle/projects/slask/tmp/devbox/gems/gems/async-2.18.0/lib/async/scheduler.rb:383:in `handle_interrupt'
        from /home/olle/projects/slask/tmp/devbox/gems/gems/async-2.18.0/lib/async/scheduler.rb:383:in `run_loop'
        from /home/olle/projects/slask/tmp/devbox/gems/gems/async-2.18.0/lib/async/scheduler.rb:407:in `run'
        from /home/olle/projects/slask/tmp/devbox/gems/gems/async-2.18.0/lib/kernel/async.rb:34:in `Async'
        from issue-351.rb:8:in `<main>'

See Bundler inline docs.

ioquatix commented 3 weeks ago

lol, I cannot reproduce, I tried IO_EVENT_SELECTOR=Select too.

Okay, can you please tell me:

ioquatix commented 3 weeks ago

For me it's:

samuel@aiko ~/D/tmp> uname -a
Linux aiko 6.11.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 01 Nov 2024 03:30:41 +0000 x86_64 GNU/Linux
samuel@aiko ~/D/tmp> ruby -v
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [x86_64-linux]
samuel@aiko ~/D/tmp> $SHELL -v
fish, version 3.7.1
ioquatix commented 3 weeks ago

In Async v2.19, I've introduced more debug output: https://socketry.github.io/async/releases/index#async::scheduler-debugging

Can one of you run the hanging example with CONSOLE_LEVEL=debug environment variable set, and show me the full output?

If it's not too verbose, it might also be useful to see the output generated by RUBYOPT=-d ....

Thanks.

ioquatix commented 2 weeks ago

This appears to be a problem in Ruby process.c which can fork with an active scheduler without cleaning up correctly. I think at best we can fix this in Ruby 3.4 and maybe backport the fix. I'll need to investigate further.

ioquatix commented 2 weeks ago

See https://github.com/ruby/ruby/pull/12052 for a possible fix.

ioquatix commented 5 days ago

Okay, I believe I've found the root cause, a race condition in the process wait code.

ioquatix commented 5 days ago

I've released io-event v1.7.4 which appears to fix this issue. Please let me know if you are still having issues.