Simple example with `puts` hangs depending on whether newlines included

machty commented 2 years ago

Ruby 3.1.0, async 2.0.0

There's a high probably that this isn't actually a bug, but in case it is:

This example works:

Async do |t|
  puts "Root start\n"
  t.async do
    puts "Sleep 1 Start\n"
    sleep 1
    puts "Sleep 1 End\n"
  end
  t.async do
    puts "Sleep 2 Start\n"
    sleep 1
    puts "Sleep 2 End\n"
  end
  puts "Root end\n"
end

and produces the following

Root start
Sleep 1 Start
Sleep 2 Start
Root end
Sleep 1 End
Sleep 2 End

But the following example, which simply removes the newlines, hangs:

Async do |t|
  puts "Root start"
  t.async do
    puts "Sleep 1 Start"
    sleep 1
    puts "Sleep 1 End"
  end
  t.async do
    puts "Sleep 2 Start"
    sleep 1
    puts "Sleep 2 End"
  end
  puts "Root end"
end

Here's the output it produces before it hangs:

Root start
Sleep 1 StartSleep 2 Start
Root end
Sleep 2 End

I'm guessing this has something to do with sharing the stdout IO object between Fibers, and that newlines flush the IO object to put it in a more usable state to pass to another fiber, but why would that cause a hang? Is that expected?

fredngo commented 2 years ago

Also Ruby 3.1.0, async 2.0.0

I can replicate and confirm this problem with the sleepy example from the docs:

def sleepy(duration = 1)
  Async do |task|
    task.sleep duration
    puts "I'm done sleeping, time for action!"
  end
end

# Synchronous operation:
sleepy

# Asynchronous operation:
Async do
  # These two functions will sleep simultaneously.
  sleepy
  sleepy
end

As it is written, the example hangs. If I change line 6 to include a \n:

puts "I'm done sleeping, time for action!\n"

Then the script runs without hanging.

ioquatix commented 2 years ago

Interesting, I'll check it.

This is all very new stuff, so it's chance there is bug. If so, we can try to fix it in 3.1.1.

ioquatix commented 2 years ago

After fixing the debug selector, I get this:

               |   RuntimeError: Cannot wait for #<IO:0x0000000100ae8e80> to become writable from multiple fibers.
               |   → /Users/samuel/.gem/ruby/3.1.0/gems/io-event-1.0.1/lib/io/event/debug/selector.rb:127 in `register_writable'
               |     /Users/samuel/.gem/ruby/3.1.0/gems/io-event-1.0.1/lib/io/event/debug/selector.rb:120 in `register_readable'
               |     /Users/samuel/.gem/ruby/3.1.0/gems/io-event-1.0.1/lib/io/event/debug/selector.rb:83 in `io_wait'
               |     lib/async/scheduler.rb:170 in `io_wait'
               |     ./test.rb:8 in `write'
               |     ./test.rb:8 in `puts'
               |     ./test.rb:8 in `puts'
               |     ./test.rb:8 in `block in sleepy'
               |     lib/async/task.rb:258 in `block in schedule'

It happens because Ruby internally is racing on $stdout. I'll think about how to deal with it.

machty commented 2 years ago

Why wouldn't I be able to trigger the same issue with just fibers? e.g.

def go
  puts "one"
  fiber = Fiber.new do
    puts "two"
    Fiber.yield
    puts "four"
  end
  fiber.resume
  puts "three"
  fiber.resume
  puts "five"
end
go

This produces the following without hanging:

one
two
three
four
five

ioquatix commented 2 years ago

Because there is no multiplexing.

Ruby has an internal IO lock, but it doesn't hold it for the whole operation which seems like a poor choice, because this means puts can interleave and we ultimately have more locking overhead since each chunk output acquires and releases the lock.

machty commented 2 years ago

I'm sorry, what do you mean by multiplexing, and what about Async (which runs on Fibers) would enable multiplexing, whereas the raw Fiber example wouldn't?

ioquatix commented 2 years ago

All IO operations (including puts) are multiplexed when using the fiber scheduler, puts internally is broken up into multiple writes. Between two fibers they can be interleaved, but the way IO#puts works is triggering a race condition in the kqueue backend. Probably this works fine on Linux with io_uring. Whether it should be that way or not is another question (probably it should be atomic).

machty commented 2 years ago

Thank you, I think I have a better sense of what you mean by multiplexing.

This is probably information overload, but it's helping me learn and maybe it'll help debug -- Here are some more simplified examples that demonstrate some different behaviors:

tp = TracePoint.new(:fiber_switch) do
  puts "fiber switched\n"
end
tp.enable

Here's a simple Async task with two newlined puts calls:

Async do |t|
  puts "start\n"
  puts "end\n"
end

This outputs

fiber switched
start
end
fiber switched

The following produces the same output as above.

Async do |t|
  $stdout.write "start\n"
  $stdout.write "end\n"
end

But if I remove the newline from the first puts:

Async do |t|
  puts "start"
  puts "end\n"
end

this outputs

fiber switched
Root startfiber switched
fiber switched

Root end
fiber switched

And here's what happens when I remove the newline from $stdout.write

Async do |t|
  $stdout.write "start"
  $stdout.write "end\n"
end

fiber switched
startend
fiber switched

From the behavior above, and from tracing the Ruby code, if you pass a single newline-terminated string to either #puts or #write, both paths end up calling io_write(io, str, 0) where the last arg is no_sync=0.

But non-newline-terminated newlines, as you said, get broken up. It seems to pass those args to some form of writev.

So I'm mostly curious at this point: why didn't the last example with consecutive writes (where the first one is missing a newline) cause a yield back to the scheduler?

ioquatix commented 2 years ago

It's not good practice to write to the same IO from different fibers/threads generally.

IO#puts "x" internally corresponds to the following operations:

IO.write("x", "\n")

This internally maps to rb_writev_internal with io.c:

https://github.com/ruby/ruby/blob/0ca00e2cb74f9d07d27844d97c29c208caab95a7/io.c#L1180-L1200

However the fiber scheduler doesn't support io vectors at this time... we might add support, or we might not. However, it's totally valid to only write the first iov as a partial write and expect Ruby to retry.

However, in this case, Ruby actually assumes that the IO is not ready for writing, so instead of trying again it calls io_wait. This schedules the IO into the event loop and ultimately kqueue. This is basically unnecessary in this case.

Anyway, to cut a long story short, it maps to the following sequence of operations:

write("x")
io_wait(WRITABLE)
write("\n")

However in the fiber scheduler, io_wait is a switch point so you end up with two fibers doing this:

(A) write("x")
(A) io_wait(WRITABLE)
(B) write("x")
(B) io_wait(WRITABLE) -> overwrites (A)
(B) write("\n")
(A) hangs forever.

Ruby has a write_lock for multi-thread access to synchronous IO, i.e. $stdin, $stdout. But in this case, the locking isn't around the whole operation, only the write operation, so we still break when io_wait occurs the 2nd time. I think we should fix this, personally.

ioquatix commented 2 years ago

The first part of the fix is to move the write_lock allocation out into a function so we can call it in other places: https://github.com/ruby/ruby/commit/0ca00e2cb74f9d07d27844d97c29c208caab95a7

machty commented 2 years ago

I really appreciate your helpful explanations.

B) io_wait(WRITABLE) -> overwrites (A)

What exactly is being "overwritten" here?

ioquatix commented 2 years ago

https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2

 EV_ADD     Adds the event to the kqueue.  Re-adding an existing event
    will modify the parameters of the original event, and not
    result in a duplicate entry.  Adding an event automatically
    enables it, unless overridden by the EV_DISABLE flag.

kqueue is very limited in this regard, it won't even report an error like epoll does. It simply overwrites the user data which contains the fiber pointer. So the previous io_waitt would never be triggered.

ioquatix commented 2 years ago

https://github.com/ruby/ruby/pull/5419

machty commented 2 years ago

How can I use the newly fixed debug selector to debug what I think is a similar issue in my Rails rspec suite?

IO_EVENT_DEBUG_SELECTOR=true be rspec spec/models/name_of_model_spec.rb:91

or something similar?

machty commented 2 years ago

@ioquatix just wanted to share that I just built against your Ruby PR and I still seem to be getting the same behavior:


RUBY_VERSION # => 3.2.0

Async do |t|
  puts "Root start"
  t.async do
    puts "Sleep 1 Start"
    sleep 1
    puts "Sleep 1 End"
  end
  t.async do
    puts "Sleep 2 Start"
    sleep 1
    puts "Sleep 2 End"
  end
  puts "Root end"
end

This outputs the following and hangs

Root start
Sleep 1 StartSleep 2 Start
Root end
Sleep 2 End

ioquatix commented 2 years ago

Can you try the latest version, it's working for me:

Root start
Sleep 1 Start
Root end
Sleep 2 Start
Sleep 1 End
Sleep 2 End

machty commented 2 years ago

@ioquatix When I tested earlier, it was using ruby-install and building the local source on your checked out branch w commit 60af64b25b. It's still behaving the same for me (are you sure you used the exact example in my most recently commit, with newlines omitted?), but now I'm trying to directly make install, following the instructions in hacking.md and I'm just getting stuck on these readline build errors and I'm not sure how to proceed

make[1]: Nothing to be done for `encs'.
compiling ../../../ext/readline/readline.c
../../../ext/readline/readline.c:1182:12: error: use of undeclared identifier 'rl_editing_mode'; did you mean 'rl_vi_editing_mode'?
    return rl_editing_mode == 0 ? Qtrue : Qfalse;
           ^~~~~~~~~~~~~~~
           rl_vi_editing_mode
../../../ext/readline/readline.c:1150:17: note: 'rl_vi_editing_mode' declared here
RUBY_EXTERN int rl_vi_editing_mode(int, int);
                ^
../../../ext/readline/readline.c:1221:12: error: use of undeclared identifier 'rl_editing_mode'; did you mean 'rl_vi_editing_mode'?
    return rl_editing_mode == 1 ? Qtrue : Qfalse;
           ^~~~~~~~~~~~~~~
           rl_vi_editing_mode
../../../ext/readline/readline.c:1150:17: note: 'rl_vi_editing_mode' declared here
RUBY_EXTERN int rl_vi_editing_mode(int, int);

Any pointers on getting a good ruby dev setup for pre-M1 mac macbook pros?

machty commented 2 years ago

@ioquatix OK I got the build working again, and I'm still getting the hanging behavior:

irb(main):035:0> end
Root start
Sleep 1 StartSleep 2 Start
Root end
Sleep 2 End
^C/Users/machty/.rubies/ruby-head/lib/ruby/3.2.0/irb.rb:437:in `raise': abort then interrupt! (IRB::Abort)

ioquatix commented 2 years ago

Thanks okay I'll take a look.

ioquatix commented 2 years ago

Can you tell me ruby --version?

machty commented 2 years ago

Here's me running the whole thing include the ruby --version at the top:

ruby --version
ruby 3.2.0dev (2022-01-09T06:48:23Z :detached: 60af64b25b) [x86_64-darwin20]
(base) code :: irb
irb(main):001:0> require 'async'
=> true
irb(main):002:0> Async do |t|
irb(main):003:0*     puts "Root start"
irb(main):004:0>   t.async do
irb(main):005:1*         puts "Sleep 1 Start"
irb(main):006:1>       sleep 1
irb(main):007:1>       puts "Sleep 1 End"
irb(main):008:1>     end
irb(main):009:0>   t.async do
irb(main):010:1*         puts "Sleep 2 Start"
irb(main):011:1>       sleep 1
irb(main):012:1>       puts "Sleep 2 End"
irb(main):013:1>     end
irb(main):014:0>   puts "Root end"
irb(main):015:0> end
Root start
Sleep 1 StartSleep 2 Start
Root end
Sleep 2 End
^C/Users/machty/.rubies/ruby-head/lib/ruby/3.2.0/irb.rb:437:in `raise': abort then interrupt! (IRB::Abort)
    from /Users/machty/.gem/ruby/3.2.0/gems/async-2.0.0/lib/async/scheduler.rb:224:in `select'
    from /Users/machty/.gem/ruby/3.2.0/gems/async-2.0.0/lib/async/scheduler.rb:224:in `run_once'
    from /Users/machty/.gem/ruby/3.2.0/gems/async-2.0.0/lib/async/scheduler.rb:243:in `run'
    from /Users/machty/.gem/ruby/3.2.0/gems/async-2.0.0/lib/kernel/async.rb:48:in `Async'
    from (irb):2:in `<main>'
    from /Users/machty/.rubies/ruby-head/lib/ruby/gems/3.2.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
    from /Users/machty/.rubies/ruby-head/bin/irb:25:in `load'
    from /Users/machty/.rubies/ruby-head/bin/irb:25:in `<main>'

ioquatix commented 2 years ago

I could reproduce the issue so I'll check it again.

jaredcwhite commented 2 years ago

I think I just got bit by this too with Ruby 3.1 and Async 2.0. Suddenly log statements in different fibers are missing newlines and then it hangs, stuck on the same scheduler.rb:224 line. Fine with downgrading to 1.9 for now, will keep on eye on this. Thanks :)

ioquatix commented 2 years ago

I'm going to see if we can roll this into 3.1.1 but it might be more systematic, just trying to triage the issue right now.

machty commented 2 years ago

I don't know if this is helpful or not, but if the suspect issue is due to race conditions in kqueue or some other io backend, wouldn't I also be able to reproduce that with threads?

Here's piece of code that spins up a bunch of threads that puts a random number:

(1..99).each_with_object([]) {|i, threads| threads << Thread.new {sleep(rand 2) && puts(i)} }.each(&:join)

I remember this used to cause mangled output where some of the puts would end up on the same line, but ever since 2.5.0 (and up to ruby-head), the output is clean. Ruby 2.4.3 has the mangled output I remember. From the 2.5.0 changelog, this patch to start using writev seems like the most likely "fix" for the mangling: https://bugs.ruby-lang.org/issues/9323

Anyway, just sharing in case it knocks some thinking loose.

ioquatix commented 2 years ago

The problem is kqueue can't handle multiple events for the same file descriptor with different user data. I think I'm going to have to rework this to allow it by tracking when this situation occurs and queueing up fibers, because there are going to be too many race conditions which lead to this situation.

Basically:

add_kevent(fd, readable, fiber1)
add_kevent(fd, readable, fiber2)

The 2nd registration clobbers the first. In Async 1.x this was an error, but Async 2 might not be able to be so strict and still get the level of concurrency we want. Also, io_uring handles this case correctly, so on the fast path (modern Linux) this is less of an issue.

emiltin commented 2 years ago

thanks for the great work on async 2.0. i have the same issue here after trying upgrading to 2.0 and ruby 3.1 on os x - newlines missing from output and puts() hanging.

appending "\n" to strings before put()ing them seem to work as a workaround.

machty commented 2 years ago

I believe the root issue might be fixed here https://github.com/socketry/io-event/pull/20

ioquatix commented 2 years ago

I believe this should be addressed by io-event v1.0.2 - can you please test it?

pbstriker38 commented 2 years ago

I'm on io-event v1.0.2 and do not see this issue.

output

Root start
Sleep 1 StartSleep 2 Start

Root end
Sleep 1 EndSleep 2 End

jaredcwhite commented 2 years ago

Unfortunately v1.0.2 of io-event didn't fully resolve the issue for me…it's adding a widely varying number of additional seconds to the build process and the output is strange. The build process with async 1.30.1:

        Generating…
       Prismic API: Connecting...
       Prismic API: Loading blog_post...
       Prismic API: Loading page...
        Pagination: Complete, processed 1 pagination page(s)
            Done! 🎉 Completed in less than 0.48 seconds.

and with async 2.0.0 + io-event 1.0.2:

        Generating…
       Prismic API: Connecting...
       Prismic API: Loading blog_post...       Prismic API: Loading page...

        Pagination: Complete, processed 1 pagination page(s)
            Done! 🎉 Completed in less than 17.89 seconds.

This is when using ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [arm64-darwin21].

machty commented 2 years ago

@jaredcwhite is the hanging issue fixed at least?

machty commented 2 years ago

@ioquatix Did some poking around on your Ruby PR and Ruby io in general, maybe this helps:

When I boot irb, $stdout.sync == true, i.e. $stdout is initialized for synchronous IO (immediately flush to OS after write).

If I set $stdout.sync = false before proceeding with the non-newline example we've been using in this issue, I get clean output:

load 'scratch/async/async_test_case2.rb'
Sleep 1 StartSleep 2 Start

Root end
Sleep 1 EndSleep 2 End

=> true
irb(main):002:0> $stdout.sync = false
=> false
irb(main):003:0> load 'scratch/async/async_test_case2.rb'
Root start
Sleep 1 Start
Root end
Sleep 2 Start
Sleep 1 End
Sleep 2 End
=> true
irb(main):004:0> $stdout.sync = true # set sync back to true; not expecting this to change the output back, but rather just confirming the comment here https://github.com/ioquatix/ruby/blob/d4a39523c8dab50768f33e1719523f3fc2a4fbe4/include/ruby/io.h#L254
=> true
irb(main):005:0> load 'scratch/async/async_test_case2.rb'
Root start
Sleep 1 Start
Root end
Sleep 2 Start
Sleep 1 End
Sleep 2 End
=> true

I believe the root cause is that when $stdout.sync = true (the default), the write_lock mutex never gets initialized due to this logic: https://github.com/ioquatix/ruby/blob/d4a39523c8dab50768f33e1719523f3fc2a4fbe4/io.c#L1692

Seems like write_lock is coupled to the presence of a write buffer; maybe with your PR the write_lock can be initialized regardless?

machty commented 2 years ago

I implemented the change I described above here and it fixes the issue (it produces the clean non-mangled output [and doesn't hang]) https://github.com/ruby/ruby/commit/756b25466fda0c829498853cbd13c8ac97288bab

Output

Root start
Sleep 1 Start
Root end
Sleep 2 Start
Sleep 1 End
Sleep 2 End

ioquatix commented 2 years ago

This should be completely fixed on Ruby head and in Ruby 3.2.0 when it is released.

ioquatix commented 2 years ago

cc @mame FYI.

socketry / async

Simple example with `puts` hangs depending on whether newlines included #137