Closed machty closed 2 years ago
Also Ruby 3.1.0, async 2.0.0
I can replicate and confirm this problem with the sleepy
example from the docs:
def sleepy(duration = 1)
  Async do |task|
    task.sleep duration
    puts "I'm done sleeping, time for action!"
  end
end
# Synchronous operation:
sleepy
# Asynchronous operation:
Async do
  # These two functions will sleep simultaneously.
  sleepy
  sleepy
end
As it is written, the example hangs. If I change the puts line inside sleepy to end with a \n:
puts "I'm done sleeping, time for action!\n"
Then the script runs without hanging.
Interesting, I'll check it.
This is all very new stuff, so there's a chance there is a bug. If so, we can try to fix it in 3.1.1.
After fixing the debug selector, I get this:
| RuntimeError: Cannot wait for #<IO:0x0000000100ae8e80> to become writable from multiple fibers.
| → /Users/samuel/.gem/ruby/3.1.0/gems/io-event-1.0.1/lib/io/event/debug/selector.rb:127 in `register_writable'
| /Users/samuel/.gem/ruby/3.1.0/gems/io-event-1.0.1/lib/io/event/debug/selector.rb:120 in `register_readable'
| /Users/samuel/.gem/ruby/3.1.0/gems/io-event-1.0.1/lib/io/event/debug/selector.rb:83 in `io_wait'
| lib/async/scheduler.rb:170 in `io_wait'
| ./test.rb:8 in `write'
| ./test.rb:8 in `puts'
| ./test.rb:8 in `puts'
| ./test.rb:8 in `block in sleepy'
| lib/async/task.rb:258 in `block in schedule'
It happens because Ruby internally is racing on $stdout. I'll think about how to deal with it.
Why wouldn't I be able to trigger the same issue with just fibers? e.g.
def go
  puts "one"
  fiber = Fiber.new do
    puts "two"
    Fiber.yield
    puts "four"
  end
  fiber.resume
  puts "three"
  fiber.resume
  puts "five"
end
go
This produces the following without hanging:
one
two
three
four
five
Because there is no multiplexing.
Ruby has an internal IO lock, but it doesn't hold it for the whole operation, which seems like a poor choice: it means puts can interleave, and we ultimately have more locking overhead since each chunk of output acquires and releases the lock.
I'm sorry, what do you mean by multiplexing, and what about Async (which runs on Fibers) would enable multiplexing, whereas the raw Fiber example wouldn't?
All IO operations (including puts) are multiplexed when using the fiber scheduler, and puts is internally broken up into multiple writes. Between two fibers they can be interleaved, but the way IO#puts works triggers a race condition in the kqueue backend. This probably works fine on Linux with io_uring. Whether it should be that way or not is another question (probably it should be atomic).
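To see where the split comes from without involving a scheduler, here is a minimal sketch. RecordingIO is a hypothetical name used only for illustration, and the sketch assumes Ruby 2.5 or later, where puts hands the string and the line terminator to write as separate arguments. It shows the Ruby-level arguments only, not the underlying system calls.

```ruby
require "stringio"

# Hypothetical helper for illustration: a StringIO that records the
# arguments of every #write call it receives.
class RecordingIO < StringIO
  attr_reader :calls

  def initialize
    super
    @calls = []
  end

  def write(*chunks)
    @calls << chunks
    super
  end
end

io = RecordingIO.new
io.puts "start"    # no trailing newline: the string and "\n" arrive as two chunks
io.puts "end\n"    # already newline-terminated: a single chunk

p io.calls
```

The first call is recorded with two chunks and the second with one, which lines up with the observation that only strings lacking a trailing newline get broken up.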
Thank you, I think I have a better sense of what you mean by multiplexing.
This is probably information overload, but it's helping me learn and maybe it'll help debug -- Here are some more simplified examples that demonstrate some different behaviors:
tp = TracePoint.new(:fiber_switch) do
  puts "fiber switched\n"
end
tp.enable
Here's a simple Async task with two newlined puts calls:
Async do |t|
  puts "start\n"
  puts "end\n"
end
This outputs
fiber switched
start
end
fiber switched
The following produces the same output as above.
Async do |t|
  $stdout.write "start\n"
  $stdout.write "end\n"
end
But if I remove the newline from the first puts:
Async do |t|
  puts "start"
  puts "end\n"
end
this outputs
fiber switched
Root startfiber switched
fiber switched
Root end
fiber switched
And here's what happens when I remove the newline from $stdout.write
Async do |t|
  $stdout.write "start"
  $stdout.write "end\n"
end
fiber switched
startend
fiber switched
From the behavior above, and from tracing the Ruby code, if you pass a single newline-terminated string to either #puts or #write, both paths end up calling io_write(io, str, 0), where the last arg is no_sync=0.
But non-newline-terminated strings, as you said, get broken up. It seems to pass those args to some form of writev.
So I'm mostly curious at this point: why didn't the last example with consecutive writes (where the first one is missing a newline) cause a yield back to the scheduler?
It's not good practice to write to the same IO from different fibers/threads generally.
IO#puts "x" internally corresponds to the following operations:
IO.write("x", "\n")
This internally maps to rb_writev_internal within io.c:
https://github.com/ruby/ruby/blob/0ca00e2cb74f9d07d27844d97c29c208caab95a7/io.c#L1180-L1200
However the fiber scheduler doesn't support io vectors at this time... we might add support, or we might not. However, it's totally valid to only write the first iov as a partial write and expect Ruby to retry.
However, in this case, Ruby actually assumes that the IO is not ready for writing, so instead of trying again it calls io_wait. This schedules the IO into the event loop and ultimately kqueue. This is basically unnecessary in this case.
Anyway, to cut a long story short, it maps to the following sequence of operations:
write("x")
io_wait(WRITABLE)
write("\n")
However, in the fiber scheduler, io_wait is a switch point, so you end up with two fibers doing this:
(A) write("x")
(A) io_wait(WRITABLE)
(B) write("x")
(B) io_wait(WRITABLE) -> overwrites (A)
(B) write("\n")
(A) hangs forever.
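The sequence above can be replayed with plain fibers. This is a toy model only: the waiting hash stands in for a selector that keeps a single waiter per file descriptor, and the names waiting, fibers and log are made up for illustration. None of it is io-event's real code.

```ruby
# Toy model: one waiter slot per file descriptor, so a second
# registration for the same fd replaces the first.
waiting = {}   # fd => name of the fiber waiting for it to become writable
fibers = {}
log = []

%w[A B].each do |name|
  fibers[name] = Fiber.new do
    log << "#{name} write(x)"
    waiting[1] = name          # io_wait(WRITABLE): register, then switch away
    Fiber.yield
    log << "#{name} write(newline)"
  end
end

fibers["A"].resume             # A registers for fd 1 and yields
fibers["B"].resume             # B registers too, overwriting A's slot

# "Event loop": fd 1 is writable, so wake whoever is registered for it.
ready = waiting.delete(1)
fibers[ready].resume

puts log
puts "A still suspended: #{fibers['A'].alive?}"
```

Only B is ever woken in this model; A stays suspended with nothing left that could resume it, which is the hang.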
Ruby has a write_lock for multi-thread access to synchronous IO, i.e. $stdin, $stdout. But in this case, the locking isn't around the whole operation, only the write operation, so we still break when io_wait occurs the 2nd time. I think we should fix this, personally.
The first part of the fix is to move the write_lock allocation out into a function so we can call it in other places: https://github.com/ruby/ruby/commit/0ca00e2cb74f9d07d27844d97c29c208caab95a7
I really appreciate your helpful explanations.
(B) io_wait(WRITABLE) -> overwrites (A)
What exactly is being "overwritten" here?
https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2
EV_ADD Adds the event to the kqueue. Re-adding an existing event will modify the parameters of the original event, and not result in a duplicate entry. Adding an event automatically enables it, unless overridden by the EV_DISABLE flag.
kqueue is very limited in this regard; it won't even report an error like epoll does. It simply overwrites the user data, which contains the fiber pointer. So the previous io_wait would never be triggered.
How can I use the newly fixed debug selector to debug what I think is a similar issue in my Rails rspec suite?
IO_EVENT_DEBUG_SELECTOR=true be rspec spec/models/name_of_model_spec.rb:91
or something similar?
@ioquatix just wanted to share that I just built against your Ruby PR and I still seem to be getting the same behavior:
RUBY_VERSION # => 3.2.0
Async do |t|
  puts "Root start"
  t.async do
    puts "Sleep 1 Start"
    sleep 1
    puts "Sleep 1 End"
  end
  t.async do
    puts "Sleep 2 Start"
    sleep 1
    puts "Sleep 2 End"
  end
  puts "Root end"
end
This outputs the following and hangs
Root start
Sleep 1 StartSleep 2 Start
Root end
Sleep 2 End
Can you try the latest version, it's working for me:
Root start
Sleep 1 Start
Root end
Sleep 2 Start
Sleep 1 End
Sleep 2 End
@ioquatix When I tested earlier, I was using ruby-install and building the local source on your checked-out branch with commit 60af64b25b. It's still behaving the same for me (are you sure you used the exact example in my most recent comment, with newlines omitted?), but now I'm trying to directly make install, following the instructions in hacking.md, and I'm just getting stuck on these readline build errors and I'm not sure how to proceed:
make[1]: Nothing to be done for `encs'.
compiling ../../../ext/readline/readline.c
../../../ext/readline/readline.c:1182:12: error: use of undeclared identifier 'rl_editing_mode'; did you mean 'rl_vi_editing_mode'?
return rl_editing_mode == 0 ? Qtrue : Qfalse;
^~~~~~~~~~~~~~~
rl_vi_editing_mode
../../../ext/readline/readline.c:1150:17: note: 'rl_vi_editing_mode' declared here
RUBY_EXTERN int rl_vi_editing_mode(int, int);
^
../../../ext/readline/readline.c:1221:12: error: use of undeclared identifier 'rl_editing_mode'; did you mean 'rl_vi_editing_mode'?
return rl_editing_mode == 1 ? Qtrue : Qfalse;
^~~~~~~~~~~~~~~
rl_vi_editing_mode
../../../ext/readline/readline.c:1150:17: note: 'rl_vi_editing_mode' declared here
RUBY_EXTERN int rl_vi_editing_mode(int, int);
Any pointers on getting a good Ruby dev setup for pre-M1 MacBook Pros?
@ioquatix OK I got the build working again, and I'm still getting the hanging behavior:
irb(main):035:0> end
Root start
Sleep 1 StartSleep 2 Start
Root end
Sleep 2 End
^C/Users/machty/.rubies/ruby-head/lib/ruby/3.2.0/irb.rb:437:in `raise': abort then interrupt! (IRB::Abort)
Thanks okay I'll take a look.
Can you tell me ruby --version?
Here's me running the whole thing, including the ruby --version at the top:
ruby --version
ruby 3.2.0dev (2022-01-09T06:48:23Z :detached: 60af64b25b) [x86_64-darwin20]
(base) code :: irb
irb(main):001:0> require 'async'
=> true
irb(main):002:0> Async do |t|
irb(main):003:0* puts "Root start"
irb(main):004:0> t.async do
irb(main):005:1* puts "Sleep 1 Start"
irb(main):006:1> sleep 1
irb(main):007:1> puts "Sleep 1 End"
irb(main):008:1> end
irb(main):009:0> t.async do
irb(main):010:1* puts "Sleep 2 Start"
irb(main):011:1> sleep 1
irb(main):012:1> puts "Sleep 2 End"
irb(main):013:1> end
irb(main):014:0> puts "Root end"
irb(main):015:0> end
Root start
Sleep 1 StartSleep 2 Start
Root end
Sleep 2 End
^C/Users/machty/.rubies/ruby-head/lib/ruby/3.2.0/irb.rb:437:in `raise': abort then interrupt! (IRB::Abort)
from /Users/machty/.gem/ruby/3.2.0/gems/async-2.0.0/lib/async/scheduler.rb:224:in `select'
from /Users/machty/.gem/ruby/3.2.0/gems/async-2.0.0/lib/async/scheduler.rb:224:in `run_once'
from /Users/machty/.gem/ruby/3.2.0/gems/async-2.0.0/lib/async/scheduler.rb:243:in `run'
from /Users/machty/.gem/ruby/3.2.0/gems/async-2.0.0/lib/kernel/async.rb:48:in `Async'
from (irb):2:in `<main>'
from /Users/machty/.rubies/ruby-head/lib/ruby/gems/3.2.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
from /Users/machty/.rubies/ruby-head/bin/irb:25:in `load'
from /Users/machty/.rubies/ruby-head/bin/irb:25:in `<main>'
I could reproduce the issue so I'll check it again.
I think I just got bit by this too with Ruby 3.1 and Async 2.0. Suddenly log statements in different fibers are missing newlines and then it hangs, stuck on the same scheduler.rb:224 line. Fine with downgrading to 1.9 for now, will keep an eye on this. Thanks :)
I'm going to see if we can roll this into 3.1.1 but it might be more systematic, just trying to triage the issue right now.
I don't know if this is helpful or not, but if the suspect issue is due to race conditions in kqueue or some other io backend, wouldn't I also be able to reproduce that with threads?
Here's a piece of code that spins up a bunch of threads, each of which sleeps for a random time and then puts its number:
(1..99).each_with_object([]) {|i, threads| threads << Thread.new {sleep(rand 2) && puts(i)} }.each(&:join)
I remember this used to cause mangled output where some of the puts would end up on the same line, but ever since 2.5.0 (and up to ruby-head), the output is clean. Ruby 2.4.3 has the mangled output I remember. From the 2.5.0 changelog, this patch to start using writev seems like the most likely "fix" for the mangling: https://bugs.ruby-lang.org/issues/9323
Anyway, just sharing in case it knocks some thinking loose.
The problem is kqueue can't handle multiple events for the same file descriptor with different user data. I think I'm going to have to rework this to allow it by tracking when this situation occurs and queueing up fibers, because there are going to be too many race conditions which lead to this situation.
Basically:
add_kevent(fd, readable, fiber1)
add_kevent(fd, readable, fiber2)
The 2nd registration clobbers the first. In Async 1.x this was an error, but Async 2 might not be able to be so strict and still get the level of concurrency we want. Also, io_uring handles this case correctly, so on the fast path (modern Linux) this is less of an issue.
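As a rough illustration of the "queue up fibers" direction, here is a toy model that keeps a list of waiters per file descriptor instead of a single slot. The names waiting, fibers and log are hypothetical, and this is not how io-event actually implements it; waking every queued waiter at once is also just an assumption made to keep the sketch short.

```ruby
# Toy model: a queue of waiters per file descriptor, so a second
# registration is appended instead of replacing the first.
waiting = Hash.new { |hash, fd| hash[fd] = [] }
fibers = {}
log = []

%w[A B].each do |name|
  fibers[name] = Fiber.new do
    log << "#{name} write(x)"
    waiting[1] << name         # io_wait(WRITABLE): enqueue, then switch away
    Fiber.yield
    log << "#{name} write(newline)"
  end
end

fibers.each_value(&:resume)    # both fibers write their first chunk and wait

# fd 1 becomes writable: wake every queued waiter in registration order.
waiting.delete(1).each { |name| fibers[name].resume }

puts log
```

Both fibers finish in this model, but their chunks still interleave (x, x, newline, newline). Queueing removes the lost wakeup; it does not by itself make puts atomic.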
thanks for the great work on async 2.0. i have the same issue here after trying upgrading to 2.0 and ruby 3.1 on os x - newlines missing from output and puts() hanging.
appending "\n" to strings before put()ing them seem to work as a workaround.
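A minimal sketch of that workaround as a helper, assuming you control the call sites. puts_in_one_write is a hypothetical name, not part of Ruby or async; it simply makes sure the newline is already part of the string so that write receives a single chunk.

```ruby
require "stringio"

# Hypothetical helper (not part of Ruby or async): emit the message and
# its newline as one string so #write receives a single chunk.
def puts_in_one_write(io, message)
  line = message.to_s
  line += "\n" unless line.end_with?("\n")
  io.write(line)
end

out = StringIO.new
puts_in_one_write(out, "Sleep 1 Start")
puts_in_one_write(out, "Sleep 2 Start\n")   # an existing newline is not doubled
print out.string
```

In real use the first argument would be $stdout or a logger's IO; a StringIO is used here only so the result can be inspected.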
I believe the root issue might be fixed here https://github.com/socketry/io-event/pull/20
I believe this should be addressed by io-event v1.0.2 - can you please test it?
I'm on io-event v1.0.2 and do not see this issue.
output
Root start
Sleep 1 StartSleep 2 Start
Root end
Sleep 1 EndSleep 2 End
Unfortunately v1.0.2 of io-event didn't fully resolve the issue for me…it's adding a widely varying number of additional seconds to the build process and the output is strange. The build process with async 1.30.1:
Generating…
Prismic API: Connecting...
Prismic API: Loading blog_post...
Prismic API: Loading page...
Pagination: Complete, processed 1 pagination page(s)
Done! 🎉 Completed in less than 0.48 seconds.
and with async 2.0.0 + io-event 1.0.2:
Generating…
Prismic API: Connecting...
Prismic API: Loading blog_post... Prismic API: Loading page...
Pagination: Complete, processed 1 pagination page(s)
Done! 🎉 Completed in less than 17.89 seconds.
This is when using ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [arm64-darwin21].
@jaredcwhite is the hanging issue fixed at least?
@ioquatix Did some poking around on your Ruby PR and Ruby io in general, maybe this helps:
When I boot irb, $stdout.sync == true, i.e. $stdout is initialized for synchronous IO (immediately flush to OS after write).
If I set $stdout.sync = false before proceeding with the non-newline example we've been using in this issue, I get clean output:
load 'scratch/async/async_test_case2.rb'
Sleep 1 StartSleep 2 Start
Root end
Sleep 1 EndSleep 2 End
=> true
irb(main):002:0> $stdout.sync = false
=> false
irb(main):003:0> load 'scratch/async/async_test_case2.rb'
Root start
Sleep 1 Start
Root end
Sleep 2 Start
Sleep 1 End
Sleep 2 End
=> true
irb(main):004:0> $stdout.sync = true # set sync back to true; not expecting this to change the output back, but rather just confirming the comment here https://github.com/ioquatix/ruby/blob/d4a39523c8dab50768f33e1719523f3fc2a4fbe4/include/ruby/io.h#L254
=> true
irb(main):005:0> load 'scratch/async/async_test_case2.rb'
Root start
Sleep 1 Start
Root end
Sleep 2 Start
Sleep 1 End
Sleep 2 End
=> true
I believe the root cause is that when $stdout.sync = true (the default), the write_lock mutex never gets initialized due to this logic: https://github.com/ioquatix/ruby/blob/d4a39523c8dab50768f33e1719523f3fc2a4fbe4/io.c#L1692
Seems like write_lock is coupled to the presence of a write buffer; maybe with your PR the write_lock can be initialized regardless?
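For anyone who wants to repeat the sync experiment from a script rather than from irb, here is a small sketch. with_buffered_stdout is a hypothetical helper name; it only toggles $stdout.sync around a block and restores it afterwards. It mirrors the experiment above and is not a recommended fix, and whether it avoids the hang on a given Ruby and io-event version is not something this sketch can promise.

```ruby
# Hypothetical helper: run a block with $stdout.sync turned off, then
# flush anything still buffered and restore the previous setting.
def with_buffered_stdout
  previous = $stdout.sync
  $stdout.sync = false
  yield
ensure
  $stdout.flush
  $stdout.sync = previous
end

with_buffered_stdout do
  puts "Root start"
  puts "Root end"
end
```

The Async example from this issue would go inside the block in place of the two puts calls.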
I implemented the change I described above here and it fixes the issue (it produces the clean non-mangled output [and doesn't hang]) https://github.com/ruby/ruby/commit/756b25466fda0c829498853cbd13c8ac97288bab
Output
Root start
Sleep 1 Start
Root end
Sleep 2 Start
Sleep 1 End
Sleep 2 End
This should be completely fixed on Ruby head and in Ruby 3.2.0 when it is released.
cc @mame FYI.
Ruby 3.1.0, async 2.0.0
There's a high probability that this isn't actually a bug, but in case it is:
This example works:
and produces the following
But the following example, which simply removes the newlines, hangs:
Here's the output it produces before it hangs:
I'm guessing this has something to do with sharing the stdout IO object between Fibers, and that newlines flush the IO object to put it in a more usable state to pass to another fiber, but why would that cause a hang? Is that expected?