Closed jsaak closed 1 year ago
Hmm, weird, I'll take a look.
It might be that the signal got swallowed some how:
(lldb) thread backtrace all
* thread #1, name = 'ruby', stop reason = signal SIGSTOP
* frame #0: 0x00007f63e66b69f7 liburing.so.2`___lldb_unnamed_symbol64 + 295
frame #1: 0x00007f63e66b6b5d liburing.so.2`__io_uring_get_cqe + 61
frame #2: 0x00007f63e6799053 IO_Event.so`select_internal(_arguments=0x00007ffee9926730) at uring.c:715:22
frame #3: 0x00005620d68e17a3 ruby`rb_nogvl at thread.c:1548:5
frame #4: 0x00007f63e6799445 IO_Event.so`IO_Event_Selector_URing_select [inlined] select_internal_without_gvl(arguments=0x00007ffee9926730) at uring.c:725:2
frame #5: 0x00007f63e6799410 IO_Event.so`IO_Event_Selector_URing_select(self=<unavailable>, duration=<unavailable>) at uring.c:801:13
frame #6: 0x00005620d691fdd7 ruby`vm_call_cfunc_with_frame at vm_insnhelper.c:3268:12
frame #7: 0x00005620d693c6f4 ruby`vm_exec_core at vm_insnhelper.c:5080:15
frame #8: 0x00005620d692cd5c ruby`rb_vm_exec at vm.c:2383:22
frame #9: 0x00005620d693457f ruby`rb_vm_call0(ec=<unavailable>, recv=<unavailable>, id=<unavailable>, argc=<unavailable>, argv=<unavailable>, cme=<unavailable>, kw_splat=0) at vm_eval.c:58:12
frame #10: 0x00005620d69351bd ruby`rb_check_funcall_default_kw [inlined] rb_vm_call_kw(kw_splat=0, me=<unavailable>, argv=0x0000000000000000, argc=0, id=26529, recv=140067045688080, ec=0x00005620d70180c0) at vm_eval.c:303:12
frame #11: 0x00005620d69351a0 ruby`rb_check_funcall_default_kw(recv=140067045688080, mid=26529, argc=0, argv=0x0000000000000000, def=36, kw_splat=0) at vm_eval.c:691:12
frame #12: 0x00005620d68936e3 ruby`rb_fiber_scheduler_set [inlined] rb_fiber_scheduler_close(scheduler=140067045688080) at scheduler.c:236:14
frame #13: 0x00005620d68936d0 ruby`rb_fiber_scheduler_set(scheduler=4) at scheduler.c:181:9
frame #14: 0x00005620d6735f66 ruby`rb_ec_fiber_scheduler_finalize(ec=0x00005620d70180c0) at eval.c:140:9
frame #15: 0x00005620d67362a8 ruby`rb_ec_teardown(ec=0x00005620d70180c0) at eval.c:152:5
frame #16: 0x00005620d673652b ruby`rb_ec_cleanup(ec=0x00005620d70180c0, ex=<unavailable>) at eval.c:212:9
frame #17: 0x00005620d6736e22 ruby`ruby_run_node(n=0x00007f63eb59cc10) at eval.c:330:12
frame #18: 0x00005620d67318c2 ruby`main [inlined] rb_main(argv=0x00007ffee9926f78, argc=2) at main.c:38:12
frame #19: 0x00005620d67318a1 ruby`main(argc=<unavailable>, argv=<unavailable>) at main.c:57:12
frame #20: 0x00007f63eb1cc790 libc.so.6`___lldb_unnamed_symbol3140 + 128
frame #21: 0x00007f63eb1cc84a libc.so.6`__libc_start_main + 138
frame #22: 0x00005620d6731905 ruby`_start + 37
Actually, after I pressed Ctrl-C SIGINT
, I got a backtrace:
waiting for process to die
^C/home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/scheduler.rb:221:in `select': Interrupt
from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/scheduler.rb:221:in `run_once'
from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/scheduler.rb:47:in `close'
from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/scheduler.rb:40:in `ensure in scheduler_close'
from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/scheduler.rb:40:in `scheduler_close'
/home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/node.rb:108:in `transient?': SIGHUP (SignalException)
from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/node.rb:45:in `removed'
from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/list.rb:132:in `remove!'
from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/list.rb:121:in `remove'
from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/node.rb:173:in `remove_child'
from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/node.rb:188:in `consume'
from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/task.rb:272:in `finish!'
from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/task.rb:259:in `block in schedule'
./loop.rb:24:in `waitpid': Interrupt
from ./loop.rb:24:in `block in <main>'
from ./loop.rb:15:in `loop'
from ./loop.rb:15:in `<main>'
It seems to imply something is wrong with node clean up...???
Yeah, looks like it got stuck in the new terminate loop code:
I added some extra debugging in that loop, something odd is happening:
Waiting for tasks to terminate...
#<Async::Scheduler:0x000000000000003c 1 children (running)>
#<Async::Task:0x0000000000000050 (stopped)>
#<Async::Task:0x0000000000000064 (stopped)>
Okay, I think I found the problem.
This was released in v2.5.0, do you mind testing it?
It does not fix the issue unfortunately: (using 2.5.0)
you can test it too, just clone my repo: https://github.com/jsaak/ruby3-tcp-server-mini-benchmark.git
ruby3-tcp-server-mini-benchmark$ ruby loop.rb async-scheduler.rb
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
^C/home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/scheduler.rb:221:in `select': Interruptloop.rb:24:in `waitpid': Interrupt
from loop.rb:24:in `block in <main>'
from loop.rb:15:in `loop'
from loop.rb:15:in `<main>'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/scheduler.rb:221:in `run_once'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/scheduler.rb:47:in `close'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/scheduler.rb:40:in `ensure in scheduler_close'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/scheduler.rb:40:in `scheduler_close'
/home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/node.rb:108:in `transient?': SIGHUP (SignalException)
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/node.rb:45:in `removed'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/list.rb:132:in `remove!'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/list.rb:121:in `remove'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/node.rb:173:in `remove_child'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/node.rb:188:in `consume'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/task.rb:259:in `finish!'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/task.rb:320:in `block in schedule'
Okay, thanks for testing it. I also tested it and could not reproduce it after the state handling changes. Let me check again.
Do you mind testing https://github.com/socketry/async/pull/223 branch?
I do not know how to run it without a published package.
Cloning the repo and switching branches is ok. And then what?
you can write
gem "async", path: "path/to/async"
in your gem file.
For me it is still doing it. I have no idea why. here is gdb output, maybe it helps you:
(gdb) bt
#0 0x00007f9169decdd6 in epoll_wait (epfd=5, events=events@entry=0x7ffe16ba093c, maxevents=maxevents@entry=64, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007f916a12702e in select_internal (_arguments=0x7ffe16ba0930) at ./io/event/selector/epoll.c:604
#2 0x0000556359f67d32 in rb_nogvl (func=0x7f916a127010 <select_internal>, data1=<optimized out>, ubf=<optimized out>, data2=<optimized out>, flags=<optimized out>) at thread.c:1548
#3 0x00007f916a127207 in select_internal_without_gvl (arguments=0x7ffe16ba0930) at ./io/event/selector/epoll.c:612
#4 IO_Event_Selector_EPoll_select (self=<optimized out>, duration=4) at ./io/event/selector/epoll.c:661
#5 0x0000556359fa450c in vm_call_cfunc_with_frame (ec=0x55635a4bf020, reg_cfp=0x7f9169733e50, calling=<optimized out>) at vm_insnhelper.c:3268
#6 0x0000556359fc1049 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_callinfo.h:367
#7 vm_exec_core (ec=0x55635a4bf020, initial=140729279711548) at insns.def:820
#8 0x0000556359fb11af in rb_vm_exec (ec=0x55635a4bf020, jit_enable_p=true) at vm.c:2374
#9 0x0000556359fc64cb in invoke_block (captured=<optimized out>, captured=<optimized out>, opt_pc=<optimized out>, type=<optimized out>, cref=0x0, self=140262439803880, iseq=0x7f9164fe9500, ec=0x55635a4bf020) at vm.c:1398
#10 invoke_iseq_block_from_c (me=0x0, is_lambda=<optimized out>, cref=0x0, passed_block_handler=0, kw_splat=0, argv=0x7ffe16ba0f68, argc=1, self=140262439803880, captured=<optimized out>, ec=0x55635a4bf020) at vm.c:1454
#11 invoke_block_from_c_bh (force_blockarg=<optimized out>, is_lambda=<optimized out>, cref=<optimized out>, passed_block_handler=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, block_handler=<optimized out>, ec=<optimized out>)
at vm.c:1472
#12 vm_yield_with_cref (is_lambda=0, cref=0x0, kw_splat=0, argv=0x7ffe16ba0f68, argc=1, ec=0x55635a4bf020) at vm.c:1509
#13 vm_yield (kw_splat=0, argv=0x7ffe16ba0f68, argc=1, ec=0x55635a4bf020) at vm.c:1517
#14 rb_yield_0 (argv=0x7ffe16ba0f68, argc=1) at vm_eval.c:1348
#15 rb_yield (val=<optimized out>, val@entry=4) at vm_eval.c:1364
#16 0x0000556359f62956 in rb_thread_s_handle_interrupt (self=<optimized out>, mask_arg=<optimized out>) at thread.c:2143
#17 0x0000556359fa450c in vm_call_cfunc_with_frame (ec=0x55635a4bf020, reg_cfp=0x7f9169733f10, calling=<optimized out>) at vm_insnhelper.c:3268
#18 0x0000556359fa913e in vm_sendish (ec=0x55635a4bf020, reg_cfp=0x7f9169733f10, cd=0x55635a60f350, block_handler=<optimized out>, method_explorer=mexp_search_method) at vm_callinfo.h:367
#19 0x0000556359fc3054 in vm_exec_core (ec=0x55635a4bf020, initial=140729279711548) at insns.def:801
#20 0x0000556359fb1985 in rb_vm_exec (ec=0x55635a4bf020, jit_enable_p=true) at vm.c:2383
#21 0x0000556359db9978 in rb_ec_exec_node (ec=ec@entry=0x55635a4bf020, n=n@entry=0x7f9164f8a348) at eval.c:289
#22 0x0000556359dbe959 in ruby_run_node (n=0x7f9164f8a348) at eval.c:330
#23 0x0000556359db9713 in rb_main (argv=0x7ffe16ba15d8, argc=2) at ./main.c:38
#24 main (argc=<optimized out>, argv=<optimized out>) at ./main.c:57
Since i am not using bundler Gemfile did not work. This is what i have done. Maybe it is wrong?
I will test it again, thanks for more data.
If you can not reproduce, maybe i can make an ssh account to my computer for you.
I could reproduce it the first time, but after my changes it did not happen again even after running for an hour. But I have an idea, I'll add some code to make interrupts more common in the code path which is triggering the bug, and I hope with that idea we can make a test/regression for it.
It is happening more rarely for me too since 2.5.0.
Do you mind checking then latest 2.6.1 release, the terminate method is more robust now, I couldn't reproduce the issue, but will continue trying.
I will.
I have been playing around with the benchmark script and for some reason, at least on macOS, it's not very consistent (Requests/s can vary pretty wildly). I don't know why. Anyway, I left it running for an hour and it did not fail.
I retried 2.5.1 on Linux and could reproduce the failure within < 20 iterations of loop.rb
. I tried 2.6.1 in the same environment and could not reproduce the failure even after 100+ iterations. I'll leave it running over night to see if it shows up.
loop.rb async.rb
works, but the bug is still present when testing the scheduler only loop.rb async-scheduler.rb
:
At first run, it hanged at second waitpid At second run, it hanged after about 30. At third run, it was about 20.
ruby process is at 100% cpu, when it happens
I hope it will make sense to you.
$ ruby -v
ruby 3.2.1 (2023-02-08 revision 31819e82c8) [x86_64-linux]
$ gem list | grep async
async (2.6.1)
async-io (1.34.3)
ruby3-tcp-server-mini-benchmark$ ruby loop.rb async-scheduler.rb
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
^Cloop.rb:24:in `waitpid': Interrupt
from loop.rb:24:in `block in <main>'
from loop.rb:15:in `loop'
from loop.rb:15:in `<main>'
/home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/list.rb:250:in `initialize': Interrupt
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/list.rb:298:in `new'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/list.rb:298:in `each'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/list.rb:176:in `each'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/node.rb:240:in `terminate'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/scheduler.rb:52:in `close'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/scheduler.rb:46:in `ensure in scheduler_close'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/scheduler.rb:46:in `scheduler_close'
/home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/node.rb:110:in `transient?': SIGHUP (SignalException)
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/node.rb:47:in `removed'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/list.rb:132:in `remove!'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/list.rb:121:in `remove'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/node.rb:182:in `remove_child'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/node.rb:197:in `consume'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/task.rb:287:in `finish!'
from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/task.rb:360:in `block in schedule'
Thanks that is helpful.
I also found a bug in loop.rb
, I think you need to call io.close
on the popen(wrk)
. Otherwise the process may never be cleaned up. However it appears there is still an issue cleaning up in async scheduler_close
after receiving sighup.
I've been messing around on my own branch: https://github.com/ioquatix/ruby3-tcp-server-mini-benchmark/commits/master just FYI.
Do you mind testing the latest v2.6.2 release? It explicitly ignores sighup until specific points in the event loop.
How to reproduce the error:
And here it hangs forever (ruby 3.2.1, async 2.4.2, Linux 5.10.0-19-amd64 #1 SMP Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux)