socketry / async

An awesome asynchronous event-driven reactor for Ruby.
MIT License
2.05k stars 85 forks source link

Sometimes does not exit on HUP signal #220

Closed jsaak closed 1 year ago

jsaak commented 1 year ago

How to reproduce the error:

$ git clone https://github.com/jsaak/ruby3-tcp-server-mini-benchmark.git
$ cd ruby3-tcp-server-mini-benchmark
$ ruby loop.rb async-scheduler.rb 
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die

And here it hangs forever (ruby 3.2.1, async 2.4.2, Linux 5.10.0-19-amd64 #1 SMP Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux)

ioquatix commented 1 year ago

Hmm, weird, I'll take a look.

ioquatix commented 1 year ago

It might be that the signal got swallowed some how:

(lldb) thread backtrace all
* thread #1, name = 'ruby', stop reason = signal SIGSTOP
  * frame #0: 0x00007f63e66b69f7 liburing.so.2`___lldb_unnamed_symbol64 + 295
    frame #1: 0x00007f63e66b6b5d liburing.so.2`__io_uring_get_cqe + 61
    frame #2: 0x00007f63e6799053 IO_Event.so`select_internal(_arguments=0x00007ffee9926730) at uring.c:715:22
    frame #3: 0x00005620d68e17a3 ruby`rb_nogvl at thread.c:1548:5
    frame #4: 0x00007f63e6799445 IO_Event.so`IO_Event_Selector_URing_select [inlined] select_internal_without_gvl(arguments=0x00007ffee9926730) at uring.c:725:2
    frame #5: 0x00007f63e6799410 IO_Event.so`IO_Event_Selector_URing_select(self=<unavailable>, duration=<unavailable>) at uring.c:801:13
    frame #6: 0x00005620d691fdd7 ruby`vm_call_cfunc_with_frame at vm_insnhelper.c:3268:12
    frame #7: 0x00005620d693c6f4 ruby`vm_exec_core at vm_insnhelper.c:5080:15
    frame #8: 0x00005620d692cd5c ruby`rb_vm_exec at vm.c:2383:22
    frame #9: 0x00005620d693457f ruby`rb_vm_call0(ec=<unavailable>, recv=<unavailable>, id=<unavailable>, argc=<unavailable>, argv=<unavailable>, cme=<unavailable>, kw_splat=0) at vm_eval.c:58:12
    frame #10: 0x00005620d69351bd ruby`rb_check_funcall_default_kw [inlined] rb_vm_call_kw(kw_splat=0, me=<unavailable>, argv=0x0000000000000000, argc=0, id=26529, recv=140067045688080, ec=0x00005620d70180c0) at vm_eval.c:303:12
    frame #11: 0x00005620d69351a0 ruby`rb_check_funcall_default_kw(recv=140067045688080, mid=26529, argc=0, argv=0x0000000000000000, def=36, kw_splat=0) at vm_eval.c:691:12
    frame #12: 0x00005620d68936e3 ruby`rb_fiber_scheduler_set [inlined] rb_fiber_scheduler_close(scheduler=140067045688080) at scheduler.c:236:14
    frame #13: 0x00005620d68936d0 ruby`rb_fiber_scheduler_set(scheduler=4) at scheduler.c:181:9
    frame #14: 0x00005620d6735f66 ruby`rb_ec_fiber_scheduler_finalize(ec=0x00005620d70180c0) at eval.c:140:9
    frame #15: 0x00005620d67362a8 ruby`rb_ec_teardown(ec=0x00005620d70180c0) at eval.c:152:5
    frame #16: 0x00005620d673652b ruby`rb_ec_cleanup(ec=0x00005620d70180c0, ex=<unavailable>) at eval.c:212:9
    frame #17: 0x00005620d6736e22 ruby`ruby_run_node(n=0x00007f63eb59cc10) at eval.c:330:12
    frame #18: 0x00005620d67318c2 ruby`main [inlined] rb_main(argv=0x00007ffee9926f78, argc=2) at main.c:38:12
    frame #19: 0x00005620d67318a1 ruby`main(argc=<unavailable>, argv=<unavailable>) at main.c:57:12
    frame #20: 0x00007f63eb1cc790 libc.so.6`___lldb_unnamed_symbol3140 + 128
    frame #21: 0x00007f63eb1cc84a libc.so.6`__libc_start_main + 138
    frame #22: 0x00005620d6731905 ruby`_start + 37
ioquatix commented 1 year ago

Probably due to this code: https://github.com/socketry/io-event/blob/df40e1ba440cdfbb99439eaa97873cd63e00a82d/ext/io/event/selector/uring.c#L730-L731

ioquatix commented 1 year ago

Actually, after I pressed Ctrl-C SIGINT, I got a backtrace:

waiting for process to die
^C/home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/scheduler.rb:221:in `select': Interrupt
    from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/scheduler.rb:221:in `run_once'
    from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/scheduler.rb:47:in `close'
    from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/scheduler.rb:40:in `ensure in scheduler_close'
    from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/scheduler.rb:40:in `scheduler_close'
/home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/node.rb:108:in `transient?': SIGHUP (SignalException)
    from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/node.rb:45:in `removed'
    from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/list.rb:132:in `remove!'
    from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/list.rb:121:in `remove'
    from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/node.rb:173:in `remove_child'
    from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/node.rb:188:in `consume'
    from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/task.rb:272:in `finish!'
    from /home/samuel/.gem/ruby/3.2.1/gems/async-2.4.2/lib/async/task.rb:259:in `block in schedule'
./loop.rb:24:in `waitpid': Interrupt
    from ./loop.rb:24:in `block in <main>'
    from ./loop.rb:15:in `loop'
    from ./loop.rb:15:in `<main>'

It seems to imply something is wrong with node clean up...???

ioquatix commented 1 year ago

Yeah, looks like it got stuck in the new terminate loop code:

image

ioquatix commented 1 year ago

I added some extra debugging in that loop, something odd is happening:

Waiting for tasks to terminate...
#<Async::Scheduler:0x000000000000003c 1 children (running)>
    #<Async::Task:0x0000000000000050 (stopped)>
        #<Async::Task:0x0000000000000064 (stopped)>
ioquatix commented 1 year ago

Okay, I think I found the problem.

ioquatix commented 1 year ago

This was released in v2.5.0, do you mind testing it?

jsaak commented 1 year ago

It does not fix the issue unfortunately: (using 2.5.0)

you can test it too, just clone my repo: https://github.com/jsaak/ruby3-tcp-server-mini-benchmark.git

ruby3-tcp-server-mini-benchmark$ ruby loop.rb async-scheduler.rb 
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
^C/home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/scheduler.rb:221:in `select': Interruptloop.rb:24:in `waitpid': Interrupt
    from loop.rb:24:in `block in <main>'
    from loop.rb:15:in `loop'
    from loop.rb:15:in `<main>'

    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/scheduler.rb:221:in `run_once'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/scheduler.rb:47:in `close'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/scheduler.rb:40:in `ensure in scheduler_close'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/scheduler.rb:40:in `scheduler_close'
/home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/node.rb:108:in `transient?': SIGHUP (SignalException)
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/node.rb:45:in `removed'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/list.rb:132:in `remove!'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/list.rb:121:in `remove'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/node.rb:173:in `remove_child'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/node.rb:188:in `consume'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/task.rb:259:in `finish!'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.5.0/lib/async/task.rb:320:in `block in schedule'
ioquatix commented 1 year ago

Okay, thanks for testing it. I also tested it and could not reproduce it after the state handling changes. Let me check again.

ioquatix commented 1 year ago

Do you mind testing https://github.com/socketry/async/pull/223 branch?

jsaak commented 1 year ago

I do not know how to run it without a published package.

Cloning the repo and switching branches is ok. And then what?

ioquatix commented 1 year ago

you can write

gem "async", path: "path/to/async"

in your gem file.

jsaak commented 1 year ago

For me it is still doing it. I have no idea why. here is gdb output, maybe it helps you:

(gdb) bt
#0  0x00007f9169decdd6 in epoll_wait (epfd=5, events=events@entry=0x7ffe16ba093c, maxevents=maxevents@entry=64, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f916a12702e in select_internal (_arguments=0x7ffe16ba0930) at ./io/event/selector/epoll.c:604
#2  0x0000556359f67d32 in rb_nogvl (func=0x7f916a127010 <select_internal>, data1=<optimized out>, ubf=<optimized out>, data2=<optimized out>, flags=<optimized out>) at thread.c:1548
#3  0x00007f916a127207 in select_internal_without_gvl (arguments=0x7ffe16ba0930) at ./io/event/selector/epoll.c:612
#4  IO_Event_Selector_EPoll_select (self=<optimized out>, duration=4) at ./io/event/selector/epoll.c:661
#5  0x0000556359fa450c in vm_call_cfunc_with_frame (ec=0x55635a4bf020, reg_cfp=0x7f9169733e50, calling=<optimized out>) at vm_insnhelper.c:3268
#6  0x0000556359fc1049 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_callinfo.h:367
#7  vm_exec_core (ec=0x55635a4bf020, initial=140729279711548) at insns.def:820
#8  0x0000556359fb11af in rb_vm_exec (ec=0x55635a4bf020, jit_enable_p=true) at vm.c:2374
#9  0x0000556359fc64cb in invoke_block (captured=<optimized out>, captured=<optimized out>, opt_pc=<optimized out>, type=<optimized out>, cref=0x0, self=140262439803880, iseq=0x7f9164fe9500, ec=0x55635a4bf020) at vm.c:1398
#10 invoke_iseq_block_from_c (me=0x0, is_lambda=<optimized out>, cref=0x0, passed_block_handler=0, kw_splat=0, argv=0x7ffe16ba0f68, argc=1, self=140262439803880, captured=<optimized out>, ec=0x55635a4bf020) at vm.c:1454
#11 invoke_block_from_c_bh (force_blockarg=<optimized out>, is_lambda=<optimized out>, cref=<optimized out>, passed_block_handler=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, block_handler=<optimized out>, ec=<optimized out>)
    at vm.c:1472
#12 vm_yield_with_cref (is_lambda=0, cref=0x0, kw_splat=0, argv=0x7ffe16ba0f68, argc=1, ec=0x55635a4bf020) at vm.c:1509
#13 vm_yield (kw_splat=0, argv=0x7ffe16ba0f68, argc=1, ec=0x55635a4bf020) at vm.c:1517
#14 rb_yield_0 (argv=0x7ffe16ba0f68, argc=1) at vm_eval.c:1348
#15 rb_yield (val=<optimized out>, val@entry=4) at vm_eval.c:1364
#16 0x0000556359f62956 in rb_thread_s_handle_interrupt (self=<optimized out>, mask_arg=<optimized out>) at thread.c:2143
#17 0x0000556359fa450c in vm_call_cfunc_with_frame (ec=0x55635a4bf020, reg_cfp=0x7f9169733f10, calling=<optimized out>) at vm_insnhelper.c:3268
#18 0x0000556359fa913e in vm_sendish (ec=0x55635a4bf020, reg_cfp=0x7f9169733f10, cd=0x55635a60f350, block_handler=<optimized out>, method_explorer=mexp_search_method) at vm_callinfo.h:367
#19 0x0000556359fc3054 in vm_exec_core (ec=0x55635a4bf020, initial=140729279711548) at insns.def:801
#20 0x0000556359fb1985 in rb_vm_exec (ec=0x55635a4bf020, jit_enable_p=true) at vm.c:2383
#21 0x0000556359db9978 in rb_ec_exec_node (ec=ec@entry=0x55635a4bf020, n=n@entry=0x7f9164f8a348) at eval.c:289
#22 0x0000556359dbe959 in ruby_run_node (n=0x7f9164f8a348) at eval.c:330
#23 0x0000556359db9713 in rb_main (argv=0x7ffe16ba15d8, argc=2) at ./main.c:38
#24 main (argc=<optimized out>, argv=<optimized out>) at ./main.c:57

Since i am not using bundler Gemfile did not work. This is what i have done. Maybe it is wrong?

  1. git clone async
  2. git checkout prefer-thread-handle_interrupt
  3. remove signing things from async.gemspec
  4. updated version to 2.5.1
  5. gem build asnyc.gemspec
  6. gem uninstall async (2.5.0)
  7. gem install asnyc from path
  8. ruby loop.rb asnyc.rb
ioquatix commented 1 year ago

I will test it again, thanks for more data.

jsaak commented 1 year ago

If you can not reproduce, maybe i can make an ssh account to my computer for you.

ioquatix commented 1 year ago

I could reproduce it the first time, but after my changes it did not happen again even after running for an hour. But I have an idea, I'll add some code to make interrupts more common in the code path which is triggering the bug, and I hope with that idea we can make a test/regression for it.

jsaak commented 1 year ago

It is happening more rarely for me too since 2.5.0.

ioquatix commented 1 year ago

Do you mind checking then latest 2.6.1 release, the terminate method is more robust now, I couldn't reproduce the issue, but will continue trying.

jsaak commented 1 year ago

I will.

ioquatix commented 1 year ago

I have been playing around with the benchmark script and for some reason, at least on macOS, it's not very consistent (Requests/s can vary pretty wildly). I don't know why. Anyway, I left it running for an hour and it did not fail.

ioquatix commented 1 year ago

I retried 2.5.1 on Linux and could reproduce the failure within < 20 iterations of loop.rb. I tried 2.6.1 in the same environment and could not reproduce the failure even after 100+ iterations. I'll leave it running over night to see if it shows up.

jsaak commented 1 year ago

loop.rb async.rb works, but the bug is still present when testing the scheduler only loop.rb async-scheduler.rb:

At first run, it hanged at second waitpid At second run, it hanged after about 30. At third run, it was about 20.

ruby process is at 100% cpu, when it happens

I hope it will make sense to you.

$ ruby -v   
ruby 3.2.1 (2023-02-08 revision 31819e82c8) [x86_64-linux]
$ gem list | grep async
async (2.6.1)
async-io (1.34.3)

ruby3-tcp-server-mini-benchmark$ ruby loop.rb async-scheduler.rb 
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
spawn
/usr/bin/wrk -t1 -c1 -d1s http://localhost:9090
waiting for process to die
^Cloop.rb:24:in `waitpid': Interrupt
    from loop.rb:24:in `block in <main>'
    from loop.rb:15:in `loop'
    from loop.rb:15:in `<main>'
/home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/list.rb:250:in `initialize': Interrupt
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/list.rb:298:in `new'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/list.rb:298:in `each'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/list.rb:176:in `each'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/node.rb:240:in `terminate'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/scheduler.rb:52:in `close'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/scheduler.rb:46:in `ensure in scheduler_close'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/scheduler.rb:46:in `scheduler_close'
/home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/node.rb:110:in `transient?': SIGHUP (SignalException)
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/node.rb:47:in `removed'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/list.rb:132:in `remove!'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/list.rb:121:in `remove'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/node.rb:182:in `remove_child'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/node.rb:197:in `consume'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/task.rb:287:in `finish!'
    from /home/jsaak/.gem/ruby/3.2.1/gems/async-2.6.1/lib/async/task.rb:360:in `block in schedule'
ioquatix commented 1 year ago

Thanks that is helpful.

I also found a bug in loop.rb, I think you need to call io.close on the popen(wrk). Otherwise the process may never be cleaned up. However it appears there is still an issue cleaning up in async scheduler_close after receiving sighup.

ioquatix commented 1 year ago

I've been messing around on my own branch: https://github.com/ioquatix/ruby3-tcp-server-mini-benchmark/commits/master just FYI.

ioquatix commented 1 year ago

Do you mind testing the latest v2.6.2 release? It explicitly ignores sighup until specific points in the event loop.