Process crash when handling traps

HoneyryderChuck commented 7 years ago

I'm currently experiencing this issue, which can't be consistently reproduced, but consistently happens in the same code path.

I'm using the pattern of using a pipe to control the lifecycle of the process/loop. This is the simplified version of the trigger:

Signal.trap("TERM") do
  @writer.print("TERM")
  @writer.flush
end

The reader in the main thread will deal with the TERM signal, and write to another pipe, which reader is registered in a NIO loop. The registered handler should evaluate the message, and stop the loop. This is the intended behaviour, and it does happen most of the time.

Two types of errors happen from time to time, however:

the writer from the loop-registered pipe raises IOError: stream closed on write (although the reader successfully received and handled the message; simply rescuing the exception "patches" this behaviour, doesn't fix it however)
More rarely, the event above triggers a VM level crash.

This happens usually under heavy load like when I run tests which start/stop many loop instances. When running them sequentially, this happens rarely. Now I've added minitest/hell, and I'm seeing it way more often. This leads me to believe that there is some race condition somewhere, and would greatly appreciate some input on how to debug this.

This is the relevant information I can gather:

Ruby 2.3.1p112
Mac OS Sierra 10.12.4
pooler API is the select syscall

Here's the relevant coredump (I'll ignore the non-relevant ruby-platform threads until someone asks otherwise):

-- Control frame information -----------------------------------------------
c:0004 p:---- s:0013 e:000012 CFUNC  :select
c:0003 p:0120 s:0009 e:000008 METHOD 

-- Machine register context ------------------------------------------------
 rax: 0x0000000024ce02b0 rbx: 0x00007ff326154da0 rcx: 0x0000000024ce02b0
 rdx: 0x00007ff3b94d58cc rdi: 0x00007ff32632d130 rsi: 0x000000010bcf1080
 rbp: 0x000070000d1524f0 rsp: 0x000070000d1524e0  r8: 0x0000000000000000
  r9: 0x0000000000000000 r10: 0x0000000000000000 r11: 0x0000000000000000
 r12: 0x000000010b3a9c20 r13: 0x0000000000000000 r14: 0x00007ff326154da0
 r15: 0x00007ff326154dc8 rip: 0x000000010b914e17 rfl: 0x0000000000010202

-- C level backtrace information -------------------------------------------
0   ruby                                0x000000010b2feb94 rb_vm_bugreport + 388
1   ruby                                0x000000010b19ac8a rb_bug_context + 490
2   ruby                                0x000000010b26f184 sigsegv + 68
3   libsystem_platform.dylib            0x00007fffb2260b3a _sigtramp + 26
4   nio4r_ext.bundle                    0x000000010b914e17 ev_invoke_pending + 119
5   ???                                 0x41d6392e4e1cb1b8 0x0 + 4744042128523178424

...
Thread 4:: server.rb:18
0   libsystem_kernel.dylib          0x00007fffb217feb6 __select + 10
1   nio4r_ext.bundle                0x000000010b919432 select_poll + 162 (ev_select.c:170)
2   ruby                            0x000000010b30724d rb_thread_call_without_gvl + 93
3   nio4r_ext.bundle                0x000000010b915119 ev_run + 745 (ev.c:3755)
4   nio4r_ext.bundle                0x000000010b91a68e NIO_Selector_select_synchronized + 142 (selector.c:352)
5   ruby                            0x000000010b1a4649 rb_ensure + 169
6   nio4r_ext.bundle                0x000000010b919f3a NIO_Selector_select + 106 (selector.c:320)
7   ruby                            0x000000010b2f4e8a vm_call_cfunc + 314
8   ruby                            0x000000010b2df3f8 vm_exec_core + 9720
9   ruby                            0x000000010b2ef6c3 vm_exec + 131
10  ruby                            0x000000010b2edfd4 vm_invoke_proc + 196
11  ruby                            0x000000010b30d857 thread_start_func_2 + 1463
12  ruby                            0x000000010b30d27a thread_start_func_1 + 170
13  libsystem_pthread.dylib         0x00007fffb226a9af _pthread_body + 180
14  libsystem_pthread.dylib         0x00007fffb226a8fb _pthread_start + 286
15  libsystem_pthread.dylib         0x00007fffb226a101 thread_start + 13

Thread 5 Crashed:: server.rb:18
0   libsystem_kernel.dylib          0x00007fffb217fd42 __pthread_kill + 10
1   libsystem_pthread.dylib         0x00007fffb226d5bf pthread_kill + 90
2   libsystem_c.dylib               0x00007fffb20e5420 abort + 129
3   ruby                            0x000000010b19aa99 die + 9
4   ruby                            0x000000010b19acde rb_bug_context + 574
5   ruby                            0x000000010b26f184 sigsegv + 68
6   libsystem_platform.dylib        0x00007fffb2260b3a _sigtramp + 26

...

Thread 5 crashed with X86 Thread State (64-bit):
  rax: 0x0000000000000000  rbx: 0x0000000000000006  rcx: 0x00007ff3251377f8  rdx: 0x0000000000000000
  rdi: 0x0000000000001617  rsi: 0x0000000000000006  rbp: 0x00007ff325137820  rsp: 0x00007ff3251377f8
   r8: 0x0000000000000040   r9: 0x00007fffbafb4040  r10: 0x0000000008000000  r11: 0x0000000000000206
  r12: 0x00007ff325137950  r13: 0x000000000000003e  r14: 0x000070000d153000  r15: 0x000000010b3243f6
  rip: 0x00007fffb217fd42  rfl: 0x0000000000000206  cr2: 0x00007fffbafb2128

...
External Modification Summary:
  Calls made by other processes targeting this process:
    task_for_pid: 0
    thread_create: 0
    thread_set_state: 0
  Calls made by this process:
    task_for_pid: 0
    thread_create: 0
    thread_set_state: 0
  Calls made by all processes on this machine:
    task_for_pid: 4376575
    thread_create: 0
    thread_set_state: 0

VM Region Summary:
ReadOnly portion of Libraries: Total=201.0M resident=0K(0%) swapped_out_or_unallocated=201.0M(100%)
Writable regions: Total=78.5M written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=78.5M(100%)

                                VIRTUAL   REGION 
REGION TYPE                        SIZE    COUNT (non-coalesced) 
===========                     =======  ======= 
Kernel Alloc Once                    8K        2 
MALLOC                            65.2M       15 
MALLOC guard page                   16K        4 
STACK GUARD                       56.0M        7 
Stack                             13.1M        8 
Stack Guard                          4K        2 
__DATA                            9720K      162 
__LINKEDIT                       116.2M       26 
__TEXT                            84.9M      161 
__UNICODE                          556K        2 
shared memory                       12K        4 
===========                     =======  ======= 
TOTAL                            345.4M      382

HoneyryderChuck commented 7 years ago

I'm narrowing it down in the C extension to which call exactly triggers the pointer being freed was not allocated, And it's in selector.c:

ev_loop(selector->ev_loop, EVLOOP_ONESHOT);

tarcieri commented 7 years ago

Unfortunately that doesn't really narrow it down very much, as that's where the bulk of libev's functionality is.

There is definitely work needed on signal handling (see #134)

HoneyryderChuck commented 7 years ago

@tarcieri had to pause the investigation, but yes, long story short, it is signal handling.

This one will be maybe hard to reproduce in the current test suite, as it is using rspec, and I don't know if there's an "hell" mode like in minitest. However, after patching the trap call, I've come to this conclusion:

minitest normal mode calls trap 2 times for every test for INFO.

trapping: ["INFO"]

minitest hell mode output is not as consistent, but I'm seeing trapping: ["INFO", "SYSTEM_DEFAULT"] passed as arguments.

If traps are being set all over the place in a GIL-parallel way, this might have side-effects for nio. Maybe one could simulate this by trapping INFO multiple times???

ioquatix commented 7 years ago

Just want to add, that in my experience, minitest does some weird shit with processes. It may not be causing the issue here, but I wouldn't be surprised if it was. I'd suggest making a test case working entirely independent of minitest. If you can supply that, I'll take a look.

HoneyryderChuck commented 7 years ago

I've now resorted to not handling traps in tests (for now), and am just closing descriptors/selector for every test. I was seeing quite a few crashes until recently however, which led me to believe that traps might have been just a red herring.

I was doing something similar to this:

def close
  @selector.close
  @wpipe.close
  @server.close
end

and after a few tests and reactor open/close scenarios, one of them would eventually crash the VM. this even happened in JRuby.

I managed to fix it though, by closing the selector last:

def close
 - @selector.close
  @wpipe.close
  @server.close
+  @selector.close
end

which was quite interesting in itself.

I'll see about getting a reproducible script (I don't know if any other variables in my tests cause this, I just know what the fix was).

ioquatix commented 4 years ago

@HoneyryderChuck if you have time to revisit this and confirm whether it's still an issue that would be super helpful.

socketry / nio4r

Process crash when handling traps #139