Illegal recovery of 2 stack overflows with ocamlopt in Mac OS

vicuna commented 11 years ago

Original bug ID: 5976
Reporter: pboutill
Assigned to: @xavierleroy
Status: closed (set by @xavierleroy on 2015-12-11T18:19:31Z)
Resolution: fixed
Priority: normal
Severity: major
Platform: x86_64
OS: MacOS
OS Version: 10.5-10.8
Version: 4.00.1
Target version: 4.01.0+dev
Fixed in version: 4.01.0+dev
Category: runtime system and C interface

Bug description

The following code produces the output

Illegal instruction: 4

(only when compiled to native code)

Steps to reproduce

(* compile the following code with ocamlopt *)
let rec f () = f () ; f ()

let rec loop i =
  if i <= 0 then print_string "OK\n"
  else try f () with Stack_overflow -> loop (pred i)

let () = loop 2 (* works for 1 *)

vicuna commented 11 years ago

Comment author: @ppedrot

This bug is cumbersome in Coq: once a computation has raised Stack_overflow, the user has no option but to restart coqtop in order to recover properly from the next Stack_overflow failure.
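
To make the use case concrete, here is a minimal sketch of the kind of command loop an interactive tool relies on. It is not Coq's code; `guarded` and `run_session` are hypothetical names introduced only for illustration. Each command is run under its own handler, so a stack overflow in one command should not prevent the next one from running.

```ocaml
(* Hypothetical helper: run one command and turn Stack_overflow into an
   ordinary error value, so the surrounding session can keep going. *)
let guarded (cmd : unit -> 'a) : ('a, string) result =
  try Ok (cmd ()) with Stack_overflow -> Error "stack overflow"

(* Run every command in turn; a failing command does not end the session. *)
let run_session (cmds : (unit -> 'a) list) : ('a, string) result list =
  List.map guarded cmds
```

With the bug reported here, a session containing two genuinely overflowing commands (for instance two calls to `f` from the report) would be killed on the second one when compiled with ocamlopt on Mac OS, instead of yielding two `Error` results.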

vicuna commented 11 years ago

Comment author: @alainfrisch

It is also known that stack overflow recovery does not work well under Windows. What about a mode where the runtime would stop cleanly, with a proper error message, upon stack overflow, instead of trying to recover from it?

vicuna commented 11 years ago

Comment author: @xavierleroy

The same Caml code works fine under Linux x86-64, so there's something specific to MacOS X to be investigated.

@frisch: stack overflow as clean fatal error wouldn't help with the Coq use case mentioned by ppedrot. Also, even printing an error message can be challenging when your program is really out of stack space. But I welcome sample implementations, esp. for Windows.
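
For comparison, the "clean stop" behaviour can be approximated at user level, without any change to the runtime. The sketch below is only that, an approximation: `run_or_report` is a hypothetical wrapper, and it assumes the first Stack_overflow is delivered as an ordinary exception. It does not address the difficulty of reporting the error from inside the runtime itself.

```ocaml
(* Hypothetical wrapper: run [main] and map a stack overflow to an error
   message plus a non-zero exit code, instead of trying to continue. *)
let run_or_report (main : unit -> unit) : int =
  try main (); 0
  with Stack_overflow ->
    (* This handler runs in the frame that installed it, after the deep
       frames have been discarded, so stack space is available again. *)
    prerr_endline "Fatal error: stack overflow";
    2
```

A program would typically end with something like `let () = exit (run_or_report real_main)`, where `real_main` stands for the application's entry point.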

vicuna commented 11 years ago

Comment author: @xavierleroy

Further investigations: I tried to reproduce the problem in pure C code, using setjmp/longjmp to simulate exceptions, and the problem does not show up. Looking further into the implementation of longjmp() on MacOS X, it appears that it goes to great lengths to call the undocumented "sigreturn" syscall when exiting from a signal handler. I have the impression that this is especially important when the signal was taken on an alternate stack.

My theory at this point is as follows: the OCaml runtime exits the handler for the stack overflow signal by raising an OCaml exception. This cuts the stack just fine, but does not call "sigreturn". As a consequence, the alternate stack for this handler may not be reset properly, and taking a second stack overflow signal on this alternate stack causes the kernel to abort the program.

This needs to be confirmed further, knowing that gdb under MacOS X is unable to step through a SIGSEGV signal handler...

A possible workaround would be to simulate the raising of the Stack_overflow exception from within the signal handler, by tweaking the saved registers from the ucontext, then returning "normally". This would be a major hack and I'm unsure it can be done in time for release 4.01.

vicuna commented 11 years ago

Comment author: @xavierleroy

Tentative fix in trunk, commits r13759 and r13760. The fix is to return normally from segv_handler, after changing the PC in the signal context to point to caml_stack_overflow in amd64.S, which actually raises the exception. Whether to use this trick is governed by RETURN_AFTER_STACK_OVERFLOW defined or not in asmrun/signals_osdep.h. For the time being, it is defined only for amd64/macosx.

Note: stack backtraces on Stack_overflow exceptions were not reliably recorded by the old implementation, to begin with, but this alternate implementation makes it fundamentally impossible to record them, as we don't have the stack space required to do so. This could be an additional reason to stick to the old implementation on all platforms where it works.
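
A simple way to observe what backtrace, if any, is available after an overflow is sketched below. `backtrace_on_overflow` is a hypothetical helper; it uses only `Printexc.record_backtrace` and `Printexc.get_backtrace` from the standard library. Per the note above, the returned string may well be empty or unreliable when the exception comes from a real stack overflow.

```ocaml
(* Hypothetical helper: run [f]; if it raises Stack_overflow, return
   whatever backtrace text the runtime recorded (possibly ""). *)
let backtrace_on_overflow (f : unit -> unit) : string option =
  Printexc.record_backtrace true;
  try f (); None
  with Stack_overflow -> Some (Printexc.get_backtrace ())
```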
