GAP Bernoulli function crashes on Cygwin with system GMP

embray commented 5 years ago

It is crashing in various different (depending on the arguments I give) GMP functions for large integers, with segmentation faults in the assembly code.

The situation is bad enough that it causes irrecoverable stack corruption and the process dies without Cygwin's exception handling even able to run. I don't know that the problem is likely anything specifically to do with the Bernoulli function--that's just the context in which I've been able to reproduce the problem reliably.

For the most part GAP otherwise works fine and most operations do not cause any problems at all, so it is likely a narrow case. Also strangely, if I remove the -m 64m flag when running GAP (a default argument used when running sage -gap) the crash does not occur. Likewise if I provide a larger value like -m 128m it does not occur. I need to double-check exactly what the effect of this argument is.

In the meantime, between this and #27721 it might be a good idea to disable use of the system GMP on Cygwin altogether until and unless I can get to the bottom of this.

Upstream issue for GAP: https://github.com/gap-system/gap/issues/3434

Upstream PR for GAP: https://github.com/gap-system/gap/pull/3435

Upstream: Fixed upstream, in a later stable release.

CC: @slel

Component: packages: standard

Keywords: cygwin gap gmp

Author: Erik Bray

Branch/Commit: 1077f3d

Reviewer: Samuel Lelièvre

Issue created by migration from https://trac.sagemath.org/ticket/27724

embray commented 5 years ago

comment:2

The reason this occurs specifically with Bernoulli makes sense to me now, as does the interaction with the -m flag. GAP memoizes each subresult, so computing something like Bernoulli(1724) probably hits GAP's initial workspace size requiring some reallocation (though I'm not sure what the implication of this is in the context of it using an mmap'd memory pool.

This is the point at which it crashes. Still not sure why though, since any memory allocation would be happening at the point of trying to allocate a new Bag, at which point it should just work. But the crashes are happening inside GMP functions. Maybe somehow reallocating moves some bags around that contain data still being used by GMP? I'm not sure.

embray commented 5 years ago

comment:3

I think this might be a reprise of #27214. The problem, I think seems to be that GMP's assembly routines are confusing Windows' stack unwinding on exceptions, and thus preventing the Cygwin exception handler from being called.

embray commented 5 years ago

comment:4

Indeed, I can demonstrate the problem with this example code:

$ cat test.c

#include <gmp.h>
#include <stdlib.h>

int main(void) {
    mp_limb_t *rlp;
    mp_size_t qxn = 0;
    mp_limb_t s2p[1] = {0};
    mp_size_t s2n = 1;
    mp_limb_t s3limb = 1;

    /* Use of this function in particular is arbitrary aside from
     * the fact that it is known to demonstrate the problem on my
     * system (my system uses an assembly implementation for it)
     */

    /* Just set to something that will segfault when accessed */
    rlp = (mp_limb_t*)0x1234;
    mpn_divrem_1(rlp, qxn, s2p, s2n, s3limb);
    return 1;
}

$ gcc test.c -lgmp
$ ./a.exe && echo $?
0

It should print that it got a SIGSEGV and exit with the corresponding exit code (139). A "normal" program dereferencing an invalid pointer will do this just fine on Cygwin. Instead, it just exits 0, which is the (somewhat unfortunate) behavior when a program crashes outside Cygwin's control. If I run it in gdb I do see that an access violation occurs inside the assembly implementation of mpn_divrem_1 when it first tries to write something to *rlp.

I've been experimenting with this a bit further trying to understand exactly where things are going wrong, and while it's still not 100% clear to me it might actually be a bug in the NT kernel, or at least tripping some exceptional behavior in exception handling (or maybe a subtle bug in Cygwin related to this).

What I found so far is that in GMP's assembly code it's actually using the %rbp register just as a normal register in its calculations. It saves the old value on the stack and pops it at the end, but this obviously doesn't help if an exception occurs in the function body. It seems that the NT exception handling code hates this. Indeed I can take a relatively "innocent" program that generates a segfault like:

$cat test3.c

#include <stdlib.h>

void foo() {
    volatile int *i = (int *)0xffffffff;
    *i = 0;
}

int main(void) {
    foo();
    return 1;
}

and modify it like

$cat test4.c

#include <stdlib.h>

void foo() {
    volatile int *i = (int *)0xffffffff;
    /* First let's do something weird to rbp, just to check a theory... */
    asm("mov $0, %rbp");
    *i = 0;
}

int main(void) {
    foo();
    return 1;
}

I compile this like

$ gcc -fomit-frame-pointer test4.c

so that it doesn't even use %rbp in the function prologue and doesn't use it to address local variables, so the disassembly of foo in this case looks like:

0000000000000000 <foo>:
   0:   48 83 ec 18             sub    $0x18,%rsp
   4:   b8 ff ff ff ff          mov    $0xffffffff,%eax
   9:   48 89 44 24 08          mov    %rax,0x8(%rsp)
   e:   48 c7 c5 00 00 00 00    mov    $0x0,%rbp
  15:   48 8b 44 24 08          mov    0x8(%rsp),%rax
  1a:   c7 00 00 00 00 00       movl   $0x0,(%rax)
  20:   90                      nop
  21:   48 83 c4 18             add    $0x18,%rsp
  25:   c3                      retq

so it's not even using %rbp (ditto in main) except where I set it to zero.

Of course, in Cygwin, by the time the main function is called there are actually several layers of initialization before that which are quite complex, at the top of which is where Cygwin's exception handler is installed. So of course when entering main the value of %rbp is certainly meaningful for walking back up the stack to the frame the exception handler is installed for.

How exactly Windows does this I'm not sure, but what I do know is when setting %rbp to some bogus value I actually get a second access violation somewhere in a kernel function called RtlWalkFrameChain which is called by KiUserExceptionDispatcher (this I know is the function that locates and dispatches user-mode exception handlers, hence the name). The fact that RtlWalkFrameChain itself blows up feels like a bug to me, though obviously by messing with %rbp we're violating some assumption it's making. This in turn results in re-entering KiUserExceptionDispatcher in an infinite recursion, because it keeps blowing up in RtlWalkFrameChain, until I guess the stack overflows and the kernel gives up on user-mode exception handling and just kills the process. I need to refresh my memory on how exactly this works, as I know a lot of it has been reverse-engineered.

Regardless, the moral of the story is don't abuse %rbp if you expect exception handling to work properly. This is a problem for GMP...

embray commented 5 years ago

comment:5

I think I see now why it's so hell-bent on RBP: For every the compiler generates an entry in the .xdata section ("x" for "exception" I guess) of the PE32 binary, which is a table of UNWIND_INFO structures explained here: https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=vs-2019#raw-pseudo-operations

Each UNWIND_INFO includes an array of special opcodes that explains how that function's frame is set up:

An array of items that explains the effect of the prolog on the nonvolatile registers and RSP.

It also has a "frame register" field which is non-zero if the function does use a frame pointer like RBP:

If nonzero, then the function uses a frame pointer (FP), and this field is the number of the nonvolatile register used as the frame pointer, using the same encoding for the operation info field of UNWIND_CODE nodes.

You can use the dumpbin utility to view these sections, and indeed running it on cygwin1.dll and finding the UNWIND_INFO structure for _cygtls::call (which is essentially the entrypoint to a Cygwin executable's main thread, and where the exception handler is installed) shows:

  00000618 00006E70 00006F77 0026A51C  _ZN7_cygtls5call2EPFjPvS0_ES0_S0_
    Unwind version: 1
    Unwind flags: None
    Size of prologue: 0x0D
    Count of codes: 7
    Frame register: rbp
    Frame offset: 0x0
    Unwind codes:
      0D: ALLOC_SMALL, size=0x20
      09: SET_FPREG, register=rbp, offset=0x00
      06: PUSH_NONVOL, register=rbx
      05: PUSH_NONVOL, register=rsi
      04: PUSH_NONVOL, register=rdi
      03: PUSH_NONVOL, register=r12
      01: PUSH_NONVOL, register=rbp

These opcodes basically tell the stack unwinder exactly what it needs to back out of each function on stack until it gets back to the frame where the exception will be handled.

In GCC there are equivalent pseudo-operations that can be used in assembly (I have seen these before but wasn't really sure how they work) named .seh_* which are used for generating these .pdata and .xdata entries for each function in the object file. You can see these directly in the GCC assembly output on Windows. For example for my test3.c above it outputs:

        .globl  foo
        .def    foo;    .scl    2;      .type   32;     .endef
        .seh_proc       foo
foo:
        pushq   %rbp
        .seh_pushreg    %rbp
        movq    %rsp, %rbp
        .seh_setframe   %rbp, 0
        subq    $16, %rsp
        .seh_stackalloc 16
        .seh_endprologue
        movl    $4294967295, %eax
...

This explicitly delineates which instructions comprise the function's prologue and what they do: push register RBP on the stack, set a frame pointer at 0 offst from RBP, allocate 16 bytes on the stack.

If assembled with the -fomit-frame-pointer flag it's even simpler, as to be expected:

        .globl  foo
        .def    foo;    .scl    2;      .type   32;     .endef
        .seh_proc       foo
foo:
        subq    $24, %rsp
        .seh_stackalloc 24
        .seh_endprologue
        movl    $4294967295, %eax
...

i.e. just allocate 24 bytes on the stack.

I hand modified the assembly for test4.c above so that foo is defined:

        .globl  foo
        .def    foo;    .scl    2;      .type   32;     .endef
        .seh_proc       foo
foo:
        pushq   %rbp
        .seh_pushreg    %rbp
        subq    $24, %rsp
        .seh_stackalloc 24
        .seh_endprologue
        movl    $4294967295, %eax
        movq    %rax, 8(%rsp)
        mov $0, %rbp
        movq    8(%rsp), %rax
        movl    $0, (%rax)
        nop
        addq    $24, %rsp
        popq    %rbp
        ret
        .seh_endproc

So it notes explicitly that %rbp is a non-volatile register that should be saved (it doesn't say anything about it serving in this function as a frame pointer in the case of this function).

But with this bit in place (and it must include the .seh_pushreg or else it doesn't work) the kernel can successfully unwind the stack to where the exception is handled by Cygwin.

So this is what the assembly code in GMP should be doing, but isn't. I'll propose a patch for that and see if I can get it into Cygwin's GMP package if nothing else. In the meantime I'd need another workaround...

embray commented 5 years ago

Changed keywords from none to cygwin gap gmp

embray commented 5 years ago

Upstream: Reported upstream. Developers acknowledge bug.

embray commented 5 years ago

Description changed:

--- 
+++ 
@@ -5,3 +5,5 @@
 For the most part GAP otherwise works fine and most operations do not cause any problems at all, so it is likely a narrow case.  Also strangely, if I remove the `-m 64m` flag when running GAP (a default argument used when running `sage -gap`) the crash does not occur.  Likewise if I provide a larger value like `-m 128m` it does not occur.  I need to double-check exactly what the effect of this argument is.

 In the meantime, between this and #27721 it might be a good idea to disable use of the system GMP on Cygwin altogether until and unless I can get to the bottom of this.
+
+**Upstream issue for GAP:** https://github.com/gap-system/gap/issues/3434

embray commented 5 years ago

comment:6

Added an issue for this on GAP. Have already discussed the issue on Slack some with Chris Jefferson who acknowledges the bug and that it would be good to try to work around in GAP, although we haven't actually attempted a workaround yet.

embray commented 5 years ago

comment:7

Here is a branch that adds my latest patch to GAP for Cygwin.

It is based on my PR, but backported for the 4.10-stable branch so that it applies to the current GAP used in Sage. This approach to fixing the issue hasn't been approved upstream yet so I would hold off on merging this until there's been more time for things to settle.

But I can confirm that this does fix (or at least successfully work around) the problem for me in all known cases so it would still be good to include this if it is not fixed soon in GAP.

I'll also note that this patch has no impact on other platforms, so it is not a runtime requirement for GAP on any other platform.

embray commented 5 years ago

Author: Erik Bray

embray commented 5 years ago

Branch: u/embray/cygwin/ticket-27724

embray commented 5 years ago

Description changed:

--- 
+++ 
@@ -7,3 +7,4 @@
 In the meantime, between this and #27721 it might be a good idea to disable use of the system GMP on Cygwin altogether until and unless I can get to the bottom of this.

 **Upstream issue for GAP:** https://github.com/gap-system/gap/issues/3434
+**Upstream PR for GAP:** https://github.com/gap-system/gap/pull/3435

embray commented 5 years ago

Description changed:

--- 
+++ 
@@ -7,4 +7,5 @@
 In the meantime, between this and #27721 it might be a good idea to disable use of the system GMP on Cygwin altogether until and unless I can get to the bottom of this.

 **Upstream issue for GAP:** https://github.com/gap-system/gap/issues/3434
+
 **Upstream PR for GAP:** https://github.com/gap-system/gap/pull/3435

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Commit: 067183f

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

067183f Trac #27724: Include patch from upstream PR to workaround possible

embray commented 5 years ago

comment:10

Just as a note, it turns out this problem is long-since fixed in MPIR due in large part to its adoption of YASM, which in turn has implicit support for SEH on x64 Windows (which MPIR makes proper use of).

While I would like to get this problem fixed back in the original GMP as well and am willing to work on it, it might be easier just to package MPIR for Cygwin (and furthermore, try to convince Cygwin to use MPIR as its default GMP implementation). Part of the reason MPIR exists has always been its dedicated Windows support so this might make sense.

dimpase commented 5 years ago

Changed upstream from Reported upstream. Developers acknowledge bug. to Fixed upstream, in a later stable release.

dimpase commented 5 years ago

comment:11

upstream says OK, as far as I can see. Should we go ahead and merge this?

embray commented 5 years ago

comment:12

I need to update the patch to match the version that was accepted, but then yes, it would be good to merge this.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

13e1798 Trac #27724: Include patch from upstream PR to workaround possible

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Changed commit from 067183f to 13e1798

embray commented 5 years ago

comment:16

Updated patch to the version accepted upstream. This should be ready to go.

embray commented 5 years ago

comment:17

Moving tickets from the Sage 8.8 milestone that have been actively worked on in the last six months to the next release milestone (optimistically).

slel commented 5 years ago

Reviewer: Samuel Lelièvre

slel commented 5 years ago

comment:18

This will be in GAP 4.11 but was not backported to GAP 4.10.2, see:

so let's get this in here. Positive review.

vbraun commented 5 years ago

comment:19

Merge conflict

slel commented 5 years ago

comment:20

Sorry I missed one detail. The branch refers to GAP 4.10.1 but we now have GAP 4.10.2, so the file build/pkgs/gap/package-version.txt should read 4.10.2.p1 instead of 4.10.1.p1.

I tested on Cygwin on Windows 7 that Sage builds fine if that change is made. Please @embray can you make that change? You can then set to positive review on my behalf.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Changed commit from 13e1798 to 1077f3d

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

1077f3d Trac #27724: Include patch from upstream PR to workaround possible

embray commented 5 years ago

comment:22

Rebased, and confirmed that the patch still applies.

vbraun commented 5 years ago

Changed branch from u/embray/cygwin/ticket-27724 to 1077f3d

sagemath / sage

GAP Bernoulli function crashes on Cygwin with system GMP #27724