Closed embray closed 5 years ago
The reason this occurs specifically with Bernoulli
makes sense to me now, as does the interaction with the -m
flag. GAP memoizes each subresult, so computing something like Bernoulli(1724)
probably hits GAP's initial workspace size requiring some reallocation (though I'm not sure what the implication of this is in the context of it using an mmap'd memory pool.
This is the point at which it crashes. Still not sure why though, since any memory allocation would be happening at the point of trying to allocate a new Bag, at which point it should just work. But the crashes are happening inside GMP functions. Maybe somehow reallocating moves some bags around that contain data still being used by GMP? I'm not sure.
I think this might be a reprise of #27214. The problem, I think seems to be that GMP's assembly routines are confusing Windows' stack unwinding on exceptions, and thus preventing the Cygwin exception handler from being called.
Indeed, I can demonstrate the problem with this example code:
$ cat test.c
#include <gmp.h>
#include <stdlib.h>
int main(void) {
mp_limb_t *rlp;
mp_size_t qxn = 0;
mp_limb_t s2p[1] = {0};
mp_size_t s2n = 1;
mp_limb_t s3limb = 1;
/* Use of this function in particular is arbitrary aside from
* the fact that it is known to demonstrate the problem on my
* system (my system uses an assembly implementation for it)
*/
/* Just set to something that will segfault when accessed */
rlp = (mp_limb_t*)0x1234;
mpn_divrem_1(rlp, qxn, s2p, s2n, s3limb);
return 1;
}
$ gcc test.c -lgmp
$ ./a.exe && echo $?
0
It should print that it got a SIGSEGV and exit with the corresponding exit code (139). A "normal" program dereferencing an invalid pointer will do this just fine on Cygwin. Instead, it just exits 0, which is the (somewhat unfortunate) behavior when a program crashes outside Cygwin's control. If I run it in gdb I do see that an access violation occurs inside the assembly implementation of mpn_divrem_1
when it first tries to write something to *rlp
.
I've been experimenting with this a bit further trying to understand exactly where things are going wrong, and while it's still not 100% clear to me it might actually be a bug in the NT kernel, or at least tripping some exceptional behavior in exception handling (or maybe a subtle bug in Cygwin related to this).
What I found so far is that in GMP's assembly code it's actually using the %rbp
register just as a normal register in its calculations. It saves the old value on the stack and pops it at the end, but this obviously doesn't help if an exception occurs in the function body. It seems that the NT exception handling code hates this. Indeed I can take a relatively "innocent" program that generates a segfault like:
$cat test3.c
#include <stdlib.h>
void foo() {
volatile int *i = (int *)0xffffffff;
*i = 0;
}
int main(void) {
foo();
return 1;
}
and modify it like
$cat test4.c
#include <stdlib.h>
void foo() {
volatile int *i = (int *)0xffffffff;
/* First let's do something weird to rbp, just to check a theory... */
asm("mov $0, %rbp");
*i = 0;
}
int main(void) {
foo();
return 1;
}
I compile this like
$ gcc -fomit-frame-pointer test4.c
so that it doesn't even use %rbp
in the function prologue and doesn't use it to address local variables, so the disassembly of foo
in this case looks like:
0000000000000000 <foo>:
0: 48 83 ec 18 sub $0x18,%rsp
4: b8 ff ff ff ff mov $0xffffffff,%eax
9: 48 89 44 24 08 mov %rax,0x8(%rsp)
e: 48 c7 c5 00 00 00 00 mov $0x0,%rbp
15: 48 8b 44 24 08 mov 0x8(%rsp),%rax
1a: c7 00 00 00 00 00 movl $0x0,(%rax)
20: 90 nop
21: 48 83 c4 18 add $0x18,%rsp
25: c3 retq
so it's not even using %rbp
(ditto in main
) except where I set it to zero.
Of course, in Cygwin, by the time the main
function is called there are actually several layers of initialization before that which are quite complex, at the top of which is where Cygwin's exception handler is installed. So of course when entering main
the value of %rbp
is certainly meaningful for walking back up the stack to the frame the exception handler is installed for.
How exactly Windows does this I'm not sure, but what I do know is when setting %rbp
to some bogus value I actually get a second access violation somewhere in a kernel function called RtlWalkFrameChain
which is called by KiUserExceptionDispatcher
(this I know is the function that locates and dispatches user-mode exception handlers, hence the name). The fact that RtlWalkFrameChain
itself blows up feels like a bug to me, though obviously by messing with %rbp
we're violating some assumption it's making. This in turn results in re-entering KiUserExceptionDispatcher
in an infinite recursion, because it keeps blowing up in RtlWalkFrameChain
, until I guess the stack overflows and the kernel gives up on user-mode exception handling and just kills the process. I need to refresh my memory on how exactly this works, as I know a lot of it has been reverse-engineered.
Regardless, the moral of the story is don't abuse %rbp
if you expect exception handling to work properly. This is a problem for GMP...
I think I see now why it's so hell-bent on RBP: For every the compiler generates an entry in the .xdata
section ("x" for "exception" I guess) of the PE32 binary, which is a table of UNWIND_INFO
structures explained here: https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=vs-2019#raw-pseudo-operations
Each UNWIND_INFO
includes an array of special opcodes that explains how that function's frame is set up:
An array of items that explains the effect of the prolog on the nonvolatile registers and RSP.
It also has a "frame register" field which is non-zero if the function does use a frame pointer like RBP:
If nonzero, then the function uses a frame pointer (FP), and this field is the number of the nonvolatile register used as the frame pointer, using the same encoding for the operation info field of UNWIND_CODE nodes.
You can use the dumpbin
utility to view these sections, and indeed running it on cygwin1.dll
and finding the UNWIND_INFO
structure for _cygtls::call
(which is essentially the entrypoint to a Cygwin executable's main thread, and where the exception handler is installed) shows:
00000618 00006E70 00006F77 0026A51C _ZN7_cygtls5call2EPFjPvS0_ES0_S0_
Unwind version: 1
Unwind flags: None
Size of prologue: 0x0D
Count of codes: 7
Frame register: rbp
Frame offset: 0x0
Unwind codes:
0D: ALLOC_SMALL, size=0x20
09: SET_FPREG, register=rbp, offset=0x00
06: PUSH_NONVOL, register=rbx
05: PUSH_NONVOL, register=rsi
04: PUSH_NONVOL, register=rdi
03: PUSH_NONVOL, register=r12
01: PUSH_NONVOL, register=rbp
These opcodes basically tell the stack unwinder exactly what it needs to back out of each function on stack until it gets back to the frame where the exception will be handled.
In GCC there are equivalent pseudo-operations that can be used in assembly (I have seen these before but wasn't really sure how they work) named .seh_*
which are used for generating these .pdata
and .xdata
entries for each function in the object file. You can see these directly in the GCC assembly output on Windows. For example for my test3.c
above it outputs:
.globl foo
.def foo; .scl 2; .type 32; .endef
.seh_proc foo
foo:
pushq %rbp
.seh_pushreg %rbp
movq %rsp, %rbp
.seh_setframe %rbp, 0
subq $16, %rsp
.seh_stackalloc 16
.seh_endprologue
movl $4294967295, %eax
...
This explicitly delineates which instructions comprise the function's prologue and what they do: push register RBP on the stack, set a frame pointer at 0 offst from RBP, allocate 16 bytes on the stack.
If assembled with the -fomit-frame-pointer
flag it's even simpler, as to be expected:
.globl foo
.def foo; .scl 2; .type 32; .endef
.seh_proc foo
foo:
subq $24, %rsp
.seh_stackalloc 24
.seh_endprologue
movl $4294967295, %eax
...
i.e. just allocate 24 bytes on the stack.
I hand modified the assembly for test4.c
above so that foo
is defined:
.globl foo
.def foo; .scl 2; .type 32; .endef
.seh_proc foo
foo:
pushq %rbp
.seh_pushreg %rbp
subq $24, %rsp
.seh_stackalloc 24
.seh_endprologue
movl $4294967295, %eax
movq %rax, 8(%rsp)
mov $0, %rbp
movq 8(%rsp), %rax
movl $0, (%rax)
nop
addq $24, %rsp
popq %rbp
ret
.seh_endproc
So it notes explicitly that %rbp
is a non-volatile register that should be saved (it doesn't say anything about it serving in this function as a frame pointer in the case of this function).
But with this bit in place (and it must include the .seh_pushreg
or else it doesn't work) the kernel can successfully unwind the stack to where the exception is handled by Cygwin.
So this is what the assembly code in GMP should be doing, but isn't. I'll propose a patch for that and see if I can get it into Cygwin's GMP package if nothing else. In the meantime I'd need another workaround...
Changed keywords from none to cygwin gap gmp
Upstream: Reported upstream. Developers acknowledge bug.
Description changed:
---
+++
@@ -5,3 +5,5 @@
For the most part GAP otherwise works fine and most operations do not cause any problems at all, so it is likely a narrow case. Also strangely, if I remove the `-m 64m` flag when running GAP (a default argument used when running `sage -gap`) the crash does not occur. Likewise if I provide a larger value like `-m 128m` it does not occur. I need to double-check exactly what the effect of this argument is.
In the meantime, between this and #27721 it might be a good idea to disable use of the system GMP on Cygwin altogether until and unless I can get to the bottom of this.
+
+**Upstream issue for GAP:** https://github.com/gap-system/gap/issues/3434
Added an issue for this on GAP. Have already discussed the issue on Slack some with Chris Jefferson who acknowledges the bug and that it would be good to try to work around in GAP, although we haven't actually attempted a workaround yet.
Here is a branch that adds my latest patch to GAP for Cygwin.
It is based on my PR, but backported for the 4.10-stable branch so that it applies to the current GAP used in Sage. This approach to fixing the issue hasn't been approved upstream yet so I would hold off on merging this until there's been more time for things to settle.
But I can confirm that this does fix (or at least successfully work around) the problem for me in all known cases so it would still be good to include this if it is not fixed soon in GAP.
I'll also note that this patch has no impact on other platforms, so it is not a runtime requirement for GAP on any other platform.
Author: Erik Bray
Branch: u/embray/cygwin/ticket-27724
Description changed:
---
+++
@@ -7,3 +7,4 @@
In the meantime, between this and #27721 it might be a good idea to disable use of the system GMP on Cygwin altogether until and unless I can get to the bottom of this.
**Upstream issue for GAP:** https://github.com/gap-system/gap/issues/3434
+**Upstream PR for GAP:** https://github.com/gap-system/gap/pull/3435
Description changed:
---
+++
@@ -7,4 +7,5 @@
In the meantime, between this and #27721 it might be a good idea to disable use of the system GMP on Cygwin altogether until and unless I can get to the bottom of this.
**Upstream issue for GAP:** https://github.com/gap-system/gap/issues/3434
+
**Upstream PR for GAP:** https://github.com/gap-system/gap/pull/3435
Commit: 067183f
Branch pushed to git repo; I updated commit sha1. New commits:
067183f | Trac #27724: Include patch from upstream PR to workaround possible |
Just as a note, it turns out this problem is long-since fixed in MPIR due in large part to its adoption of YASM, which in turn has implicit support for SEH on x64 Windows (which MPIR makes proper use of).
While I would like to get this problem fixed back in the original GMP as well and am willing to work on it, it might be easier just to package MPIR for Cygwin (and furthermore, try to convince Cygwin to use MPIR as its default GMP implementation). Part of the reason MPIR exists has always been its dedicated Windows support so this might make sense.
Changed upstream from Reported upstream. Developers acknowledge bug. to Fixed upstream, in a later stable release.
upstream says OK, as far as I can see. Should we go ahead and merge this?
I need to update the patch to match the version that was accepted, but then yes, it would be good to merge this.
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
13e1798 | Trac #27724: Include patch from upstream PR to workaround possible |
Updated patch to the version accepted upstream. This should be ready to go.
Moving tickets from the Sage 8.8 milestone that have been actively worked on in the last six months to the next release milestone (optimistically).
Reviewer: Samuel Lelièvre
This will be in GAP 4.11 but was not backported to GAP 4.10.2, see:
so let's get this in here. Positive review.
Merge conflict
Sorry I missed one detail. The branch refers to GAP 4.10.1 but we now have GAP 4.10.2,
so the file build/pkgs/gap/package-version.txt
should read 4.10.2.p1
instead of 4.10.1.p1
.
I tested on Cygwin on Windows 7 that Sage builds fine if that change is made. Please @embray can you make that change? You can then set to positive review on my behalf.
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
1077f3d | Trac #27724: Include patch from upstream PR to workaround possible |
Rebased, and confirmed that the patch still applies.
Changed branch from u/embray/cygwin/ticket-27724 to 1077f3d
It is crashing in various different (depending on the arguments I give) GMP functions for large integers, with segmentation faults in the assembly code.
The situation is bad enough that it causes irrecoverable stack corruption and the process dies without Cygwin's exception handling even able to run. I don't know that the problem is likely anything specifically to do with the
Bernoulli
function--that's just the context in which I've been able to reproduce the problem reliably.For the most part GAP otherwise works fine and most operations do not cause any problems at all, so it is likely a narrow case. Also strangely, if I remove the
-m 64m
flag when running GAP (a default argument used when runningsage -gap
) the crash does not occur. Likewise if I provide a larger value like-m 128m
it does not occur. I need to double-check exactly what the effect of this argument is.In the meantime, between this and #27721 it might be a good idea to disable use of the system GMP on Cygwin altogether until and unless I can get to the bottom of this.
Upstream issue for GAP: https://github.com/gap-system/gap/issues/3434
Upstream PR for GAP: https://github.com/gap-system/gap/pull/3435
Upstream: Fixed upstream, in a later stable release.
CC: @slel
Component: packages: standard
Keywords: cygwin gap gmp
Author: Erik Bray
Branch/Commit:
1077f3d
Reviewer: Samuel Lelièvre
Issue created by migration from https://trac.sagemath.org/ticket/27724