sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.39k stars 473 forks source link

Segfault testing src/sage/libs/gap/element.pyx on python 3.12 #37026

Open tornaria opened 9 months ago

tornaria commented 9 months ago

Concerning the error in src/sage/libs/gap/element.pyx, it happens in line 2489:

for i in range(100):
        rnd = [ randint(-10,10) for i in range(randint(0,7)) ]
        # compute the sum in GAP
        _ = libgap.Sum(rnd)
        try:
            libgap.Sum(*rnd)
            print('This should have triggered a ValueError')
            print('because Sum needs a list as argument')
        except ValueError:
            pass

getting Killed due to segmentation fault. Running this code in the terminal works, the error appears only in tests and with python 3.12.

Originally posted by @enriqueartal in https://github.com/sagemath/sage/issues/37011#issuecomment-1880050744

dimpase commented 9 months ago

my eyes bleed from this line:

        rnd = [ randint(-10,10) for i in range(randint(0,7)) ]

How about

        rnd = random.choices(range(-10,10), k=randint(0,7))

?

See https://docs.python.org/3/library/random.html#random.choices

dimpase commented 9 months ago

maybe it segfaults on libgap.Sum(*rnd) when rnd==[] ?

dimpase commented 7 months ago

This also pops up on Fedora 39 with system Python 3.12, see e.g. https://groups.google.com/g/sage-release/c/oTnTrEnB3YU/m/tZiLm66ZAwAJ

dimpase commented 7 months ago

But I can't reproduce this with Python 3.12.2

enriqueartal commented 7 months ago

Is it possible that both @jaapspies and me are missing some python3 system package?

enriqueartal commented 7 months ago

At some point this segfault occured only in tests, now it happens also in the command line

kiwifb commented 6 months ago

my eyes bleed from this line:

        rnd = [ randint(-10,10) for i in range(randint(0,7)) ]

How about

        rnd = random.choices(range(-10,10), k=randint(0,7))

?

See https://docs.python.org/3/library/random.html#random.choices

Tried that but it does not have an effect

sage: import random ## line 2490 ##
sage: for i in range(100):
    rnd = random.choices(range(-10,10), k=randint(0,7))
    # compute the sum in GAP
    _ = libgap.Sum(rnd)
    try:
        libgap.Sum(*rnd)
        print('This should have triggered a ValueError')
        print('because Sum needs a list as argument')
    except ValueError:
        pass ## line 2491 ##
------------------------------------------------------------------------
/usr/lib/python3.12/site-packages/cysignals/signals.cpython-312-x86_64-linux-gnu.so(+0xa0d4)[0x7fbe2865c0d4]
/usr/lib/python3.12/site-packages/cysignals/signals.cpython-312-x86_64-linux-gnu.so(+0xa196)[0x7fbe2865c196]
/usr/lib/python3.12/site-packages/cysignals/signals.cpython-312-x86_64-linux-gnu.so(+0xcbf6)[0x7fbe2865ebf6]
/usr/lib64/libc.so.6(+0x3a210)[0x7fbe2945f210]
tornaria commented 6 months ago

There's a more detailed trace in https://github.com/sagemath/sage/pull/36407#issuecomment-2068709126

antonio-rojas commented 5 months ago

Caused by GCC commit https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=049ec9b981d1f4f97736061d5cf7d0ae990b57d7

Upstream report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872

Can be reproduced by calling any libgap function in exactly 3 parameters, eg. libgap.AbelianGroup(0,0,0)

dimpase commented 5 months ago

@sheerluck has kindly disassembled broken element....so, resulting from compilation of element.c. I report the result here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872#c12 (that is, code generation is broken, the call to GAP_CallFunc3Args() is formed with one parameter less than it should be)

Fixing gcc is above my payrate, though :-)

dimpase commented 5 months ago

more discussion on https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 seems to indicate that gcc is actually not broken, it's our code that is (missing volatile declarations). See also https://stackoverflow.com/questions/7996825/why-volatile-works-for-setjmp-longjmp

If I am not mistaken, we have

sig_on()
...
GAP_CallFunc3Args(foo0,...,foo3)
...
sign_off()

and the longmp() triggered somewhere in libgap, called by GAP_CallFunc3Args, invalidates the registers' contents. Now, if foo3, say, was held in a register, it is gone for good. Thus unwinding this call, which involves a GAP object held in foo3, is no longer possible, and triggers a segfault.

Looks like a lot of volatile to add all over the place: (anywhere libgap, singular (which also uses setjmp/longjmp) is called, any other setjmp/longjmp-ing upstream code?

dimpase commented 5 months ago

See https://trofi.github.io/posts/312-the-sagemath-saga.html for a detailed analysis. Not sure whether the detail that GAP uses setjmp/longjmp to return from GAP_CallFunc3Args() here in the 1st place.

dimpase commented 5 months ago

By the way, the Python 3.12 is only needed to produce a crash. Otherwise, we get a memory leak, as these temporary GAP objects are not cleared.

$ ./sage
┌────────────────────────────────────────────────────────────────────┐
│ SageMath version 10.4.beta6, Release Date: 2024-05-12              │
│ Using Python 3.11.9. Type "help()" for help.                       │
└────────────────────────────────────────────────────────────────────┘

 sage: libgap.count_GAP_objects()
0
sage: libgap.Sum(1,2,42)
---------------------------------------------------------------------------
GAPError                                  Traceback (most recent call last)
Cell In[2], line 1
----> 1 libgap.Sum(Integer(1),Integer(2),Integer(42))

File /home/scratch/scratch2/dimpase/sage/sage/src/sage/libs/gap/element.pyx:2514, in sage.libs.gap.element.GapElement_Function.__call__()
   2512 try:
   2513     sig_GAP_Enter()
-> 2514     sig_on()
   2515     if n == 0:
   2516         result = GAP_CallFunc0Args(self.value)

GAPError: Error, no method found! Error, no 1st choice method found for `SumOp' on 3 arguments
sage: libgap.count_GAP_objects()
4
sage: libgap.Sum(3,5,7)
---------------------------------------------------------------------------
GAPError                                  Traceback (most recent call last)
Cell In[4], line 1
----> 1 libgap.Sum(Integer(3),Integer(5),Integer(7))

File /home/scratch/scratch2/dimpase/sage/sage/src/sage/libs/gap/element.pyx:2514, in sage.libs.gap.element.GapElement_Function.__call__()
   2512 try:
   2513     sig_GAP_Enter()
-> 2514     sig_on()
   2515     if n == 0:
   2516         result = GAP_CallFunc0Args(self.value)

GAPError: Error, no method found! Error, no 1st choice method found for `SumOp' on 3 arguments
sage: libgap.count_GAP_objects()
7
antonio-rojas commented 4 months ago

I'm afraid this is still happening in 10.4.beta7, which is supposed to include 72e6b66b1c9699cef63922c988c40031a5fc5925

dimpase commented 4 months ago

we should do rigorous fixing, not just an ad hoc like this.

We need a stateless way to create GAP objects, which can be automatically destroyed.

collares commented 3 months ago

I'm afraid this is still happening in 10.4.beta7, which is supposed to include 72e6b66

I'm also seeing this problem on NixOS CI. With the patch in #37951 applied, the temporary variable used to hold <GapElement>a[1] seems to suffer from the same problem that affected the <GapElement>a[2] temporary: GCC removes a null check, and at the same time proves the variable is always null, and the end result is that it dereferences a null pointer. This feels like a big minefield. I think I will just apply the following patch for the time being.

diff --git a/src/sage/libs/gap/element.pyx b/src/sage/libs/gap/element.pyx
index b2b0681c052..b2e5807392c 100644
--- a/src/sage/libs/gap/element.pyx
+++ b/src/sage/libs/gap/element.pyx
@@ -35,6 +35,13 @@ from sage.structure.coerce cimport coercion_model as cm
 ### helper functions to construct lists and records ########################
 ############################################################################

+
+cdef extern from *:
+    """
+    #pragma GCC optimize("O1")
+    """
+
+
 cdef Obj make_gap_list(sage_list) except NULL:
     """
     Convert Sage lists into Gap lists
orlitzky commented 1 week ago

This is back:

$ sage
┌────────────────────────────────────────────────────────────────────┐
│ SageMath version 10.5.beta6, Release Date: 2024-09-29              │
│ Using Python 3.12.6. Type "help()" for help.                       │
└────────────────────────────────────────────────────────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Warning: this is a prerelease version, and it may be unstable.     ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
sage: libgap.AbelianGroup(0,0,0)
...
Unhandled SIGSEGV: A segmentation fault occurred.
This probably occurred because a *compiled* module has a bug
in it and is not properly wrapped with sig_on(), sig_off().
Python will now terminate.
------------------------------------------------------------------------
Segmentation fault
enriqueartal commented 1 week ago

I am getting this:

┌────────────────────────────────────────────────────────────────────┐
│ SageMath version 10.5.beta6, Release Date: 2024-09-29              │
│ Using Python 3.12.6. Type "help()" for help.                       │
└────────────────────────────────────────────────────────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Warning: this is a prerelease version, and it may be unstable.     ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
sage: libgap.AbelianGroup(0,0,0)
---------------------------------------------------------------------------
GAPError                                  Traceback (most recent call last)
Cell In[1], line 1
----> 1 libgap.AbelianGroup(Integer(0),Integer(0),Integer(0))

File /usr/local/sagedev/src/sage/libs/gap/element.pyx:2520, in sage.libs.gap.element.GapElement_Function.__call__()
   2518 try:
   2519     sig_GAP_Enter()
-> 2520     sig_on()
   2521     if n == 0:
   2522         result = GAP_CallFunc0Args(self.value)

The same output in Sage 10.4

dimpase commented 1 week ago

Perhaps more volatiles have to be added, like it was done there:

72e6b66b1c96 (Dima Pasechnik    2024-05-06 23:53:47 +0100 2488)         cdef volatile Obj v2

Compilers get more efficient in optimising register allocation, killing our stateful interfaces with GAP and probably more...

orlitzky commented 1 week ago

I'm sure it is still system-dependent since it was originally "triggered" by a minor GCC update, but I've got it on two very different machines with both gcc-13 and gcc-14 now.

I did try adding more volatiles for the other arguments to the GAP function calls, but it doesn't seem to help.

dimpase commented 1 week ago

What GAP versions do you test with?

orlitzky commented 1 week ago

What GAP versions do you test with?

Usually with 4.13.1 from Gentoo, but this is happening with the 4.12.2 SPKG as well.

kiwifb commented 1 week ago

OK, so I have those two machines one is "sage-on-gentoo" release (currently 10.4) and the other is "sage-on-gentoo" cutting edge (Volker's merging branch). I observe the segfault in the release one and @enriqueartal's output in the cutting edge one.

kiwifb commented 1 week ago

Of interest to me is the explicit coercion of 0 to Integer(0) in the second case which is stuff that I have seen appear after incorporating the elimination of sage's mpmath backend in #38113 which is currently applied in "sage-on-gentoo cutting edge".

dimpase commented 1 week ago

I see @enriqueartal 's output with 10.5.beta6 and Sage's GAP. Same output if I explicitly use libgap(0) rather than 0 in the call.

I get a coredump in Linux x86_64 conda-installed, with Python 3.12, Sage 10.4.

orlitzky commented 1 week ago

Testing https://github.com/sagemath/sage/pull/38804 with the new gap-4.13.1 SPKG:

sage -t --long --warn-long 77.9 --random-seed=14617378659005491983447022349844162666\
9 src/sage/libs/gap/element.pyx
**********************************************************************
File "src/sage/libs/gap/element.pyx", line 2440, in sage.libs.gap.element.GapElement\
_Function.__call__
Failed example:
    sorted(a(b))
Expected:
    [Group(()),
     Sym( [ 1 .. 4 ] ),
     Alt( [ 1 .. 4 ] ),
     Group([ (1,4)(2,3), (1,2)(3,4) ])]
Got:
    [Group(()),
     Sym( [ 1 .. 4 ] ),
     Alt( [ 1 .. 4 ] ),
     Group([ (1,4)(2,3), (1,3)(2,4) ])]
    Killed due to segmentation fault
**********************************************************************
orlitzky commented 1 week ago

...which of course comes right after

sage: for i in range(100):
    rnd = [ randint(-10,10) for i in range(randint(0,7)) ]
    # compute the sum in GAP
    _ = libgap.Sum(rnd)
    try:
        libgap.Sum(*rnd)
        print('This should have triggered a ValueError')
        print('because Sum needs a list as argument')
    except ValueError:
        pass ## line 2494 ##