semigroups / Semigroups

The GAP package Semigroups
https://semigroups.github.io/Semigroups/
Other
23 stars 35 forks source link

Semigroups 5.0.0 sometimes segfaults in its test suite #853

Closed fingolfin closed 2 years ago

fingolfin commented 2 years ago

A few examples:

  1. https://github.com/gap-system/PackageDistro/runs/7712380255?check_suite_focus=true
  2. https://github.com/gap-system/PackageDistro/runs/7709227638?check_suite_focus=true
  3. https://github.com/gap-system/PackageDistro/runs/7707239115?check_suite_focus=true

In each of these three examples, the crash message is different, but it's always in the same test file:

testing: /home/runner/gap/pkg/semigroups-5.0.0/tst/standard/attributes/homomorph.tst
munmap_chunk(): invalid pointer
/home/runner/work/_temp/bd4cd404-7555-4f8a-b66a-b08af65c76a5.sh: line 28: 30506 Aborted                 (core dumped) gap -A --quitonbreak -r -c "
testing: /home/runner/gap/pkg/semigroups-5.0.0/tst/standard/attributes/homomorph.tst
free(): invalid pointer
/home/runner/work/_temp/498e066f-baf0-4207-9de9-8c8c04e62bbc.sh: line 28:  3813 Aborted                 (core dumped) gap -A --quitonbreak -r -c "
testing: /home/runner/gap/pkg/semigroups-5.0.0/tst/standard/attributes/homomorph.tst
/home/runner/work/_temp/b10dc5b7-db90-4010-98a8-8e1c9f26a92c.sh: line 28: 48001 Segmentation fault      (core dumped) gap -A --quitonbreak -r -c "
fingolfin commented 2 years ago

These crash could be explained by GC issues, such as forgotten CHANGED_BAG calls. I had a quick look to see if I spotted something, and made PR #854 based on this. But even with that patch applied, when I run semigroups against a GAP version built in debug mode, I get this GC crash right away:

$ /Users/mhorn/Projekte/GAP/gap/out-of-tree/debug/gap -A tst/teststandard.g
 ┌───────┐   GAP 4.12dev built on 2022-01-28 23:59:25+0100
 │  GAP  │   https://www.gap-system.org
 └───────┘   Architecture: aarch64-apple-darwin21-default64-kv8
 Configuration:  gmp 6.2.1, GASMAN, readline, KernelDebug
 Loading the library and packages ...
 Packages:   GAPDoc 1.6.5, PrimGrp 3.4.2, SmallGrp 1.5, TransGrp 3.6.3
 Try '??help' for help. See also '?copyright', '?cite' and '?authors'
#I  Running tests with acting methods enabled
Semigroups package: testinstall.tst
msecs: 11524
#I  Running tests with acting methods disabled
# line 188 of 1860 (10%)Assertion failed: (IS_LVARS_OR_HVARS((ActiveGAPState()->CurrLVars))), function CURR_FUNC, file vars.h, line 139.
Abort
0   gap                                 0x00000001001fb224 BacktraceHandler + 44
1   libsystem_platform.dylib            0x00000001c0c1b4a4 _sigtramp + 56
2   libsystem_pthread.dylib             0x00000001c0c03ee0 pthread_kill + 288
3   libsystem_c.dylib                   0x00000001c0b3e340 abort + 168
4   libsystem_c.dylib                   0x00000001c0b3d754 err + 0
5   gap                                 0x00000001001f5090 CURR_FUNC + 72
6   gap                                 0x00000001002def38 VarsAfterCollectBags + 64
7   gap                                 0x00000001003028ac CollectBags + 496
8   gap                                 0x0000000100302ee8 ResizeBag + 860
9   gap                                 0x0000000100277e20 GrowPlist + 200
10  gap                                 0x00000001001f8294 GROW_PLIST + 180
11  gap                                 0x0000000100277f08 AssPlist + 60
12  orb.so                              0x00000001012f7d50 HTAdd_TreeHash_C + 848
13  gap                                 0x00000001001f3ab8 CALL_3ARGS + 60

I guess one could try to use a valgrind-enabled debug build to track this down further. But valgrind doens't work anymore on macOS, so I gotta try this on a Linux machine tomorrow.

Also, perhaps @ChrisJefferson can help, be it with ideas or actively debugging :-)

ChrisJefferson commented 2 years ago

I am going to dig out all the "heavy duty" stuff today, and see if anything falls out. I did do a quick test with the test which was failing and it seemed to pass, so not sure if it needs multiple tests, or I wasn't testing in the right setup.

ChrisJefferson commented 2 years ago

i think that test does a bunch of threading, so it could be a race condition. hope not!

james-d-mitchell commented 2 years ago

Thanks for the report, I'll have a look this morning!

ChrisJefferson commented 2 years ago

Just one small point @fingolfin , I sometimes see those types of errors when some package needs rebuilding after changing debugging / valgrind related options, so it's possible things are fixed (I can't reproduce that immediate crash with the master branch).

fingolfin commented 2 years ago

@ChrisJefferson I've done full clean rebuilds of Semigroups a dozen times or more last night... But for the sake of it, I did git clean -fdx in my Semigroups clone, made sure I was on its latest master (with my PR from last night merged). Then did:

./prerequisites.sh
./autogen.sh
./configure --with-gaproot=$GAPPROJ/gap/out-of-tree/debug
make -j12

Note that I am pointing it at my out-of-tree build of GAP that uses --enable-debug and turns off optimizations.

Then I edited PackageInfo.g of semigroups to change the version to 5.0.0dev to make sure it really loads that copy of Semigroups (and I then verified that this is indeed the case).

Finally started that GAP, pointing it at the dir which contains the Semigroups git clone (:

$GAPPROJ/gap/out-of-tree/debug/gap -A -l "$GAPPROJ/repos;" tst/teststandard.g

Right now, the result is not an error, but rather it hangs (when I say "hang" I mean: pressing Ctrl-C also does nothing, so it runs somewhere deep in C code). With a non-debug GAP version, this test seems to pass quickly? This is what I see:

#I  Running tests with acting methods enabled
# line 796 of 1860 (42%)

I believe the line numbers refer to tst/testinstall.tst (as it has 1860 lines), but it seems the Semigroups test system does not print this information?

On the upside, I've now discovered SemigroupsTestStandard() and have now started that. So far it seems to be chugging along w/o a crash (YAY!) and I'll check on how it performed after I am back from my lunch break.

fingolfin commented 2 years ago

Ahh, sadly, no need to wait, it just crashed:

testing: /Users/mhorn/Projekte/GAP/repos/pkg/others/Semigroups/tst/standard/greens/acting.tst
    8235 ms (733 ms GC) and 327MB allocated for greens/acting.tst
testing: /Users/mhorn/Projekte/GAP/repos/pkg/others/Semigroups/tst/standard/greens/froidure-pin.tst
Assertion failed: (IS_PLIST_OR_POSOBJ(list)), function LEN_PLIST, file plist.h, line 167.
Abort
0   gap                                 0x0000000104333224 BacktraceHandler + 44
1   libsystem_platform.dylib            0x00000001c0c1b4a4 _sigtramp + 56
2   libsystem_pthread.dylib             0x00000001c0c03ee0 pthread_kill + 288
3   libsystem_c.dylib                   0x00000001c0b3e340 abort + 168
4   libsystem_c.dylib                   0x00000001c0b3d754 err + 0
5   gap                                 0x00000001043304d8 LEN_PLIST + 72
6   gap                                 0x00000001043afeec AssPlist + 32
7   semigroups.so                       0x0000000108794b00 _Z13FIND_HCLASSESP9OpaqueBagS0_S0_ + 876
8   gap                                 0x000000010432ba70 CALL_2ARGS + 52
9   gap                                 0x00000001043433a4 EvalOrExecCall + 584
10  gap                                 0x0000000104342e18 EvalFunccall2args + 36
11  gap                                 0x000000010432d5b0 EVAL_EXPR + 108
12  gap                                 0x0000000104413f7c ExecAssLVar + 32
13  gap                                 0x00000001043e61dc EXEC_STAT + 60
14  gap                                 0x00000001043e7f0c ExecSeqStatHelper + 72
15  gap                                 0x00000001043e6b14 ExecSeqStat7 + 28
16  gap                                 0x00000001043e61dc EXEC_STAT + 60
17  gap                                 0x00000001043e61fc EXEC_CURR_FUNC + 20
18  gap                                 0x000000010434274c DoExecFunc + 156
19  gap                                 0x0000000104342000 DoExecFunc1args + 60
20  gap                                 0x000000010432ba30 CALL_1ARGS + 44
21  gap                                 0x00000001043982f8 _ZL9CallNArgsILl1EEP9OpaqueBagS1_S1_S1_S1_S1_S1_S1_ + 52
22  gap                                 0x0000000104393d14 _ZL16DoOperationNArgsILl1ELl0ELl0EEP9OpaqueBagS1_S1_S1_S1_S1_S1_S1_ + 728
23  gap                                 0x0000000104393a30 DoOperation1Args + 52
24  gap                                 0x00000001043969c8 DoAttribute + 116
25  gap                                 0x000000010432ba30 CALL_1ARGS + 44
26  gap                                 0x000000010434338c EvalOrExecCall + 560
27  gap                                 0x0000000104342de8 EvalFunccall1args + 36
28  gap                                 0x000000010432d5b0 EVAL_EXPR + 108
29  gap                                 0x0000000104343208 EvalOrExecCall + 172
30  gap                                 0x0000000104342e18 EvalFunccall2args + 36
31  gap                                 0x000000010432d5b0 EVAL_EXPR + 108

So it crashed in FIND_HCLASSES.

james-d-mitchell commented 2 years ago

I also can't reproduce this, and I'm have difficulty running anything in GAP compiled in debug mode without it segfaulting. For example, the tests of Digraphs crash when Semigroups isn't loaded, I must be doing something wrong.

Also FIND_HCLASSES hasn't changed in a very long time, so I doubt the issue is actually in that function.

james-d-mitchell commented 2 years ago

@fingolfin you mention cleaning up the Semigroups package and rebuilding it, but what about the dependencies?

fingolfin commented 2 years ago

To clarify a few things:

james-d-mitchell commented 2 years ago

Thanks @fingolfin, just out of interest when you say you get the crash on macOS are you using a mac with an arm processor or intel?

I'm a little at a lost about what to do next, for me the Digraphs tests also lead to an assertion failure in GAP (when it's compiled with --enable-debug in the master branch).

fingolfin commented 2 years ago

A few more things:

fingolfin commented 2 years ago

@james-d-mitchell this is on an ARM mac right now (but it also crashes on an Intel Ubuntu machine). I'll have a look at Digraphs, too

james-d-mitchell commented 2 years ago

Thanks @fingolfin, just for sanity check I also tried running the Digraphs package tests with GAP 4.11.1 compiled in debug mode, and I get the same behaviour:

gap> DigraphsTestStandard();
Architecture: arm-apple-darwin21.5.0-default64-kv7

testing: /Users/jdm/gap/pkg/digraphs/tst/standard/attr.tst
   10248 ms (232 ms GC) and 690MB allocated for attr.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/cliques.tst
    1411 ms (60 ms GC) and 35.6MB allocated for cliques.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/constructors.tst
     209 ms (53 ms GC) and 3.55MB allocated for constructors.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/digraph.tst
   12128 ms (4925 ms GC) and 360MB allocated for digraph.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/display.tst
     407 ms (209 ms GC) and 2.70MB allocated for display.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/examples.tst
     337 ms (80 ms GC) and 5.30MB allocated for examples.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/grahom.tst
    1939 ms (87 ms GC) and 52.2MB allocated for grahom.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/grape.tst
     268 ms (79 ms GC) and 2.80MB allocated for grape.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/io.tst
    6849 ms (2027 ms GC) and 169MB allocated for io.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/isomorph.tst
    1425 ms (140 ms GC) and 48.8MB allocated for isomorph.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/labels.tst
     216 ms (90 ms GC) and 435KB allocated for labels.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/obsolete.tst
#I  Test: File does not contain any tests!
       2 ms (0 ms GC) and 135KB allocated for obsolete.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/oper.tst
    4794 ms (771 ms GC) and 126MB allocated for oper.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/orbits.tst
     192 ms (73 ms GC) and 240KB allocated for orbits.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/planar.tst
     471 ms (78 ms GC) and 8.72MB allocated for planar.tst
testing: /Users/jdm/gap/pkg/digraphs/tst/standard/prop.tst
Assertion failed: (IS_PLIST_OR_POSOBJ(list)), function ELM_PLIST, file plist.h, line 198.
Process 97319 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #4: 0x000000010005c608 gap`ELM_PLIST(list=0x00001000045b7540, pos=4) at plist.h:198:5
   195  */
   196  EXPORT_INLINE Obj ELM_PLIST(Obj list, Int pos)
   197  {
-> 198      GAP_ASSERT(IS_PLIST_OR_POSOBJ(list));
   199      GAP_ASSERT(pos >= 1);
   200      GAP_ASSERT(pos <= CAPACITY_PLIST(list));
   201      return CONST_ADDR_OBJ(list)[pos];
Target 0: (gap) stopped.
(lldb) p list
(Obj) $0 = 0x00001000045b7540
(lldb) p TNAM_OBJ(list)
(const Char *) $1 = 0x0000000000000000
(lldb) p TNUM_OBJ(list)
(UInt) $2 = 248
(lldb)
fingolfin commented 2 years ago

You could use bt to print a backtrace to see where it dies.

And here is what I got with GAP master in debug mode running a freshly compiled Digraphs master build:

#I  Using bliss by default for AutomorphismGroup . . .
Digraphs package: testinstall.tst
msecs: 215
#I  Using NautyTracesInterface by default for AutomorphismGroup
#I  bliss will be used for edge coloured automorphisms
Digraphs package: testinstall.tst
msecs: 137
#I  Using bliss by default for AutomorphismGroup . . .
#I  . . . and pretending that NautyTracesInterface is not available . .  .
Digraphs package: testinstall.tst
msecs: 132
#I  Using bliss by default for AutomorphismGroup . . .
Architecture: aarch64-apple-darwin21-default64-kv8

testing: /Users/mhorn/Projekte/GAP/repos/pkg/others/Digraphs/tst/standard/attr.tst
Assertion failed: (IS_PLIST_OR_POSOBJ(list)), function ELM_PLIST, file plist.h, line 200.
Abort
0   gap                                 0x00000001022f3224 BacktraceHandler + 44
1   libsystem_platform.dylib            0x00000001c0c1b4a4 _sigtramp + 56
2   libsystem_pthread.dylib             0x00000001c0c03ee0 pthread_kill + 288
3   libsystem_c.dylib                   0x00000001c0b3e340 abort + 168
4   libsystem_c.dylib                   0x00000001c0b3d754 err + 0
5   gap                                 0x00000001022f0640 ELM_PLIST + 76
6   digraphs.so                         0x000000010357dd28 FuncDIGRAPH_IN_OUT_NBS + 212
7   gap                                 0x00000001022eba30 CALL_1ARGS + 44
....

So it dies in FuncDIGRAPH_IN_OUT_NBS (but as above this does not have to be place where the bug is, if any)

james-d-mitchell commented 2 years ago

Yup, it dies in the same function for me, although not reliably, and not (so far) when I have Digraphs compiled in debug mode.

fingolfin commented 2 years ago

Hacking the Semigroups / Digraphs test code to use opts.testOptions.showProgress := true (instead of false) makes debugging a lot easier for me, as now after a crash I see what it executed last. This revealed a trivial way to trigger a crash: RandomDigraph(IsMutableDigraph, 1000);. Refining this a bit: RANDOM_DIGRAPH(10000, 8580);; crashes for me, each time. Since this is a single C function, the bug (or rather, a bug) must be in there. The crash is in the line adji = ELM_PLIST(adj, i); . In the debugger, inspecting the value of adj reveals that it points to a garbage object.

But this was when compiling with CFLAGS set to -g -Og. When I just used -Og, the crash vanished.

Oh, and also to my great surprise, inline functions like INT_INTOBJ (which turns into a single CPU instruction) were not inlined at -O2; only at -O3 (all of this is with Apple clang on an M1 mac). Really weird.

Annnyway: it is possible that this hints at a problem we feared might happen for a long time: namely some compiler optimizations interfering with our GC write guard CHANGED_BAG, and/or with our "hack" to let the GC see references on the stack and in registers.... :/ That would explain why it works w/o optimizations... I wanted to look at the machine code to see if I could find a hint as to what might cause it, and that lead to the above weird discovery... I have to get the kid from day care now, will investigate more later (or perhaps @james-d-mitchell or @ChrisJefferson have some brilliant ideas in the meantime)

ChrisJefferson commented 2 years ago

Has anyone seen this on a non-ARM Mac? I'm not yet seeing it on x86-64 (but I'm currently at a full day workshop, so might not have looked hard enough).

If it's just on ARM Macs, we might need to adjust how we go about "flushing" everything on to the stack -- we usually assume a longjmp bag is enough, but maybe we need some more stuff (just a random guess)

james-d-mitchell commented 2 years ago

Aha! It's good to hear that you have some idea what might be going on, but sounds like rather bad news otherwise. I can reproduce the RANDOM_DIGRAPH(10000, 8580);; crash. I'll investigate that and let you know what happens.

Another thing that might be related and can reliably be reproduced (I've had students who had this issue too), if you make a change to Digraphs or Semigroups, and re-compile, but don't delete the bin directory, then GAP seg faults on start up every time. This didn't used to happen on my previous Mac but I noticed it when I got a new machine with an ARM processor.

james-d-mitchell commented 2 years ago

One more observation: when I compile GAP with --enable-debug and with no further CFLAGS or CXXFLAGS specified at configure time, then RANDOM_DIGRAPH(10000, 8580);; does not seg fault.

fingolfin commented 2 years ago

The segfaults on macos you describe are due to code signing issues. I'll give a link with details once I am at a computer, and can probably also provide a workaround for the buildsystem

ChrisJefferson commented 2 years ago

Weirdly, I've been unable to produce any crash (except for the first one @fingolfin fixed).

Just checking, we are looking at GAP master branch with currently released package archive? I also tried master branch of digraphs and semigroups.

james-d-mitchell commented 2 years ago

That’s right @ChrisJefferson Gap master versus Semigroups released version, with GAP compiled with —enable-debug and -g3 and Digraphs (at least) compiled without any user defined flags.

james-d-mitchell commented 2 years ago

As reported above the issue arises in GAP 4.11.1 and Digraphs also, I didn’t check with any earlier versions of GAP.

ChrisJefferson commented 2 years ago

Could someone try removing the _TAB from line 1316, I don't think this will be a problem, but GAP does seem to assume in various places (list ASS_LIST), I think, that if a list has type _TAB it is dense, but it isn't dense when half-built. I'm not sure if this will cause problems, but there isn't much else in RANDOM_DIGRAPH!

fingolfin commented 2 years ago

re code signing issues, see https://github.com/gap-packages/anupq/issues/47#issuecomment-856296954

Re _TAB: I don't see how it could cause issues, considering that we only use SET_ELM_PLIST to build that plist, until it really is a table:

  adj = NEW_PLIST(T_PLIST_TAB, n);
  SET_LEN_PLIST(adj, n);

  for (i = 1; i <= n; i++) {
    SET_ELM_PLIST(adj, i, NEW_PLIST(T_PLIST_EMPTY, 0));
    CHANGED_BAG(adj);
  }

That code looks perfectly fine to me (well, at least under the rules we operated under for the past decade, on x86 ...).

fingolfin commented 2 years ago

Here is the source for _setjmp for arm64 / aarch64 on macOS: https://github.com/apple/darwin-libplatform/blob/main/src/setjmp/arm64/setjmp.s

#define JMP_r19_20  #0x00
#define JMP_r21_22  #0x10
#define JMP_r23_24  #0x20
#define JMP_r25_26  #0x30
#define JMP_r27_28  #0x40
#define JMP_fp_lr   #0x50
#define JMP_sp_rsvd #0x60 /* second field is reserved/unused */
#define JMP_d8_d9   #0x70
#define JMP_d10_d11 #0x80
#define JMP_d12_d13 #0x90
#define JMP_d14_d15 #0xA0
...

/* int _setjmp(jmp_buf env); */
ENTRY_POINT(__setjmp)
    mov     x12, sp
    _OS_PTR_MUNGE_TOKEN(x16, x16)
    _OS_PTR_MUNGE(x10, fp, x16)
    _OS_PTR_MUNGE(x11, lr, x16)
    _OS_PTR_MUNGE(x12, x12, x16)
    stp     x19, x20,   [x0, JMP_r19_20]
    stp     x21, x22,   [x0, JMP_r21_22]
    stp     x23, x24,   [x0, JMP_r23_24]
    stp     x25, x26,   [x0, JMP_r25_26]
    stp     x27, x28,   [x0, JMP_r27_28]
    stp     x10, x11,   [x0, JMP_fp_lr]
    str     x12,        [x0, JMP_sp_rsvd]
    stp     d8, d9,     [x0, JMP_d8_d9]
    stp     d10, d11,   [x0, JMP_d10_d11]
    stp     d12, d13,   [x0, JMP_d12_d13]
    stp     d14, d15,   [x0, JMP_d14_d15]
    mov     w0, #0
    ret

One can easily see that not all registers are saved, just some, matching the arm64 calling conventions (see e.g. here) which mandate that r19-r28, fp, lr and sp are saved; in addition d8-d15 are saved, which are the lower 64 bit of the floating point registers v8-v15; this also matches the guide.

And this should be enough anyway! That's whole point of setjmp after all: that it stores enough information to completely restore the state of the caller to a degree that it can be resumed; so it must store precisely those registers, nothing more, nothing less

So I could stop here; I don't think it is possible for setjmp to "not store all needed registers" (though of course it could store them in an "encrypted format, which would break us; i.e. imagine _setjmp would xor a random bit string into its jmp buffer; we'd be screwed. Luckily so far nobody seems to have deemed this a good idea... but who knows what future security mitigations will bring).

But of course I went on (admittedly in parts because I only realized this must be a dead end when it was already too late :joy:)

Here is a partially annotated disassembly of RANDOM_DIGRAPH. Note how adj is stored in register x19 which is stored by _setjmp, so at least in theory our setjmp trick in the GC should be working for it (yet I did see garbled content where adj was pointing in several of the crashes I inspected...).

Note that in this function, GC can only be triggered by NewBag and ASS_LIST. But the first NewBag is safe because it creates adj and we have no other reference before it which could be collected. The next NewBag comes from NEW_PLIST and is in the first loop, and the one value we need to preserve is adj in x19 which as I just argued should be preserved.

_FuncRANDOM_DIGRAPH:
       0:   stp x28, x27, [sp, #-96]!
       4:   stp x26, x25, [sp, #16]
       8:   stp x24, x23, [sp, #32]
       c:   stp x22, x21, [sp, #48]
      10:   stp x20, x19, [sp, #64]
      14:   stp x29, x30, [sp, #80]
      18:   add x29, sp, #80
      1c:   mov x21, x2
      20:   mov x20, x1
      24:   asr x22, x1, #2     ; n   = INT_INTOBJ(nn);
      28:   lsl x8, x22, #3
      2c:   add x1, x8, #8

; adj = NEW_PLIST(T_PLIST_TAB, n);
      30:   mov w0, #42         ; T_PLIST_TAB = 42
      34:   bl  _NewBag
      38:   mov x19, x0         ; x19 = adj
      3c:   ldr x8, [x0]
      40:   mov w9, #1
      44:   str x9, [x8]
      48:   and x8, x20, #0xfffffffffffffffc
      4c:   orr x8, x8, #0x1
      50:   ldr x9, [x0]
      54:   str x8, [x9]
      58:   cmp x20, #4         ; nn <= INTOBJ_INT(0) ?
      5c:   b.lo    _LABEL_END  ; skip to the end if  n is less than 1

      60:   asr x21, x21, #2    ; lim = INT_INTOBJ(limm);
      64:   mov w23, #1

; x24 = &YoungBags
      68:   adrp    x24, _YoungBags@GOTPAGE
      6c:   ldr x24, [x24, _YoungBags@GOTPAGEOFF]

; x25 = &ChangedBags
      70:   adrp    x25, _ChangedBags@GOTPAGE
      74:   ldr x25, [x25, _ChangedBags@GOTPAGEOFF]

      78:   mov w26, #1         ; i = 1   (note: w26 = lower 32 bits of x26)
      7c:   b   label_8c
first_loop:
      80:   add x26, x26, #1    ; i++
      84:   cmp x26, x22        ; i <= n ?
      88:   b.hi    0xd4

label_8c:
      8c:   mov w0, #34         ; T_PLIST_EMPTY = 34
      90:   mov w1, #8
; x0 = NEW_PLIST(T_PLIST_EMPTY, 0)
      94:   bl  _NewBag         ; new bag will be in x0
      98:   ldr x8, [x0]        ; x8 = ADDD_OBJ(x0)
      9c:   str x23, [x8]       ; *ADDR_OBJ(x0)

; SET_ELM_PLIST(adj, i, x0);
      a0:   ldr x8, [x19]       ; x8 = ADDR_OBJ(adj)
      a4:   str x0, [x8, x26, lsl #3]   ; *(ADDR_OBJ(adj) + i * 8) = x0

; CHANGED_BAG(adj)
      a8:   ldr x8, [x19]       ; x8 = ADDR_OBJ(adj)   ; repetition... code is not really optimized
      ac:   ldr x9, [x24]       ; x9 = YoungBags
      b0:   cmp x8, x9          ; if (CONST_PTR_BAG(adj) <= YoungBags ...
      b4:   b.hi    first_loop  ; 
      b8:   ldur    x9, [x8, #-8]   ; x9 = LINK_BAG(adj)
      bc:   cmp x9, x19         ; ... && LINK_BAG(adj) == adj
      c0:   b.ne    first_loop
; LINK_BAG(adj) = ChangedBags;
      c4:   ldr x9, [x25]
      c8:   stur    x9, [x8, #-8]
      cc:   str x19, [x25]      ; ChangedBags = adj
      d0:   b   first_loop

      d4:   cmp x20, #4
      d8:   b.hs    0xfc

_LABEL_END:
      dc:   mov x0, x19
      e0:   ldp x29, x30, [sp, #80]
      e4:   ldp x20, x19, [sp, #64]
      e8:   ldp x22, x21, [sp, #48]
      ec:   ldp x24, x23, [sp, #32]
      f0:   ldp x26, x25, [sp, #16]
      f4:   ldp x28, x27, [sp], #96
      f8:   ret

      fc:   mov w23, #1
     100:   mov w24, #35757
     104:   movk    w24, #26843, lsl #16
     108:   mov w25, #10000

; x26 = AssListFuncs
     10c:   adrp    x26, _AssListFuncs@GOTPAGE
     110:   ldr x26, [x26, _AssListFuncs@GOTPAGEOFF]
     114:   b   0x124
     118:   add x23, x23, #1        ; i++
     11c:   cmp x23, x22            ; compare i and n
     120:   b.hi    _LABEL_END
     124:   mov w20, #5             ; j = w20 = 5 = INTOBJ_INT(1)
     128:   mov w27, #1
     12c:   b   0x158
label_130:
     130:   mov x9, #0

label_134:
     134:   asr x8, x8, #2              ; x8 = INT_INTOBJ(x8) where x8 = LEN_PLIST(adji)
     138:   add x1, x8, #1              ; pos = x1 = x8 + 1 = len + 1
     13c:   ldr x8, [x26, x9, lsl #3]   ; AssListFuncs[x9]
     140:   mov x2, x20
     144:   blr x8                      ; AssListFuncs[x9](x0, x1, x2) with x0 = list = adji, x1=pos, x2=obj
; the above call may have caused GC, and destroyed registers

     148:   add x27, x27, #1
     14c:   add x20, x20, #4          ; "j++" but with j=x20 an intobj, not an int
     150:   cmp x27, x22
     154:   b.hi    0x118
     158:   bl  _rand
     15c:   smull   x8, w0, w24
     160:   lsr x9, x8, #63
     164:   asr x8, x8, #44
     168:   add w8, w8, w9
     16c:   msub    w8, w8, w25, w0
     170:   cmp x21, w8, sxtw
     174:   b.ls    0x148

; adji = ELM_PLIST(adj, i);
     178:   ldr x8, [x19]               ; x8 = ADDR_OBJ(adj)
     17c:   ldr x0, [x8, x23, lsl #3]   ; x0 = *(ADDR_OBJ(adj) + x23 * 8)   ; so x0 = adji

     180:   ldr x9, [x0]                ; x9 = ADDR_OBJ(adji)

; len = LEN_PLIST(adji);
     184:   ldr x8, [x9]                ; x8 = *ADDR_OBJ(adji)

; ASS_LIST(adji, len + 1, INTOBJ_INT(j))
; ->
  ; UInt tnum = TNUM_OBJ(list);
  ; if (FIRST_LIST_TNUM <= tnum && tnum <= LAST_LIST_TNUM &&
  ;     (tnum & IMMUTABLE)) {
  ;     ErrorMayQuit("List Assignment: <list> must be a mutable list", 0, 0);
  ; }
  ; (*AssListFuncs[TNUM_OBJ(list)])(list, pos, obj);

     188:   tbnz    w0, #0, label_130
     18c:   tbnz    w0, #1, label_error       ; tbnz = Test bit and Branch if Nonzero.
     190:   ldurh   w10, [x9, #-16]
     194:   and x9, x10, #0xff
     198:   sub x11, x9, #22
     19c:   and x10, x10, #0x1
     1a0:   cmp x11, #55
     1a4:   ccmp    x10, #0, #4, ls
     1a8:   b.eq    label_134
     1ac:   b   0x1b8
label_error:
     1b0:   mov w9, #5
     1b4:   b   0x134
; ErrorMayQuit("List Assignment: <list> must be a mutable list", 0, 0);
     1b8:   adrp    x0, l_.str@PAGE
     1bc:   add x0, x0, l_.str@PAGEOFF
     1c0:   mov x1, #0
     1c4:   mov x2, #0
     1c8:   bl  _ErrorMayQuit

Conclusion of all this: so far even the machine code (well, the parts I looked at closer, I admit I did not check all of it (yet?)) seems reasonable to me sigh.

ChrisJefferson commented 2 years ago

Actually, i think you are on to something with longjmp / setjmp. I remembered seeing that _OS_PTR_MUNGE.

There is a reference here: https://googleprojectzero.blogspot.com/2019/08/in-wild-ios-exploit-chain-4.html , look for _OS_PTR_MUNGE, I think they are XORing registers in longjmp/setjmp!

It might be it's not effecting registers we care about, but it does look very suspicious. It's hard to find full details of what's happening.

fingolfin commented 2 years ago

Found it! Well, at least the digraphs crash, I did not yet try this with semigroups...

Turns out GASMAN was broken (by me, of course 😭). Fix in https://github.com/gap-system/gap/pull/4978

fingolfin commented 2 years ago

So I feel quite embarrassed about this, but much more than that I feel relief that this has a logical and relatively harmless explanation, and we don't have to completely replace GASMAN or whatnot 😓 . And moreover I am glad we caught this before releasing GAP 4.12. Bullet dodged.

james-d-mitchell commented 2 years ago

Phew! Thanks for this! I'm going to take a pass through the kernel module of Semigroups just to check if there are any other other missing CHANGED_BAGs etc, then I'll make a new release. I'll also try to figure out if your change in https://github.com/gap-system/gap/pull/4978 fixes Semigroups or not.

fingolfin commented 2 years ago

Just to say, for me with that GAP fix, all crashes in Semigroups are now gone. Once you release 5.0.1, I'll watch in the PackageDistro repository for further faillures

fingolfin commented 2 years ago

@james-d-mitchell maybe just make 5.0.1 now, assuming it is not too much work; even if there are other errors, this one fix already would help, plus once 5.0.1 is out and merged into the PackageDistro, it'll get hammered there; the current crashes show up quite frequently, and if it stops crashing with 5.0.1, we are probably good; if not, we can still investigate more closely.

james-d-mitchell commented 2 years ago

Yup, I'm making a release as soon as the CI will pass.

james-d-mitchell commented 2 years ago

Released just now!