semigroups / Semigroups

The GAP package Semigroups
https://semigroups.github.io/Semigroups/
Other
23 stars 35 forks source link

Random crashes in CI tests #634

Closed wilfwilson closed 10 months ago

wilfwilson commented 4 years ago

Over the last couple of weeks, I have noticed sporadic but fairly frequent failed Travis jobs for Semigroups. These seem to be various kinds of actual crashes.

I have restarted all of these jobs as they happened, and they passed eventually. This suggests that it’s perhaps going to be difficult to reproduce locally and track down. But I’ll stop doing that and record the problems here, so that maybe we can work towards resolving this.

I haven’t noticed any patterns yet with which jobs are failing, and where. My guess would that be this is libsemigroups 1.x related - but who knows?!

32-bit / master / latest packages

32-bit / stable-4.11 / latest packages

32-bit / stable-4.9 / required packages

64-bit / stable-4.10 / required packages

64-bit / stable-4.11 / required packages

64-bit / master / latest packages

james-d-mitchell commented 4 years ago

I’ve noticed this too, at least one of the failures is harmless (we should still fix it though). I’m not super concerned about failures in the 32 bit tests, but I’ll in any case see if I can’t resolve this.

james-d-mitchell commented 4 years ago

Another one:

https://travis-ci.org/gap-packages/Semigroups/jobs/636284977?utm_medium=notification&utm_source=github_status

olexandr-konovalov commented 4 years ago

Seems the same what I observe in staging tests under https://travis-ci.org/gap-infra/

james-d-mitchell commented 4 years ago

Interestingly this appears to be a bug in Digraphs. I'm trying to figure out what it is exactly.

james-d-mitchell commented 4 years ago

This might also be related to #643

wilfwilson commented 4 years ago

Hope so 🙂

wilfwilson commented 4 years ago

Although the problem is still occurring after has been merged, so maybe not.

james-d-mitchell commented 4 years ago

So, I think I know what this is now, I've fixed all the bugs I found in libsemigroups and Semigroups that might be related to this issue, and the only thing remaining is in PR #650. In that PR I restrict Semigroups to running one travis job in 32 bit, and to use a hacked version of GAP that calls GASMAN(\"collect\"); between test files. I've run it 4 times before today (I think) and there have been no failures. It seems likely that the issue is that the VM on travis is genuinely running out of memory. The reason is that some objects appear to GAP's garbage collector as having a small size (the size of pointer or two) but that they really occupy much more space. Consequently, GASMAN doesn't garbage collect these large objects, and instead they accumulate until the VM is out of memory.

james-d-mitchell commented 4 years ago

A "proper" fix for this issue would be some mechanism in GAP for GASMAN to know the proper size of objects allocated via malloc, and an improper fix is to re-implement TestDirectory in Semigroups including the call to GASMAN(\"collect\"); between test files.

fingolfin commented 3 years ago

Just stumbled upon this issue by chance. Properly reflecting external memory pressure in a GC based system is a difficult problem, one we have with e.g. interfaces to other systems (e.g. NormalizInterface) as well.

Unfortunately it's not enough to just "tell GAP the actual size of the external objects", because there is no mechanism in GAP resp. GASMAN which could make use of that; GASMAN only knows and cares about its own heap.

One could add code which tracks an "extra size" for every object managed by the GASMAN, then T_SEMI objects could set that suitably, and GASMAN could try to keep track of the total of that... But then one also needs to figure out a threshold above which usage of external objects would trigger a GC. This would need to be configured, and anyway, what should it be set to? Of course one can come up all kinds of heuristics, but this is difficult in general.

An alternative option, at least for the Semigroups package, might be to use a custom C++ allocator, though, which wraps the default allocator, but whenever it encounters an "out of memory" situation/exception, it first triggers a GC collection; then it retries the allocation; and only gives up if that fails again. (It could also first try a partial GC; then retry; if that fails, try a full GC and retry; if that fails, give up).

Of course that would require that all your C++ classes resp. libsemigroups supports custom allocators, just like the C++ standard library...

A final alternative might be to use custom new/delete implementations, but that has its own share of concerns.

james-d-mitchell commented 3 years ago

@fingolfin thanks for the comments! I hadn't thought of using a custom allocator that triggers a GC in GAP if it gets an exception, that's a great idea (although it's likely to take a while to implement it).

Is there any scope for GAP replacing malloc so that all allocations made by code called from GAP would be in the GASMAN's heap? I don't know enough about how the GASMAN works to know if this is feasible at all, but seem to remember someone mentioning this some years ago.

fingolfin commented 3 years ago

The problem is that GASMAN is a moving GC, so it cannot be used as a plugin replacement for malloc etc..

However, both Boehm GC and the Julia GC can be used in this way. While Boehm GC was originally added for HPC-GAP, there is experimental support for using it in "plain" GAP. (The reason I call it experimental is that while I implemented it and made sure it worked back then, it's not something anybody uses regularly, AFAIK, and so I can't vouch for it...)

james-d-mitchell commented 3 years ago

Aha, right that makes a lot of sense, thanks for the explanation @fingolfin !

wilfwilson commented 3 years ago

Just to note that these crashes (unsurprisingly) still happen after moving from Travis to GitHub actions, here's an example (you can see the output of the failed bit by downloading the log):

In the acting-methods-off bit of SemigroupsTestStandard():

...
2020-12-18T13:05:54.4236722Z      968 ms (302 ms GC) and 112MB allocated for semieunit.tst
2020-12-18T13:05:54.4238346Z testing: /home/runner/work/Semigroups/Semigroups/gaproot/pkg/Semigroups/tst/st\
2020-12-18T13:05:54.4239629Z andard/semiex.tst
2020-12-18T13:05:55.5338611Z     1098 ms (332 ms GC) and 109MB allocated for semiex.tst
2020-12-18T13:05:55.5340234Z testing: /home/runner/work/Semigroups/Semigroups/gaproot/pkg/Semigroups/tst/st\
2020-12-18T13:05:55.5341555Z andard/semiffmat.tst
2020-12-18T13:06:07.6797331Z    12146 ms (361 ms GC) and 269MB allocated for semiffmat.tst
2020-12-18T13:06:07.6800306Z testing: /home/runner/work/Semigroups/Semigroups/gaproot/pkg/Semigroups/tst/st\
2020-12-18T13:06:07.6801971Z andard/semifp.tst
2020-12-18T15:10:12.0870666Z pkg-ci-scripts/run_tests.sh: line 38:   937 Killed                  $GAP ${GAP_TESTFILE:-tst/testall.g}
2020-12-18T15:10:12.5406507Z ##[error]Process completed with exit code 137.
wilfwilson commented 3 years ago

Just to note that this is unfortunately happening fairly often, e.g.

These ones seem to affect the stable-4.11 GAP branch rather than the master branch of GAP, though, which is encouraging.

I think I have also experienced locally on my computer, once. I can't remember the GAP version, or further details. I'll keep an eye out for it happening again.

fingolfin commented 3 years ago

BTW, if you use START_TEST in your tst files, that also triggers a GC

wilfwilson commented 3 years ago

Thanks @fingolfin, yes we do use START_TEST.

I'll just note for the record: the failures that we're getting on GitHub actions these days are manifesting themselves as timeouts. Perhaps GAP is still actually crashing, but the outward appearance is that it's hanging.

ChrisJefferson commented 3 years ago

So, if you ever ask digraphs for a homomorphism, or other thing which runs it's internal schreier sims, it immediately allocates 1.25GB which it never frees (this is in init_data_from_args, which calls new_schreier_sims). Basically, in a 32-bit OS, that amount of allocation is always going to cause problems. Also, none of the mallocs check for a 0 return value (which means failure), so I'm willing to bet on 32-bit OSes, sometimes they fail, and then later when you access the memory you dereference a null pointer.

I'm not saying there is particularly anything to fix here -- you could write a malloc wrapper which asserts on NULL being returned, so you at least get a better error message, or just not support 32-bit OSes.

failures on 64-bit systems I'm less sure about, and I can't (yet) reproduce them, or trigger them with various memory checking tests.

ChrisJefferson commented 3 years ago

One possible fix, could you set MAXVERTS to something like 128 instead of 512 on 32-bit systems (how often do people want to work on such big graphs, particularly on 32-bit systems)?

wilfwilson commented 3 years ago

Thank you for looking into this @ChrisJefferson.

(Funnily enough @james-d-mitchell and I, for probably the first time, were only yesterday remarking to each other that our C Schreier Sims implementation hadn't been causing any problems 😆).

I wrote the implementation of Schreier Sims in Digraphs in roughly 2015, before I knew anything. I'll take a look at it, but reducing MAXVERTS on 32-bit, and also being more sensible about how much memory to allocate in the first instance (using maths), sounds like a good idea.

wilfwilson commented 3 years ago

This still happens sometimes, including in the Azure Pipelines system. For example, in a 32-bit GAP stable-4.10 required-packages build:

testing: /home/gap/inst/gap-4.10.2/pkg/semigroups/tst/standard/semitrans.tst
/home/gap/inst/gap-4.10.2/pkg/semigroups/ci/docker-test.sh: line 76: 30210 Done                    echo "LoadPackage(\"semigroups\"); SemigroupsMakeDoc(); SemigroupsTestStandard(); SEMIGROUPS.TestManualExamples();"
     30211 Segmentation fault      (core dumped) | $GAPSH -A -x 80 -m 768m -o $MEM -T 2>&1
     30212                       | tee -a $TESTLOG
##[error]Bash exited with code '139'.
james-d-mitchell commented 10 months ago

Hopefully resolved in v5.3.0