Closed wilfwilson closed 10 months ago
I’ve noticed this too, at least one of the failures is harmless (we should still fix it though). I’m not super concerned about failures in the 32 bit tests, but I’ll in any case see if I can’t resolve this.
Seems the same what I observe in staging tests under https://travis-ci.org/gap-infra/
Interestingly this appears to be a bug in Digraphs. I'm trying to figure out what it is exactly.
This might also be related to #643
Hope so 🙂
Although the problem is still occurring after has been merged, so maybe not.
So, I think I know what this is now, I've fixed all the bugs I found in libsemigroups and Semigroups that might be related to this issue, and the only thing remaining is in PR #650. In that PR I restrict Semigroups to running one travis job in 32 bit, and to use a hacked version of GAP that calls GASMAN(\"collect\");
between test files. I've run it 4 times before today (I think) and there have been no failures. It seems likely that the issue is that the VM on travis is genuinely running out of memory. The reason is that some objects appear to GAP's garbage collector as having a small size (the size of pointer or two) but that they really occupy much more space. Consequently, GASMAN doesn't garbage collect these large objects, and instead they accumulate until the VM is out of memory.
A "proper" fix for this issue would be some mechanism in GAP for GASMAN to know the proper size of objects allocated via malloc
, and an improper fix is to re-implement TestDirectory
in Semigroups including the call to GASMAN(\"collect\");
between test files.
Just stumbled upon this issue by chance. Properly reflecting external memory pressure in a GC based system is a difficult problem, one we have with e.g. interfaces to other systems (e.g. NormalizInterface
) as well.
Unfortunately it's not enough to just "tell GAP the actual size of the external objects", because there is no mechanism in GAP resp. GASMAN which could make use of that; GASMAN only knows and cares about its own heap.
One could add code which tracks an "extra size" for every object managed by the GASMAN, then T_SEMI
objects could set that suitably, and GASMAN could try to keep track of the total of that... But then one also needs to figure out a threshold above which usage of external objects would trigger a GC. This would need to be configured, and anyway, what should it be set to? Of course one can come up all kinds of heuristics, but this is difficult in general.
An alternative option, at least for the Semigroups package, might be to use a custom C++ allocator, though, which wraps the default allocator, but whenever it encounters an "out of memory" situation/exception, it first triggers a GC collection; then it retries the allocation; and only gives up if that fails again. (It could also first try a partial GC; then retry; if that fails, try a full GC and retry; if that fails, give up).
Of course that would require that all your C++ classes resp. libsemigroups supports custom allocators, just like the C++ standard library...
A final alternative might be to use custom new/delete implementations, but that has its own share of concerns.
@fingolfin thanks for the comments! I hadn't thought of using a custom allocator that triggers a GC in GAP if it gets an exception, that's a great idea (although it's likely to take a while to implement it).
Is there any scope for GAP replacing malloc
so that all allocations made by code called from GAP would be in the GASMAN's heap? I don't know enough about how the GASMAN works to know if this is feasible at all, but seem to remember someone mentioning this some years ago.
The problem is that GASMAN is a moving GC, so it cannot be used as a plugin replacement for malloc
etc..
However, both Boehm GC and the Julia GC can be used in this way. While Boehm GC was originally added for HPC-GAP, there is experimental support for using it in "plain" GAP. (The reason I call it experimental is that while I implemented it and made sure it worked back then, it's not something anybody uses regularly, AFAIK, and so I can't vouch for it...)
Aha, right that makes a lot of sense, thanks for the explanation @fingolfin !
Just to note that these crashes (unsurprisingly) still happen after moving from Travis to GitHub actions, here's an example (you can see the output of the failed bit by downloading the log):
In the acting-methods-off bit of SemigroupsTestStandard()
:
...
2020-12-18T13:05:54.4236722Z 968 ms (302 ms GC) and 112MB allocated for semieunit.tst
2020-12-18T13:05:54.4238346Z testing: /home/runner/work/Semigroups/Semigroups/gaproot/pkg/Semigroups/tst/st\
2020-12-18T13:05:54.4239629Z andard/semiex.tst
2020-12-18T13:05:55.5338611Z 1098 ms (332 ms GC) and 109MB allocated for semiex.tst
2020-12-18T13:05:55.5340234Z testing: /home/runner/work/Semigroups/Semigroups/gaproot/pkg/Semigroups/tst/st\
2020-12-18T13:05:55.5341555Z andard/semiffmat.tst
2020-12-18T13:06:07.6797331Z 12146 ms (361 ms GC) and 269MB allocated for semiffmat.tst
2020-12-18T13:06:07.6800306Z testing: /home/runner/work/Semigroups/Semigroups/gaproot/pkg/Semigroups/tst/st\
2020-12-18T13:06:07.6801971Z andard/semifp.tst
2020-12-18T15:10:12.0870666Z pkg-ci-scripts/run_tests.sh: line 38: 937 Killed $GAP ${GAP_TESTFILE:-tst/testall.g}
2020-12-18T15:10:12.5406507Z ##[error]Process completed with exit code 137.
Just to note that this is unfortunately happening fairly often, e.g.
These ones seem to affect the stable-4.11
GAP branch rather than the master
branch of GAP, though, which is encouraging.
I think I have also experienced locally on my computer, once. I can't remember the GAP version, or further details. I'll keep an eye out for it happening again.
BTW, if you use START_TEST in your tst files, that also triggers a GC
Thanks @fingolfin, yes we do use START_TEST
.
I'll just note for the record: the failures that we're getting on GitHub actions these days are manifesting themselves as timeouts. Perhaps GAP is still actually crashing, but the outward appearance is that it's hanging.
So, if you ever ask digraphs for a homomorphism, or other thing which runs it's internal schreier sims, it immediately allocates 1.25GB which it never frees (this is in init_data_from_args, which calls new_schreier_sims). Basically, in a 32-bit OS, that amount of allocation is always going to cause problems. Also, none of the mallocs check for a 0 return value (which means failure), so I'm willing to bet on 32-bit OSes, sometimes they fail, and then later when you access the memory you dereference a null pointer.
I'm not saying there is particularly anything to fix here -- you could write a malloc wrapper which asserts on NULL being returned, so you at least get a better error message, or just not support 32-bit OSes.
failures on 64-bit systems I'm less sure about, and I can't (yet) reproduce them, or trigger them with various memory checking tests.
One possible fix, could you set MAXVERTS to something like 128 instead of 512 on 32-bit systems (how often do people want to work on such big graphs, particularly on 32-bit systems)?
Thank you for looking into this @ChrisJefferson.
(Funnily enough @james-d-mitchell and I, for probably the first time, were only yesterday remarking to each other that our C Schreier Sims implementation hadn't been causing any problems 😆).
I wrote the implementation of Schreier Sims in Digraphs in roughly 2015, before I knew anything. I'll take a look at it, but reducing MAXVERTS
on 32-bit, and also being more sensible about how much memory to allocate in the first instance (using maths), sounds like a good idea.
This still happens sometimes, including in the Azure Pipelines system. For example, in a 32-bit GAP stable-4.10 required-packages build:
testing: /home/gap/inst/gap-4.10.2/pkg/semigroups/tst/standard/semitrans.tst
/home/gap/inst/gap-4.10.2/pkg/semigroups/ci/docker-test.sh: line 76: 30210 Done echo "LoadPackage(\"semigroups\"); SemigroupsMakeDoc(); SemigroupsTestStandard(); SEMIGROUPS.TestManualExamples();"
30211 Segmentation fault (core dumped) | $GAPSH -A -x 80 -m 768m -o $MEM -T 2>&1
30212 | tee -a $TESTLOG
##[error]Bash exited with code '139'.
Hopefully resolved in v5.3.0
Over the last couple of weeks, I have noticed sporadic but fairly frequent failed Travis jobs for Semigroups. These seem to be various kinds of actual crashes.
I have restarted all of these jobs as they happened, and they passed eventually. This suggests that it’s perhaps going to be difficult to reproduce locally and track down. But I’ll stop doing that and record the problems here, so that maybe we can work towards resolving this.
I haven’t noticed any patterns yet with which jobs are failing, and where. My guess would that be this is libsemigroups 1.x related - but who knows?!
32-bit / master / latest packages
32-bit / stable-4.11 / latest packages
32-bit / stable-4.9 / required packages
64-bit / stable-4.10 / required packages
64-bit / stable-4.11 / required packages
64-bit / master / latest packages