thofma opened 10 months ago
No Idea
Some weird GC corruption that seems to happen during the Serialization/IPC tests; it seems related to Julia tasks, but I haven't been able to reproduce it locally. I have the long testset running in a loop under rr to trigger and capture this (currently at about 100 iterations).
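For reference, the loop is essentially a small driver along these lines (a sketch; the exact test invocation below is a placeholder, not the real command):

```julia
# Sketch of a driver that re-runs the long test group under rr until a crash
# is captured. The Pkg.test invocation is a placeholder for the real command.
for iteration in 1:1_000
    @info "Starting iteration" iteration
    cmd = `rr record julia --project=. -e 'using Pkg; Pkg.test("Oscar"; test_args=["long"])'`
    if !success(cmd)          # success() returns false on a non-zero exit code
        @warn "Crash captured by rr" iteration
        break                 # the most recent rr trace now contains the failing run
    end
end
```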
So far I have gotten only one other crash, but in the test group elliptic_surface.jl, which runs before the IPC stuff:
[4832] signal (11.1): Segmentation fault
in expression starting at /home/datastore/lorenz/software/julia/Oscar.jl/test/AlgebraicGeometry/Schemes/elliptic_surface.jl:1
jl_object_id__cold at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/builtins.c:455
type_hash at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1575
typekey_hash at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1605
jl_precompute_memoized_dt at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1685
inst_datatype_inner at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:2081
jl_inst_arg_tuple_type at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:2176
arg_type_tuple at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2232 [inlined]
jl_lookup_generic_ at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3020 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3072
iterate at ./generator.jl:47 [inlined]
collect at ./array.jl:834
unknown function (ip: 0x1522095c16a5)
convert_return at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:216
unknown function (ip: 0x1522095c11c9)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#197 at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:92
unknown function (ip: 0x1522095c141c)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
convert_normal_value at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:152
unknown function (ip: 0x1522095c1336)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
convert_return at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:223
unknown function (ip: 0x1522095c11c9)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
low_level_caller_rng at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:378
minAssGTZ at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/Meta.jl:45
unknown function (ip: 0x1522095c0389)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#minimal_primes#335 at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:830
minimal_primes at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:818 [inlined]
__compute_is_prime__ at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:1255
#356 at /home/datastore/lorenz/software/julia/depot/packages/AbstractAlgebra/R29qD/src/Attributes.jl:357 [inlined]
get! at ./dict.jl:479
get_attribute! at /home/datastore/lorenz/software/julia/depot/packages/AbstractAlgebra/R29qD/src/Attributes.jl:230 [inlined]
is_prime at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:1254
unknown function (ip: 0x1521620272f5)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
__compute_is_prime__ at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpolyquo-localizations.jl:1853
#914 at /home/datastore/lorenz/software/julia/depot/packages/AbstractAlgebra/R29qD/src/Attributes.jl:357 [inlined]
get! at ./dict.jl:479
unknown function (ip: 0x152162026da0)
typeinf_local and deserialize occur; perhaps this is something related to type inference during the deserialization, like the compiler getting an unexpected type. I can imagine something like this happening in deserialization, but why only in this test?
The same crash as described by @benlorenz also happened in the corresponding test run for #3018 after the changes that were pushed yesterday.
The second backtrace reported here by @benlorenz involves Singular.jl and the primdec library function minAssGTZ -- specifically the code in Singular.jl that converts its return value to Julia. Maybe there is a GC.@preserve missing there, or some other bug. Perhaps it causes a memory corruption and then triggers the other crash, too... but even if it does not, that needs to be solved.
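To illustrate the general pattern (a minimal sketch, not the actual Singular.jl code): a raw pointer derived from a Julia object is only valid while that object is kept rooted, which is what GC.@preserve is for:

```julia
# Minimal illustration of the rooting requirement (not the actual Singular.jl code).
function first_element(v::Vector{Int})
    p = pointer(v)
    # Without GC.@preserve nothing keeps `v` alive here; if the GC ran between
    # pointer(v) and the load, p could point into freed or reused memory.
    return GC.@preserve v unsafe_load(p)
end
```

In the Singular.jl case the data comes back from the kernel via an sleftv, so the rooting question sits on the conversion path rather than on a Julia array, but the failure mode (the GC freeing or moving memory that is still being read) is the same.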
After digging into the first backtrace again, I see that this is a GC corruption error (https://github.com/oscar-system/Oscar.jl/actions/runs/7364038493/job/20044077891#step:7:4949), so it could be due to the same issue.
I have a preliminary fix for the crash I reported (jl_object_id__cold) here: https://github.com/oscar-system/Singular.jl/pull/749. It adds a missing GC protection in the libsingular_julia code for passing data from a sleftv back to Julia. I want to do some further testing now; unfortunately (for me at least ...) these crashes are rather rare.
The original error (GC error (probable corruption)) also happens on macOS, observed during my flint 2.9 backport testing:
https://github.com/benlorenz/Oscar.jl/actions/runs/7583365592/job/20655088309#step:9:4981
(but even less often than on Ubuntu)
Another occurrence (on macOS): https://github.com/oscar-system/Oscar.jl/actions/runs/7585991305/job/20663114060?pr=3213
In both recent occurrences, the crash happened shortly after we see
Testing test/AlgebraicGeometry/Schemes/elliptic_surface.jl [...]
which I think means it is probably in the middle of testing test/Serialization/IPC.jl? (There is no "Starting tests for ..." message before that; perhaps we could add such a message?)
Specifically, if we add a "Starting tests..." message before loading IPC.jl, and also force a full GC before that message, then perhaps we can get a better idea as to whether the corruption happens before IPC.jl, or during it?
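A sketch of what that could look like in the test harness (the helper below is hypothetical; Oscar's actual test runner may differ):

```julia
# Hypothetical helper (not the actual Oscar test harness): announce the file and
# force a full collection first, so corruption caused by an earlier test file is
# more likely to surface before this one starts.
function run_test_file(path::AbstractString)
    GC.gc(true)                      # full (non-incremental) collection
    @info "Starting tests for $path"
    include(path)
end
```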
I can add the message, but I would like to hold off a bit on adding something like an explicit GC call for now, since we just started running the tests with libsingular_julia 0.40.11, which is the first version that includes my sleftv fix. (At least until we see another error with that version...)
It still happens with the new libsingular, and even with the explicit GC call it happens within the IPC.jl tests: https://github.com/oscar-system/Oscar.jl/actions/runs/7638814195/job/20810486432?pr=3229#step:8:4959 Unfortunately, I haven't been able to reproduce this crash outside of GitHub Actions. I have two jobs running the long testsuite, with 300 successful iterations so far.
Also happened in https://github.com/oscar-system/Oscar.jl/actions/runs/7638161558/job/20808482695?pr=3226
Could it be that it again can only be reproduced on a memory-starved machine, with 7-8 GB of RAM?
The workers should be less memory-starved now; they were recently upgraded to 4 CPUs and 16 GB of memory.
A recent crash is reported at https://github.com/oscar-system/Oscar.jl/actions/runs/7642653452/job/20822790076?pr=3236
I have opened a PR to disable the IPC test for now while I try to debug this further: https://github.com/oscar-system/Oscar.jl/pull/3246
And here is an instance of the crash with Julia 1.9: https://github.com/oscar-system/Oscar.jl/actions/runs/7665378425/job/20891166477?pr=3247
Thanks for noticing. That is interesting; it turns out that doing GC.gc() before the IPC.jl tests seems to increase the rate at which the error occurs. (But still only on GitHub Actions so far ...)
Maybe that helped trigger this on 1.9 as well.
Our CI looks a lot better now without the IPC.jl tests, which should help with development. But I am continuing to look into this. Please post any further errors you notice in the CI.
I just found this one during QuadFormAndIsom, unfortunately without any backtrace:
Sat, 27 Jan 2024 14:58:45 GMT GC: pause 27.39ms. collected 39.011118MB. incr
Sat, 27 Jan 2024 14:58:45 GMT corrupted double-linked list
Sat, 27 Jan 2024 14:58:45 GMT
Sat, 27 Jan 2024 14:58:45 GMT [1921] signal (6.-6): Aborted
Sat, 27 Jan 2024 14:58:45 GMT in expression starting at /home/runner/work/Oscar.jl/Oscar.jl/experimental/QuadFormAndIsom/test/runtests.jl:269
Sat, 27 Jan 2024 17:09:25 GMT Error: The operation was canceled.
from https://github.com/oscar-system/Oscar.jl/actions/runs/7679187557/job/20929824694?pr=3212#step:8:1790
After some more debugging I found that the error will quite surely be gone once 1.10.1 is released, fixed via JuliaLang/julia@8a04df0 (#52755). I don't really know why this happens so much more on 1.10, but it is probably due to the more aggressive GC. In this workflow I have about 150 successful runs of the long group including the IPC.jl tests, with an intermediate Julia build from the backports-release-1.10 branch.
So once that is released I will try to reactivate these tests and hopefully close this ticket.
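The reactivation could then be gated on the Julia version, roughly like this (a sketch; the actual test selection in Oscar may work differently, and 1.9 is left aside here for simplicity):

```julia
# Sketch: only re-enable the IPC tests on a Julia that contains the backported
# GC fix (JuliaLang/julia#52755), i.e. 1.10.1 or newer.
if VERSION >= v"1.10.1"
    include("Serialization/IPC.jl")
else
    @info "Skipping Serialization/IPC.jl until the GC fix from Julia 1.10.1 is available"
end
```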
This is back: https://github.com/Nemocas/Nemo.jl/actions/runs/8546742962/job/23417708965?pr=1700
(This downstream test run only checks Oscar.)
If one looks at https://github.com/oscar-system/Oscar.jl/commits/master/, one sees that often "Run tests / test (~1.10.0-0, long, ubuntu-latest) (push)" fails. The error looks scary, e.g. in https://github.com/oscar-system/Oscar.jl/actions/runs/7364038493/job/20044077891#step:7:4952 and https://github.com/oscar-system/Oscar.jl/actions/runs/7364038493/job/20044077891#step:7:26094:
Does anyone have an idea where that might be coming from? I have not tried to reproduce it locally. It does not look like https://github.com/oscar-system/Oscar.jl/issues/2441.
CC: @lgoettgens @benlorenz