oscar-system / Oscar.jl

A comprehensive open source computer algebra system for computations in algebra, geometry, and number theory.
https://www.oscar-system.org

Error on 1.10 ubuntu long #3184

Open thofma opened 10 months ago

thofma commented 10 months ago

If one looks at https://github.com/oscar-system/Oscar.jl/commits/master/, one sees that often "Run tests / test (~1.10.0-0, long, ubuntu-latest) (push)" fails. The error looks scary, e.g. in https://github.com/oscar-system/Oscar.jl/actions/runs/7364038493/job/20044077891#step:7:4952 and https://github.com/oscar-system/Oscar.jl/actions/runs/7364038493/job/20044077891#step:7:26094:

!!! ERROR in jl_ -- ABORTING !!!

Does anyone have an idea where that might be coming from? I have not tried to reproduce it locally. It does not look like https://github.com/oscar-system/Oscar.jl/issues/2441.

CC: @lgoettgens @benlorenz

lgoettgens commented 10 months ago

No idea.

benlorenz commented 10 months ago

This looks like some weird GC corruption that seems to occur while the Serialization/IPC tests run. It seems related to Julia tasks, but I haven't been able to reproduce it locally. I have the long test set running in a loop under rr to trigger and capture it (currently at about 100 iterations).

So far I have only got one other crash, and that one was in the test group elliptic_surface.jl, which runs before the IPC stuff:

[4832] signal (11.1): Segmentation fault
in expression starting at /home/datastore/lorenz/software/julia/Oscar.jl/test/AlgebraicGeometry/Schemes/elliptic_surface.jl:1
jl_object_id__cold at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/builtins.c:455
type_hash at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1575
typekey_hash at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1605
jl_precompute_memoized_dt at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1685
inst_datatype_inner at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:2081
jl_inst_arg_tuple_type at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:2176
arg_type_tuple at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2232 [inlined]
jl_lookup_generic_ at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3020 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3072
iterate at ./generator.jl:47 [inlined]
collect at ./array.jl:834
unknown function (ip: 0x1522095c16a5)
convert_return at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:216
unknown function (ip: 0x1522095c11c9)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#197 at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:92
unknown function (ip: 0x1522095c141c)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
convert_normal_value at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:152
unknown function (ip: 0x1522095c1336)    
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
convert_return at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:223
unknown function (ip: 0x1522095c11c9)    
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
low_level_caller_rng at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:378
minAssGTZ at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/Meta.jl:45
unknown function (ip: 0x1522095c0389)    
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#minimal_primes#335 at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:830
minimal_primes at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:818 [inlined]
__compute_is_prime__ at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:1255
#356 at /home/datastore/lorenz/software/julia/depot/packages/AbstractAlgebra/R29qD/src/Attributes.jl:357 [inlined]
get! at ./dict.jl:479    
get_attribute! at /home/datastore/lorenz/software/julia/depot/packages/AbstractAlgebra/R29qD/src/Attributes.jl:230 [inlined]
is_prime at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:1254
unknown function (ip: 0x1521620272f5)    
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
__compute_is_prime__ at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpolyquo-localizations.jl:1853
#914 at /home/datastore/lorenz/software/julia/depot/packages/AbstractAlgebra/R29qD/src/Attributes.jl:357 [inlined]
get! at ./dict.jl:479                    
unknown function (ip: 0x152162026da0)

jankoboehm commented 10 months ago

typeinf_local and deserialize occur in the backtrace, so perhaps this is related to type inference during deserialization, e.g. the compiler getting an unexpected type. I can imagine something like this happening in deserialization, but why only in this test?

ThomasBreuer commented 9 months ago

The same crash as described by @benlorenz happened also in the corresponding test run for #3018 after the changes that were pushed yesterday.

fingolfin commented 9 months ago

The second backtrace reported here by @benlorenz involves Singular.jl and the primdec library function minAssGTZ -- specifically the code in Singular.jl which converts its return value to Julia. Maybe there is a GC.@preserve missing there, or some other bug. Perhaps it causes memory corruption and then triggers the other crash, too... Even if not, it needs to be solved.
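For readers unfamiliar with this kind of GC protection, here is a minimal, generic sketch of how GC.@preserve is used in Julia. It is not the Singular.jl conversion code; the function name `sum_via_pointer` is hypothetical and only illustrates why an object must stay rooted while a raw pointer derived from it is in use.

```julia
# Hypothetical illustration only; this is NOT code from Singular.jl.
# Without GC.@preserve, the compiler may consider `v` dead once `pointer(v)`
# has been taken, and a collection could then free the memory `p` points to.
function sum_via_pointer(v::Vector{Int})
    total = 0
    GC.@preserve v begin
        p = pointer(v)              # raw pointer, invisible to the GC
        for i in 1:length(v)
            total += unsafe_load(p, i)
        end
    end
    return total
end

sum_via_pointer([1, 2, 3])
```

Bugs of this kind tend to show up only when a collection happens at exactly the wrong moment, which would be consistent with crashes that are rare and hard to reproduce.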

lgoettgens commented 9 months ago

After digging into the first backtrace again, this is a GC corruption error (https://github.com/oscar-system/Oscar.jl/actions/runs/7364038493/job/20044077891#step:7:4949), so this could be due to the same issue.

benlorenz commented 9 months ago

I have a preliminary fix for the crash I reported (jl_object_id__cold) here: https://github.com/oscar-system/Singular.jl/pull/749. This adds a missing GC protection in the libsingular_julia code for passing data from a sleftv back to Julia. I want to do some further testing now; unfortunately (for me at least ...) these crashes are rather rare.

benlorenz commented 9 months ago

The original error (GC error (probable corruption)) also happens on macOS, observed during my flint 2.9 backport testing: https://github.com/benlorenz/Oscar.jl/actions/runs/7583365592/job/20655088309#step:9:4981 (but even less often than on Ubuntu)

joschmitt commented 9 months ago

Another occurrence (on macOS): https://github.com/oscar-system/Oscar.jl/actions/runs/7585991305/job/20663114060?pr=3213

fingolfin commented 9 months ago

In both recent occurrences, the crash happened shortly after we see

Testing test/AlgebraicGeometry/Schemes/elliptic_surface.jl [...]

which I think means it is probably in the middle of testing test/Serialization/IPC.jl? (There is no message "Starting tests for ..." before that, perhaps we could add such a message?)

fingolfin commented 9 months ago

Specifically, if we add a "Starting tests..." message before loading IPC.jl, and also force a full GC before that message, then perhaps we can get a better idea as to whether the corruption happens before IPC.jl, or during it?
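A minimal sketch of what such a marker could look like, assuming a small helper in the test driver. The name `with_gc_marker` is hypothetical and not part of Oscar's test infrastructure.

```julia
# Hypothetical helper: force a full GC, print a marker, then run the tests.
# If earlier tests corrupted the heap, the forced collection gives the crash
# a chance to surface *before* the marker rather than somewhere inside `f`.
function with_gc_marker(f, label::AbstractString)
    GC.gc(true)                          # full (non-incremental) collection
    @info "Starting tests for $label"
    flush(stdout); flush(stderr)         # make sure the marker reaches the CI log
    return f()
end

result = with_gc_marker("test/Serialization/IPC.jl") do
    # In a real run this would be: include("test/Serialization/IPC.jl")
    :done
end
```

A crash logged before the marker would point at the earlier test groups, while a crash after it would point at IPC.jl itself.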

benlorenz commented 9 months ago

I can add the message, but I would like to hold off a bit on adding something like an explicit GC call, since we just started running the tests with libsingular_julia 0.40.11, which is the first version including my sleftv fix. (At least until we see another error with that version...)

benlorenz commented 9 months ago

It still happens with the new libsingular, and even with the explicit GC call it happens within the IPC.jl tests: https://github.com/oscar-system/Oscar.jl/actions/runs/7638814195/job/20810486432?pr=3229#step:8:4959

Unfortunately I haven't been able to reproduce this crash outside of GitHub Actions. I have two jobs running the long test suite, with 300 successful iterations so far.

fingolfin commented 9 months ago

It also happened here: https://github.com/oscar-system/Oscar.jl/actions/runs/7638161558/job/20808482695?pr=3226

Could it be that it can again only be reproduced on a memory-starved machine, with 7-8 GB RAM?

benlorenz commented 9 months ago

The workers should be less memory starved now, they were recently upgraded to have 4 CPUs and 16 GB of memory.

ThomasBreuer commented 9 months ago

A recent crash is reported at https://github.com/oscar-system/Oscar.jl/actions/runs/7642653452/job/20822790076?pr=3236

benlorenz commented 9 months ago

I have opened a PR to disable the IPC test for now while I try to debug this further: https://github.com/oscar-system/Oscar.jl/pull/3246

fingolfin commented 9 months ago

And here is an instance of the crash with Julia 1.9: https://github.com/oscar-system/Oscar.jl/actions/runs/7665378425/job/20891166477?pr=3247

benlorenz commented 9 months ago

Thanks for noticing. That is interesting: it turns out that calling GC.gc() before the IPC.jl tests seems to increase the rate at which the error occurs. (But still only on GitHub Actions so far ...) Maybe that helped trigger this on 1.9 as well.

benlorenz commented 9 months ago

Our CI looks a lot better now without the IPC.jl tests, which should help with development. But I am continuing to look into this. Please post any further errors you notice in the CI.

I just found this one during QuadFormAndIsom, unfortunately without any backtrace:

Sat, 27 Jan 2024 14:58:45 GMT GC: pause 27.39ms. collected 39.011118MB. incr 
Sat, 27 Jan 2024 14:58:45 GMT corrupted double-linked list
Sat, 27 Jan 2024 14:58:45 GMT
Sat, 27 Jan 2024 14:58:45 GMT [1921] signal (6.-6): Aborted
Sat, 27 Jan 2024 14:58:45 GMT in expression starting at /home/runner/work/Oscar.jl/Oscar.jl/experimental/QuadFormAndIsom/test/runtests.jl:269
Sat, 27 Jan 2024 17:09:25 GMT Error: The operation was canceled.

from https://github.com/oscar-system/Oscar.jl/actions/runs/7679187557/job/20929824694?pr=3212#step:8:1790

benlorenz commented 9 months ago

After some more debugging I found that the error will almost certainly be gone once 1.10.1 is released; it is fixed via JuliaLang/julia@8a04df0 (#52755). I don't really know why this happens so much more often on 1.10, but it is probably due to the more aggressive GC.

In this workflow I have about 150 successful runs of the long group including the IPC.jl tests, with an intermediate Julia build from the backports-release-1.10 branch.

So once that is released I will try to reactivate these tests and hopefully close this ticket.
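One way the reactivation could be gated, as a rough sketch only: the 1.10.1 threshold is taken from the comment above, and the names `IPC_FIX_VERSION` and `ipc_tests_enabled` are hypothetical. Whether older release series such as 1.9 would also receive the fix is not stated here, so this sketch simply leaves them disabled.

```julia
# Assumption: the GC fix first ships in Julia 1.10.1 (per the comment above).
const IPC_FIX_VERSION = v"1.10.1"

# Hypothetical predicate: run the IPC tests only on Julia versions that are
# expected to contain the fix.
ipc_tests_enabled(v::VersionNumber = VERSION) = v >= IPC_FIX_VERSION

if ipc_tests_enabled()
    @info "IPC tests enabled on Julia $VERSION"
    # include("Serialization/IPC.jl")   # hypothetical path relative to the test driver
else
    @info "Skipping IPC tests on Julia $VERSION"
end
```

Note that pre-release builds such as v"1.10.1-rc1" compare as older than v"1.10.1", so they would still skip the tests under this check.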

thofma commented 7 months ago

This is back: https://github.com/Nemocas/Nemo.jl/actions/runs/8546742962/job/23417708965?pr=1700

(This downstream test run only checks Oscar.)