Closed: lgoettgens closed this issue 6 months ago.
After "wall of chambers" comes
Test Summary: | Pass Total
K3 surface automorphism groups | 15 15
-> Testing ./AlgebraicGeometry/Surfaces/K3Auto.jl took: 154.803 seconds, 31089.35 MB
(copied from a different run), so I am not too surprised that this might get killed. Same for the second log you posted, where the next test would be
Test Summary: | Pass Total Time
Line bundle cohomologies and vanishing sets (set_attributes = true) | 30 30 27.9s
Test Summary: | Pass Total Time
Line bundle cohomologies and vanishing sets (set_attributes = false) | 30 30 24.3s
-> Testing AlgebraicGeometry/ToricVarieties/line_bundle_cohomologies.jl took: runtime 50.026 seconds + compilation 7.299 seconds + recompilation 0.000 seconds, 12774.31 MB
See #2570 for some more examples of this; it happens not only with julia 1.8 but also with 1.6 and 1.10.
So far we have seen this happen:
AlgebraicGeometry/ToricVarieties/line_bundle_cohomologies.jl: https://github.com/oscar-system/Oscar.jl/actions/runs/5133045620/jobs/9235131573, more logs linked in #2570, cc: @HereAround
AlgebraicGeometry/Schemes/Glueing.jl: https://github.com/oscar-system/Oscar.jl/actions/runs/5592619727/jobs/10225306056#step:6:3051, cc: @HechtiDerLachs
AlgebraicGeometry/Surfaces/K3Auto.jl: https://github.com/oscar-system/Oscar.jl/actions/runs/5145822885/jobs/9263999579, cc: @simonbrandhorst
Feel free to add more examples.
Is it possible to get more information? Like did the process OOM or did it take too long? There is some randomization involved in K3Auto.jl. I am wondering where it fails exactly. Maybe there is a bug somewhere. Can this be reproduced using some known random seed?
I can try to make the test run faster. But I think that more than 50% of it is already compilation time. So how to proceed?
I think the process was killed due to an OOM situation. According to the timestamps in the log (can be turned on via the settings knob in the top right of that page) this was about 1.5 minutes after the walls of chamber
testset was complete. (The total duration of that job is about 1 hour which is far away from any time limit).
Unfortunately we can't get any more information from this failure.
Maybe you can adapt the code to use a custom seeded rng, e.g. Oscar.get_seeded_rng(); then we might be able to reproduce the error if it happens again (the seed for this is fixed in the CI, but one could easily run the test in a loop with different seeds to find bad examples).
For the default julia rng knowing the seed would probably not really help as there might be an unknown + non-deterministic number of other uses which change the rng.
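As a sketch of what that could look like (the way the rng would be threaded through the K3Auto.jl code is hypothetical; only Oscar.get_seeded_rng() is taken from the suggestion above):

```julia
using Oscar, Random

# rng seeded deterministically (the seed is fixed in CI), so a failing
# run can be replayed with the same random choices
rng = Oscar.get_seeded_rng()

# hypothetical usage: pass the rng explicitly to the randomized steps
# instead of relying on Julia's global default rng
v = rand(rng, -10:10, 5)
```

Running the test in a loop with different seeds would then allow hunting for bad examples systematically.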
I cannot say much towards the detailed cause or a possible resolution. However, I can share a few insights into AlgebraicGeometry/ToricVarieties/line_bundle_cohomologies.jl, which seems to be the place where this segfault is triggered quite often. (In fact, I experienced this myself the other day and considered it a fluke. At that time, however, I was not aware of this issue yet.)
AlgebraicGeometry/ToricVarieties/line_bundle_cohomologies.jl has not been changed in the last couple of months. I would thus think that either this has always been a problem and we did not notice it, or something else changed more recently and is now causing this. The tests use cohomCalg to execute a few tasks. This is C++ software (cf. https://github.com/BenjaminJurke/cohomCalg), which @fingolfin wrapped for me about 2 years (?) back. There are some complex combinatorial tasks executed by this software, which can become fairly challenging. However, the examples chosen in this test file are sufficiently elementary that I would say these computations must be executable on essentially any computer. As said, I am not sure about the cause. But maybe this information helps @benlorenz to guess/track down the cause?
The original report in this ticket was from the first of June, the last change in the line_bundle_cohomologies test file was just a week earlier: https://github.com/oscar-system/Oscar.jl/pull/2396
Please note that this is not a segfault but the process seems to run out of memory.
Since julia reports about 12GB of allocations for that test-file the memory-intensive code seems to use julia memory allocations, i.e. it is probably not cohomCalg.
A good start might be to check which of the test-groups in that file need a lot of allocations, e.g. using @time?
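A minimal illustration of that approach (the testset name here is just a stand-in, not one from the actual file):

```julia
using Test

# wrapping a testset in @time reports its runtime and, more importantly
# here, its total allocations
@time @testset "line bundle cohomology examples" begin
    @test 1 + 1 == 2
end
```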
Another new case:
-> Testing AlgebraicGeometry/Schemes/CoherentSheaves.jl took: 78.817 seconds, 5119.89 MB
signal (15): Terminated
in expression starting at none:1
epoll_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
from the OscarCI tests in GAP: https://github.com/oscar-system/GAP.jl/actions/runs/5608123338/jobs/10260113884#step:7:1471
The next testgroup after CoherentSheaves.jl would be elliptic_surface.jl which also needs quite a lot of memory allocations:
-> Testing AlgebraicGeometry/Schemes/CoherentSheaves.jl took: runtime 11.783 seconds + compilation 74.945 seconds + recompilation 0.000 seconds, 3271.67 MB
Test Summary: | Pass Total Time
elliptic surfaces | 2 2 5m10.8s
-> Testing AlgebraicGeometry/Schemes/elliptic_surface.jl took: runtime 151.211 seconds + compilation 159.639 seconds + recompilation 0.000 seconds, 15826.34 MB
@benlorenz I stand corrected. Thank you for reminding me of this.
I could narrow this down to a single line. This line computes all monomials of a certain degree in a certain graded ring (a.k.a. a homogeneous_component). I executed AlgebraicGeometry/ToricVarieties/line_bundle_cohomologies.jl with @time. Here is what I find:
42.267645 seconds (277.45 M allocations: 12.313 GiB, 37.02% gc time)
3.380005 seconds (485.06 k allocations: 15.583 MiB)
So a fix could be to just remove this line from the tests. A corresponding PR is here: https://github.com/oscar-system/Oscar.jl/pull/2579.
Alternatively, we could of course dive deeper and see if the homogeneous component computation could be improved. If the latter is interesting, it can be reproduced as follows:
julia> ray_generators = [[1, 0, 0,-2,-3], [0, 1, 0,-2,-3], [0, 0, 1,-2,-3], [-1,-1,-1,-2,-3], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 0, 0,-2,-3]];
julia> max_cones = [[1, 2, 3, 5, 6], [1, 2, 3, 5, 7], [1, 2, 3, 6, 7], [2, 3, 4, 5, 6], [2, 3, 4, 5, 7], [2, 3, 4, 6, 7], [1, 3, 4, 5, 6], [1, 3, 4, 5, 7], [1, 3, 4, 6, 7], [1, 2, 4, 5, 6], [1, 2, 4, 5, 7], [1, 2, 4, 6, 7]];
julia> weierstrass_over_p3 = normal_toric_variety(ray_generators, max_cones; non_redundant = true);
julia> R = cox_ring(weierstrass_over_p3)
Multivariate polynomial ring in 7 variables over QQ graded by
x1 -> [1 0]
x2 -> [1 0]
x3 -> [1 0]
x4 -> [1 0]
x5 -> [0 2]
x6 -> [0 3]
x7 -> [-4 1]
Of this ring R, I wish to compute all monomials of degree (0,4). This is a vector space of dimension 4551. Specifically, for this, I would have executed the following commands:
julia> l4 = anticanonical_bundle(weierstrass_over_p3);
julia> b = basis_of_global_sections(l4);
julia> length(b) == 4551
true
Let me cc @wdecker regarding the computation of homogeneous components, just in case.
The elliptic surface stuff is a really complex data structure and computation. We worked hard to get this test running in 2 minutes. I am afraid that the test cannot be simplified much further. Plus it takes more time for compilation than for computation right now. Not sure what can be done about the allocations though. There is probably room for improvement all over the place.
> Of this ring R, I wish to compute all monomials of degree (0,4). This is a vector space of dimension 4551.
julia> b[1:5]
5-element Vector{MPolyDecRingElem{QQFieldElem, QQMPolyRingElem}}:
x6^2
x5^3
x4^4*x5*x6*x7
x4^8*x5^2*x7^2
x4^12*x6*x7^3
This looks like degree (0,6) to me?
To me this feels like integer points in a polyhedron given by some non-negative constraints and two equations:
julia> exps = solve_non_negative(ZZMatrix, matrix(ZZ, [1 1 1 1 0 0 -4; 0 0 0 0 2 3 1]), matrix(ZZ, 2,1, [0,4]));
julia> nrows(exps)
1170
julia> exps[1:5,:]
[0 0 0 0 2 0 0]
[0 0 0 4 0 1 1]
[0 0 0 8 1 0 2]
[0 0 0 16 0 0 4]
[0 0 1 3 0 1 1]
Note that here we have x5^2 (instead of x6^2), which fits to (0,4).
If you really wanted (0,6) you could easily change this and get the original number back:
julia> @time exps = solve_non_negative(ZZMatrix, matrix(ZZ, [1 1 1 1 0 0 -4; 0 0 0 0 2 3 1]), matrix(ZZ, 2,1, [0,6]));
0.182823 seconds (3.52 M allocations: 39.347 MiB, 12.41% gc time)
julia> nrows(exps)
4551
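If one wanted the actual monomials rather than just the exponent vectors, a possible follow-up could look like this (a sketch: the ungraded ring S below is only a stand-in for the Cox ring above, introduced for illustration):

```julia
using Oscar

# degree matrix of the Cox ring from above and the target degree (0, 6)
A = matrix(ZZ, [1 1 1 1 0 0 -4; 0 0 0 0 2 3 1])
d = matrix(ZZ, 2, 1, [0, 6])
exps = solve_non_negative(ZZMatrix, A, d)

# map each exponent row back to a monomial in a 7-variable polynomial ring
S, x = polynomial_ring(QQ, ["x$i" for i in 1:7])
mons = [prod(x[j]^Int(exps[i, j]) for j in 1:7) for i in 1:nrows(exps)]
length(mons)  # should match nrows(exps), i.e. 4551 per the output above
```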
> Feel free to add more examples.

@testset "quo for FPGroup" in test/Groups/homomorphisms.jl (the one after "Operations on homomorphism in Sym(6)"): https://github.com/oscar-system/Oscar.jl/actions/runs/5683498908/job/15404205753?pr=2598#step:6:577, https://github.com/oscar-system/Oscar.jl/actions/runs/5682985233/job/15402594964#step:6:1165, cc @ThomasBreuer
> I think the process was killed due to an OOM situation.
Yes, it could be an OOM. I did 200 test runs on a server and each run took between 37 and 43.7 seconds. But why do we OOM? Does the bot have so little memory? Or do we leak and it piles up?
Concerning @testset "quo for FPGroup" in test/Groups/homomorphisms.jl: the tests that take most of the reported 5.8 seconds are @test_throws ErrorException quo(h, [x^10]) and @test_throws ErrorException quo(h, n). Their aim is to check whether GAP still runs into an error in this situation; once this is no longer the case, some Oscar code may be changed. I do not know a faster way to do this.
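For context, the pattern under discussion is the standard @test_throws idiom; a toy version (with a hypothetical stand-in function instead of the actual GAP-backed quo) looks like:

```julia
using Test

# stand-in for a call that is currently expected to error in GAP
failing_quo(args...) = error("not yet supported")

# this documents the current behavior: the test passes as long as the
# call keeps throwing; once it stops, the guard in Oscar can be removed
@test_throws ErrorException failing_quo(1)
```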
There is really something bad going on with 1.10 and nightly. Previously this happened infrequently, but for a few days now (at most one week) it has been happening all the time.
I started a matrix job for 1.10 and nightly with GC logging and got this:
JIT: 62.053 MiB
Max. RSS: 6358.074 MiB
Free mem: 561.785 MiB
Free pmem: 561.785 MiB
Test Summary: | Pass Total Time
Experimental.gmodule | 10 10 31.7s
GC: pause 761.52ms. collected 2320.698616MB. incr
Heap stats: bytes_mapped 3840.94 MB, bytes_resident 2748.09 MB, heap_size 3517.94 MB, heap_target 540816867.94 MB, live_bytes 18379.10 MB, Fragmentation 5.224
Test Summary: | Pass Total Time
Experimental LocalH2 | 75 75 40.3s
[2046] signal (15): Terminated
in expression starting at /home/runner/work/_actions/julia-actions/julia-runtest/latest/test_harness.jl:7
Error: The operation was canceled.
Log: https://github.com/oscar-system/Oscar.jl/actions/runs/5689868672/job/15422159399#step:6:3171
There are plenty more failed runs in this job.
Regarding the stats: I have no idea how live_bytes 18379.10 MB is possible since the runners have about 7GB of RAM, and the heap_target 552409759.83 MB is quite suspicious...
I noticed one further thing in the GC output. In my first run (e.g. this one) of these 16 jobs, for every one of them the GC live_bytes size jumps up at the end of the GAP OscarInterface tests, like this:
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1286.89 MB, heap_size 1832.69 MB, heap_target 2357.69 MB, live_bytes 1761.48 MB, Fragmentation 0.961
GC: pause 898.46ms. collected 30.505936MB. incr
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1286.89 MB, heap_size 1832.31 MB, heap_target 2357.31 MB, live_bytes 1778.76 MB, Fragmentation 0.971
GC: pause 320.62ms. collected 552.570595MB. incr
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1317.08 MB, heap_size 2221.08 MB, heap_target 869387521.08 MB, live_bytes 1748.25 MB, Fragmentation 0.787
39156 ms (1847 ms GC) and 392MB allocated for alnuth/polynome.tst
-----------------------------------
total 86501 ms (5572 ms GC) and 416MB allocated
0 failures in 4 files
#I No errors detected while testing
GC: pause 148.70ms. collected 258.622101MB. incr
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1307.08 MB, heap_size 2238.83 MB, heap_target 2851.33 MB, live_bytes 3159.52 MB, Fragmentation 1.411
GC: pause 211.27ms. collected 104.489433MB. full recollect
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1307.08 MB, heap_size 1900.67 MB, heap_target 2469.42 MB, live_bytes 9410.99 MB, Fragmentation 4.951
GC: pause 901.04ms. collected 7.217934MB. incr
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1307.00 MB, heap_size 1900.50 MB, heap_target 2469.25 MB, live_bytes 9306.50 MB, Fragmentation 4.897
GC live: 9299.284 MiB
JIT: 19.680 MiB
Max. RSS: 3675.008 MiB
Free mem: 1472.082 MiB
Free pmem: 1472.082 MiB
GC live: 9299.346 MiB
JIT: 19.680 MiB
Max. RSS: 3675.008 MiB
Free mem: 1472.082 MiB
Free pmem: 1472.082 MiB
In some cases, like above, there is also the weird huge heap_target value, often just once.
But disabling these tests did not seem to help; the new job also failed for all 16 runs: https://github.com/oscar-system/Oscar.jl/actions/runs/5690757887
The GC uses some allocation heuristics and GC heuristics to decide the next target and they are usually pretty good, but it turns out sometimes something happens that makes them go crazy.
I've now submitted https://github.com/JuliaLang/julia/issues/50705 to ensure upstream is aware and can add this to the 1.10 milestone
Supposedly https://github.com/JuliaLang/julia/pull/50682 may help with this. We'll see!
I tried with https://github.com/JuliaLang/julia/pull/50682 and Hecke (which has the same problem). We still have the same problem, but only later. Right before the OOM killer did its job, I saw:
Heap stats: bytes_mapped 11330.77 MB, bytes_resident 4593.22 MB,
heap_size 17854.11 MB, heap_target 23454.11 MB, live_bytes 92905.01 MB
This is on a machine with 32 GB of memory (of which 21 were available). Not sure about those numbers, but how can 92 GB live bytes be possible?
For the record, this happens on julia 1.6 as well. See https://github.com/oscar-system/Oscar.jl/actions/runs/5714074722/job/15480670050?pr=2609
That test is disabled for now (https://github.com/oscar-system/Oscar.jl/pull/2579), it also happened on 1.8. But that was rarely in contrast to the constant failures on nightly and 1.10 way earlier in our testsuite.
@benlorenz Can you reproduce it locally (w/o https://github.com/JuliaLang/julia/pull/50682)?
I haven't really tested without that PR. I have run a bunch of tests with that PR merged, and it reliably crashes when running in a memory-constrained cgroup. A problem with that is that malloc behaves slightly differently in such a cgroup than when the memory is really full (it doesn't return null; instead the process will be killed if the limit is exceeded). This causes it to crash even when setting a very low --heap-size-hint value.
I have run various tests in the Github CI (7GB RAM on Ubuntu) to find a value for the heap size hint that works, maybe around 3.5GB (this is with plain nightly / 1.10.0-beta1). Even with such a low value the memory usage can grow too much for the runners, causing the process to be killed. See e.g. here https://github.com/oscar-system/Oscar.jl/actions/runs/5723920046/attempts/2 and https://github.com/oscar-system/Oscar.jl/actions/runs/5718082420/attempts/1 (the extra number at the end of the job name is the heap size hint in GB).
For comparison: julia 1.9.2 runs through the whole testsuite (about 80 minutes) with 8GB heap-size-hint in a 8.5GB cgroup, while julia nightly does not succeed in a 10.5GB cgroup with a hint of 8GB.
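For reference, an invocation of the kind used for these cgroup experiments might look like the following (the systemd-run limits and the exact values are illustrative, not the precise commands used):

```
# run the Oscar testsuite in a cgroup capped at 8.5GB with an 8GB heap hint
systemd-run --user --scope -p MemoryMax=8.5G \
  julia --heap-size-hint=8G -e 'using Pkg; Pkg.test("Oscar")'
```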
I think there are several things contributing here:
- the GC assumes far more memory is available than the runners have (jl_gc_get_max_memory by default returns 2PB)
- memory usage on nightly is noticeably higher: 2.445 GiB for Sys.maxrss() on 1.9.2 after the test, but 4.424 GiB on nightly.
- the miscounting of live memory (live_bytes), which happens with both Hecke and Oscar. [1]

[1]: The miscounting can be triggered with some GAP tests (this uses gcext, but Hecke seems to trigger this without gcext as well):
julia> using Test
julia> using Oscar
[OSCAR ASCII-art banner]
...combining (and extending) ANTIC, GAP, Polymake and Singular
Version 0.13.0-DEV ...
... which comes with absolutely no warranty whatsoever
Type: '?Oscar' for more information
(c) 2019-2023 by The OSCAR Development Team
julia> GC.enable_logging(true)
julia> GC.gc(false); GC.gc(true)
GC: pause 187.85ms. collected 281.680916MB. incr
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.88 MB, heap_size 490.38 MB, heap_target 752.88 MB, live_bytes 341.70 MB, Fragmentation 0.697
GC: pause 86.89ms. collected 0.282654MB. full recollect
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.75 MB, heap_size 490.25 MB, heap_target 752.75 MB, live_bytes 382.11 MB, Fragmentation 0.779
GC: pause 386.24ms. collected 1.028664MB. incr
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.62 MB, heap_size 489.22 MB, heap_target 751.72 MB, live_bytes 381.95 MB, Fragmentation 0.781
julia> @test GAP.Globals.TestDirectory(GAP.Globals.DirectoriesPackageLibrary(GAP.Obj("OscarInterface"), GAP.Obj("tst")))
Architecture: x86_64-pc-linux-gnu-julia1.11-64-kv8
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/ALNUTH.tst
# line 1 of 86 (1%)
GC: pause 85.86ms. collected 0.753052MB. full recollect
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.62 MB, heap_size 489.23 MB, heap_target 751.73 MB, live_bytes 380.92 MB, Fragmentation 0.779
GC: pause 368.91ms. collected 0.018028MB. incr
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.61 MB, heap_size 489.21 MB, heap_target 751.71 MB, live_bytes 380.97 MB
# line 18 of 86 (20%)
GC: pause 99.35ms. collected 333.437424MB. incr
Heap stats: bytes_mapped 512.12 MB, bytes_resident 500.62 MB, heap_size 676.11 MB, heap_target 982.36 MB, live_bytes 380.95 MB, Fragmentation 0.563
GC: pause 126.04ms. collected 522.456375MB. incr
Heap stats: bytes_mapped 704.17 MB, bytes_resident 670.53 MB, heap_size 875.80 MB, heap_target 1313.30 MB, live_bytes 413.54 MB, Fragmentation 0.472
GC: pause 216.44ms. collected 754.326622MB. incr
Heap stats: bytes_mapped 1024.25 MB, bytes_resident 964.05 MB, heap_size 1164.24 MB, heap_target 1689.24 MB, live_bytes 454.10 MB
# line 81 of 86 (94%)
GC: pause 230.63ms. collected 1098.841599MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 1126.45 MB, heap_size 1333.83 MB, heap_target 1683.83 MB, live_bytes 551.80 MB
44871 ms (1092 ms GC) and 11.2MB allocated for alnuth/ALNUTH.tst
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/examples.tst
# line 1 of 271 (0%)
GC: pause 146.92ms. collected 81.539536MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 1102.11 MB, heap_size 1307.03 MB, heap_target 1744.53 MB, live_bytes 585.39 MB, Fragmentation 0.448
GC: pause 390.70ms. collected 155.567131MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 803.80 MB, heap_size 994.14 MB, heap_target 1387.89 MB, live_bytes 574.64 MB
6042 ms (561 ms GC) and 15.3MB allocated for alnuth/examples.tst
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/manual.tst
# line 1 of 64 (1%)
GC: pause 160.84ms. collected 188.842827MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1011.43 MB, heap_target 1405.18 MB, live_bytes 419.07 MB, Fragmentation 0.414
GC: pause 356.13ms. collected 0.587646MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1010.73 MB, heap_target 1404.48 MB, live_bytes 479.72 MB
818 ms (592 ms GC) and 3.63MB allocated for alnuth/manual.tst
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/polynome.tst
# line 2 of 47 (4%)
GC: pause 122.30ms. collected 23.319710MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1008.29 MB, heap_target 1402.04 MB, live_bytes 479.13 MB, Fragmentation 0.475
GC: pause 334.29ms. collected 32.628418MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1002.85 MB, heap_target 1396.60 MB, live_bytes 499.14 MB
# line 16 of 47 (34%)
GC: pause 193.51ms. collected 294.139137MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 791.02 MB, heap_size 1299.46 MB, heap_target 1736.96 MB, live_bytes 466.51 MB
40842 ms (925 ms GC) and 392MB allocated for alnuth/polynome.tst
-----------------------------------
total 92573 ms (3170 ms GC) and 423MB allocated
0 failures in 4 files
#I No errors detected while testing
GC: pause 109.34ms. collected 542.099785MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 804.59 MB, heap_size 1525.13 MB, heap_target 1787.63 MB, live_bytes 1611.86 MB, Fragmentation 1.057
Test Passed
julia> GC.gc(false); GC.gc(true)
GC: pause 25.73ms. collected 29.697510MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 785.30 MB, heap_size 1092.44 MB, heap_target 1486.19 MB, live_bytes 8130.76 MB, Fragmentation 7.443
GC: pause 71.46ms. collected 76.772232MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 785.30 MB, heap_size 1092.44 MB, heap_target 1486.19 MB, live_bytes 8120.73 MB, Fragmentation 7.434
GC: pause 282.94ms. collected 7.942421MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 784.48 MB, heap_size 1091.55 MB, heap_target 1485.30 MB, live_bytes 8043.96 MB, Fragmentation 7.369
Note that during the GC call after the tests were done, the live_bytes size jumps to 8GB.
I can't reproduce that, which is really odd.
> I can't reproduce that, which is really odd.

The live_bytes issue? Maybe try it like this:
$ julia-nightly
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 1.11.0-DEV.203 (2023-07-31)
_/ |\__'_|_|_|\__'_| | Commit ec8df3da359 (1 day old master)
|__/ |
(@v1.11) pkg> activate --temp
Activating new project at `/tmp/jl_GThsJo`
(jl_GThsJo) pkg> add Oscar#master
<....>
julia> using Oscar; using Test;
<....>
julia> GC.enable_logging(true)
julia> @test GAP.Globals.TestDirectory(GAP.Globals.DirectoriesPackageLibrary(GAP.Obj("OscarInterface"), GAP.Obj("tst")))
<....>
Test Passed
julia> GC.gc(false); GC.gc(true)
GC: pause 75.32ms. collected 554.220963MB. incr
Heap stats: bytes_mapped 2176.53 MB, bytes_resident 886.66 MB, heap_size 1578.59 MB, heap_target 2103.59 MB, live_bytes 1634.71 MB, Fragmentation 1.036
GC: pause 69.74ms. collected 93.646736MB. full recollect
Heap stats: bytes_mapped 2176.53 MB, bytes_resident 875.67 MB, heap_size 1162.12 MB, heap_target 1599.62 MB, live_bytes 8104.16 MB, Fragmentation 6.974
GC: pause 221.34ms. collected 7.917084MB. incr
Heap stats: bytes_mapped 2176.53 MB, bytes_resident 874.69 MB, heap_size 1161.03 MB, heap_target 1598.53 MB, live_bytes 8010.67 MB, Fragmentation 6.900
This happens reliably on two (very) different Linux machines with an official nightly binary (and also with a completely fresh julia depot).
I will try some debugging.
So it seems there are a couple of issues. I fixed a bug on the PR, but it still seems that our live_bytes tracking is no longer accurate (it doesn't actually affect any of the GC decisions on 1.10). live_bytes should never be greater than heap_size because it's bounded by it. At the same time I also realized that heap-size-hint was broken; hopefully I fixed it in the last commit. The tracking seems to get more and more wrong over time. We are potentially either double counting allocations or missing frees in the accounting.
It also has some weird things where it potentially doesn't apply to children processes https://github.com/JuliaLang/julia/issues/50673
Unfortunately, https://github.com/JuliaLang/julia/pull/50682 being present in julia nightly didn't fix this issue. Any idea on how to proceed?
We should mention this over at https://github.com/JuliaLang/julia/issues/50705
I'll play with it a bit more, see if I can figure something out.
Did you run on GitHub Actions or with a local cgroup? It does seem that the testset is very close to needing 7GB, at least when running locally. And from looking at it, it does seem like the GC is running, at least locally; it's just not able to free stuff. @benlorenz could you run the tests with your cgroups again?
What's the status of this? Have the changes in Julia fixed this? Or did we end up disabling some tests (which?) to workaround?
In other words: does this need to stay open? If so, it would be good to have an explicit pointer to whatever disabled tests can be used to repro the issue.
There are no disabled tests. I think several things improved this:
I don't think this needs to stay open.
Describe the bug: In the past days, I noticed the test-ubuntu-1.8 job terminating unexpectedly mid-way. A restart of the test did always succeed.
See e.g. https://github.com/oscar-system/Oscar.jl/actions/runs/5145822885/jobs/9263999579 and https://github.com/oscar-system/Oscar.jl/actions/runs/5133045620/jobs/9235131573