oscar-system / Oscar.jl

A comprehensive open source computer algebra system for computations in algebra, geometry, and number theory.
https://www.oscar-system.org

CI failures for ubuntu on 1.10 and nightly, probably out-of-memory issues #2441

Closed: lgoettgens closed this issue 6 months ago

lgoettgens commented 1 year ago

Describe the bug
In the past few days, I noticed the test-ubuntu-1.8 job terminating unexpectedly midway. A restart of the test always succeeded.

See e.g. https://github.com/oscar-system/Oscar.jl/actions/runs/5145822885/jobs/9263999579 and https://github.com/oscar-system/Oscar.jl/actions/runs/5133045620/jobs/9235131573

thofma commented 1 year ago

After "wall of chambers" comes

Test Summary:                  | Pass  Total
K3 surface automorphism groups |   15     15
-> Testing ./AlgebraicGeometry/Surfaces/K3Auto.jl took: 154.803 seconds, 31089.35 MB

(copied from a different run), so I am not too surprised that this might get killed. Same for the second log you posted, where the next test would be

Test Summary:                                                       | Pass  Total   Time
Line bundle cohomologies and vanishing sets (set_attributes = true) |   30     30  27.9s
Test Summary:                                                        | Pass  Total   Time
Line bundle cohomologies and vanishing sets (set_attributes = false) |   30     30  24.3s
-> Testing AlgebraicGeometry/ToricVarieties/line_bundle_cohomologies.jl took: runtime 50.026 seconds + compilation 7.299 seconds + recompilation 0.000 seconds, 12774.31 MB
benlorenz commented 1 year ago

See #2570 for some more examples of this; it happens not only with julia 1.8 but also with 1.6 and 1.10.

So far we have seen this happen:

Feel free to add more examples.

simonbrandhorst commented 1 year ago

Is it possible to get more information? Like did the process OOM or did it take too long? There is some randomization involved in K3Auto.jl. I am wondering where it fails exactly. Maybe there is a bug somewhere. Can this be reproduced using some known random seed?

simonbrandhorst commented 1 year ago

I can try to make the test run faster. But I think that more than 50% of its time is already compilation. So how should we proceed?

benlorenz commented 1 year ago

I think the process was killed due to an OOM situation. According to the timestamps in the log (can be turned on via the settings knob in the top right of that page) this was about 1.5 minutes after the walls of chamber testset was complete. (The total duration of that job is about 1 hour which is far away from any time limit). Unfortunately we can't get any more information from this failure.

Maybe you can adapt the code to use a custom seeded rng, e.g. Oscar.get_seeded_rng(), then we might be able to reproduce the error if it happens again (the seed for this is fixed in the CI but one could easily run the test in a loop with different seeds to find bad examples).

For the default julia rng knowing the seed would probably not really help as there might be an unknown + non-deterministic number of other uses which change the rng.
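
The seed-scanning idea can be sketched generically (toy Python stand-ins; in Oscar one would thread `Oscar.get_seeded_rng()` through the testset instead):

```python
import random

# Toy stand-in for a randomized testset such as K3Auto.jl; in Oscar the
# rng would come from Oscar.get_seeded_rng(). Returns True on "success".
def run_testset(rng):
    return rng.random() < 0.99  # pretend roughly 1% of seeds hit a bug

def find_bad_seeds(n_seeds):
    # scan fixed seeds so that any failure can be replayed exactly
    return [s for s in range(n_seeds) if not run_testset(random.Random(s))]

# a bad seed, once found, reproduces the failure deterministically
bad_seeds = find_bad_seeds(500)
assert all(not run_testset(random.Random(s)) for s in bad_seeds)
```

With a fixed seed per run, a CI failure becomes a concrete input that can be replayed locally, unlike the global default rng whose state depends on every earlier use.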

HereAround commented 1 year ago

I cannot say much about the detailed cause or a possible resolution. However, I can share a few insights into AlgebraicGeometry/ToricVarieties/line_bundle_cohomologies.jl, which seems to be the place where this segfault is triggered quite often. (In fact, I experienced this myself the other day and considered it a fluke. At that time, however, I was not yet aware of this issue.)

As said, I am not sure about the cause. But maybe this information helps @benlorenz to guess/track down the cause for this?

benlorenz commented 1 year ago

The original report in this ticket was from the first of June, the last change in the line_bundle_cohomologies test file was just a week earlier: https://github.com/oscar-system/Oscar.jl/pull/2396

Please note that this is not a segfault but the process seems to run out of memory. Since julia reports about 12GB of allocations for that test-file the memory-intensive code seems to use julia memory allocations, i.e. it is probably not cohomCalg. A good start might be to check which of the test-groups in that file need a lot of allocations, e.g. using @time?
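
The suggested measurement amounts to wrapping each test group in `@time`; as a language-agnostic sketch of the same technique (Python's tracemalloc playing the role of Julia's allocation counter, with toy test groups):

```python
import tracemalloc

def measure(label, testgroup):
    # run one test group and report its peak traced allocations,
    # analogous to wrapping a Julia testset in @time
    tracemalloc.start()
    testgroup()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{label}: peak {peak / 1e6:.2f} MB allocated")
    return peak

light = lambda: sum(range(1_000))                           # tiny allocations
heavy = lambda: [list(range(1_000)) for _ in range(1_000)]  # many MB

# the group with the larger peak is the candidate to trim or split
assert measure("heavy group", heavy) > measure("light group", light)
```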

benlorenz commented 1 year ago

Another new case:

-> Testing AlgebraicGeometry/Schemes/CoherentSheaves.jl took: 78.817 seconds, 5119.89 MB

signal (15): Terminated
in expression starting at none:1
epoll_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)

from the OscarCI tests in GAP: https://github.com/oscar-system/GAP.jl/actions/runs/5608123338/jobs/10260113884#step:7:1471

The next testgroup after CoherentSheaves.jl would be elliptic_surface.jl which also needs quite a lot of memory allocations:

-> Testing AlgebraicGeometry/Schemes/CoherentSheaves.jl took: runtime 11.783 seconds + compilation 74.945 seconds + recompilation 0.000 seconds, 3271.67 MB
Test Summary:     | Pass  Total     Time
elliptic surfaces |    2      2  5m10.8s
-> Testing AlgebraicGeometry/Schemes/elliptic_surface.jl took: runtime 151.211 seconds + compilation 159.639 seconds + recompilation 0.000 seconds, 15826.34 MB
HereAround commented 1 year ago

@benlorenz I stand corrected. Thank you for reminding me of this.

I could narrow this down to a single line. This line would compute all monomials of a certain degree in a certain graded ring (a.k.a. a homogeneous_component). I executed AlgebraicGeometry/ToricVarieties/line_bundle_cohomologies.jl with @time. Here is what I find:

So a fix could be to just remove this line from the tests. A corresponding PR is here: https://github.com/oscar-system/Oscar.jl/pull/2579.

Alternatively, we could of course dive deeper and see if the homogeneous component computation could be improved. If the latter is interesting, it can be reproduced as follows:

julia> ray_generators = [[1, 0, 0,-2,-3], [0, 1, 0,-2,-3], [0, 0, 1,-2,-3], [-1,-1,-1,-2,-3], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 0, 0,-2,-3]];

julia> max_cones = [[1, 2, 3, 5, 6], [1, 2, 3, 5, 7], [1, 2, 3, 6, 7], [2, 3, 4, 5, 6], [2, 3, 4, 5, 7], [2, 3, 4, 6, 7], [1, 3, 4, 5, 6], [1, 3, 4, 5, 7], [1, 3, 4, 6, 7], [1, 2, 4, 5, 6], [1, 2, 4, 5, 7], [1, 2, 4, 6, 7]];

julia> weierstrass_over_p3 = normal_toric_variety(ray_generators, max_cones; non_redundant = true);

julia> R = cox_ring(weierstrass_over_p3)
Multivariate polynomial ring in 7 variables over QQ graded by 
  x1 -> [1 0]
  x2 -> [1 0]
  x3 -> [1 0]
  x4 -> [1 0]
  x5 -> [0 2]
  x6 -> [0 3]
  x7 -> [-4 1]

Of this ring R, I wish to compute all monomials of degree (0,4). This is a vector space of dimension 4551. Specifically, for this, I would have executed the following commands:

julia> l4 = anticanonical_bundle(weierstrass_over_p3);

julia> b = basis_of_global_sections(l4);

julia> length(b) == 4551
true

Let me cc @wdecker regarding the computation of homogeneous components, just in case.

simonbrandhorst commented 1 year ago

The elliptic surface stuff is a really complex data structure and computation. We worked hard to get this test running in 2 minutes. I am afraid that the test cannot be simplified much further. Plus it takes more time for compilation than for computation right now. Not sure what can be done about the allocations though. There is probably room for improvement all over the place.

benlorenz commented 1 year ago

> Alternatively, we could of course dive deeper and see if the homogeneous component computation could be improved. If the latter is interesting, it can be reproduced as follows:
>
> julia> ray_generators = [[1, 0, 0,-2,-3], [0, 1, 0,-2,-3], [0, 0, 1,-2,-3], [-1,-1,-1,-2,-3], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 0, 0,-2,-3]];
>
> julia> max_cones = [[1, 2, 3, 5, 6], [1, 2, 3, 5, 7], [1, 2, 3, 6, 7], [2, 3, 4, 5, 6], [2, 3, 4, 5, 7], [2, 3, 4, 6, 7], [1, 3, 4, 5, 6], [1, 3, 4, 5, 7], [1, 3, 4, 6, 7], [1, 2, 4, 5, 6], [1, 2, 4, 5, 7], [1, 2, 4, 6, 7]];
>
> julia> weierstrass_over_p3 = normal_toric_variety(ray_generators, max_cones; non_redundant = true);
>
> julia> R = cox_ring(weierstrass_over_p3)
> Multivariate polynomial ring in 7 variables over QQ graded by 
>   x1 -> [1 0]
>   x2 -> [1 0]
>   x3 -> [1 0]
>   x4 -> [1 0]
>   x5 -> [0 2]
>   x6 -> [0 3]
>   x7 -> [-4 1]
>
> Of this ring R, I wish to compute all monomials of degree (0,4). This is a vector space of dimension 4551. Specifically, for this, I would have executed the following commands:
>
> julia> l4 = anticanonical_bundle(weierstrass_over_p3);
>
> julia> b = basis_of_global_sections(l4);
>
> julia> length(b) == 4551
> true

julia> b[1:5]
5-element Vector{MPolyDecRingElem{QQFieldElem, QQMPolyRingElem}}:
 x6^2
 x5^3
 x4^4*x5*x6*x7
 x4^8*x5^2*x7^2
 x4^12*x6*x7^3

This looks like degree (0,6) to me?

To me this feels like integer points in a polyhedron given by some non-negative constraints and two equations:

julia> exps = solve_non_negative(ZZMatrix, matrix(ZZ, [1 1 1 1 0 0 -4; 0 0 0 0 2 3 1]), matrix(ZZ, 2,1, [0,4]));

julia> nrows(exps)
1170

julia> exps[1:5,:]
[0   0   0    0   2   0   0]
[0   0   0    4   0   1   1]
[0   0   0    8   1   0   2]
[0   0   0   16   0   0   4]
[0   0   1    3   0   1   1]

Note that here we have x5^2 (instead of x6^2) which fits to (0,4).

If you really wanted (0,6) you could easily change this and get the original number back:

julia> @time exps = solve_non_negative(ZZMatrix, matrix(ZZ, [1 1 1 1 0 0 -4; 0 0 0 0 2 3 1]), matrix(ZZ, 2,1, [0,6]));
  0.182823 seconds (3.52 M allocations: 39.347 MiB, 12.41% gc time)

julia> nrows(exps)
4551
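
Both counts can be cross-checked independently: splitting off x7 reduces the system to stars-and-bars for x1..x4 plus a short enumeration for (x5, x6). A Python sketch using the same two grading equations as the solve_non_negative call above:

```python
from math import comb

def count_monomials(target):
    """Count non-negative integer solutions (x1..x7) of
         x1 + x2 + x3 + x4 - 4*x7 = target[0]
         2*x5 + 3*x6 + x7        = target[1],
    i.e. the integer points the solve_non_negative call enumerates."""
    a, b = target
    total = 0
    for x7 in range(b + 1):      # x7 <= b since 2*x5 + 3*x6 >= 0
        s = a + 4 * x7           # required value of x1 + x2 + x3 + x4
        if s < 0:
            continue
        n4 = comb(s + 3, 3)      # stars and bars: #(x1..x4) with sum s
        r = b - x7               # remaining degree for 2*x5 + 3*x6
        n2 = sum(1 for x6 in range(r // 3 + 1) if (r - 3 * x6) % 2 == 0)
        total += n4 * n2
    return total

print(count_monomials((0, 4)))  # 1170, matching nrows(exps) above
print(count_monomials((0, 6)))  # 4551, matching the basis dimension
```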
lgoettgens commented 1 year ago

Feel free to add more examples.

simonbrandhorst commented 1 year ago

> I think the process was killed due to an OOM situation. According to the timestamps in the log (can be turned on via the settings knob in the top right of that page) this was about 1.5 minutes after the walls of chamber testset was complete. (The total duration of that job is about 1 hour which is far away from any time limit). Unfortunately we can't get any more information from this failure.
>
> Maybe you can adapt the code to use a custom seeded rng, e.g. Oscar.get_seeded_rng(), then we might be able to reproduce the error if it happens again (the seed for this is fixed in the CI but one could easily run the test in a loop with different seeds to find bad examples).
>
> For the default julia rng knowing the seed would probably not really help as there might be an unknown + non-deterministic number of other uses which change the rng.

Yes, it could be an OOM. I did 200 test runs on a server and each run took between 37 and 43.7 seconds. But why do we OOM? Does the bot have so little memory? Or do we leak and it piles up?

ThomasBreuer commented 1 year ago

Concerning

@testset "quo for FPGroup" in test/Groups/homomorphisms.jl

the tests that take most of the reported 5.8 seconds are @test_throws ErrorException quo(h, [x^10]) and @test_throws ErrorException quo(h, n). Their aim is to check whether GAP still runs into an error in this situation; once this is no longer the case, some Oscar code may be changed. I do not know a faster way to do this.
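
This is the canary-test pattern: keep asserting that a known upstream error still occurs, so the suite fails loudly once GAP is fixed and the downstream workaround can be dropped. A minimal sketch, where `quo_stub` is a hypothetical stand-in for the failing GAP call:

```python
# Canary-test pattern: assert that a known upstream bug is still present;
# when upstream fixes it, this check fails and signals that the
# downstream workaround can be removed.
def quo_stub():
    # hypothetical stand-in for the GAP call behind quo(h, [x^10])
    raise RuntimeError("GAP: cannot compute this quotient")

def assert_still_broken(fn, exc):
    try:
        fn()
    except exc:
        return  # still broken upstream, as expected
    raise AssertionError("upstream fixed: remove the workaround")

assert_still_broken(quo_stub, RuntimeError)
```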

benlorenz commented 1 year ago

There is really something bad going on with 1.10 and nightly. Previously this happened infrequently, but since a few days ago (at most one week) it has been happening all the time.

I started a matrix job for 1.10 and nightly with GC logging and got this:

JIT:          62.053 MiB
Max. RSS:   6358.074 MiB
Free mem:    561.785 MiB
Free pmem:   561.785 MiB
Test Summary:        | Pass  Total   Time
Experimental.gmodule |   10     10  31.7s
GC: pause 761.52ms. collected 2320.698616MB. incr 
Heap stats: bytes_mapped 3840.94 MB, bytes_resident 2748.09 MB, heap_size 3517.94 MB, heap_target 540816867.94 MB, live_bytes 18379.10 MB
, Fragmentation 5.224Test Summary:        | Pass  Total   Time
Experimental LocalH2 |   75     75  40.3s

[2046] signal (15): Terminated
in expression starting at /home/runner/work/_actions/julia-actions/julia-runtest/latest/test_harness.jl:7
Error: The operation was canceled.

Log: https://github.com/oscar-system/Oscar.jl/actions/runs/5689868672/job/15422159399#step:6:3171

There are plenty more failed runs in this job. Regarding the stats: I have no idea how live_bytes 18379.10 MB is possible since the runners have about 7GB of RAM, and the heap_target 552409759.83 MB is quite suspicious...

benlorenz commented 1 year ago

I noticed one further thing in the GC output. In my first run (e.g. this one) of these 16 jobs, for every one of them the GC live_bytes size jumps up at the end of the GAP OscarInterface tests, like this:

Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1286.89 MB, heap_size 1832.69 MB, heap_target 2357.69 MB, live_bytes 1761.48 MB
, Fragmentation 0.961GC: pause 898.46ms. collected 30.505936MB. incr 
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1286.89 MB, heap_size 1832.31 MB, heap_target 2357.31 MB, live_bytes 1778.76 MB
, Fragmentation 0.971GC: pause 320.62ms. collected 552.570595MB. incr 
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1317.08 MB, heap_size 2221.08 MB, heap_target 869387521.08 MB, live_bytes 1748.25 MB
, Fragmentation 0.787   39156 ms (1847 ms GC) and 392MB allocated for alnuth/polynome.tst
-----------------------------------
total     86501 ms (5572 ms GC) and 416MB allocated
              0 failures in 4 files

#I  No errors detected while testing

GC: pause 148.70ms. collected 258.622101MB. incr 
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1307.08 MB, heap_size 2238.83 MB, heap_target 2851.33 MB, live_bytes 3159.52 MB
, Fragmentation 1.411GC: pause 211.27ms. collected 104.489433MB. full recollect
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1307.08 MB, heap_size 1900.67 MB, heap_target 2469.42 MB, live_bytes 9410.99 MB
, Fragmentation 4.951GC: pause 901.04ms. collected 7.217934MB. incr 
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1307.00 MB, heap_size 1900.50 MB, heap_target 2469.25 MB, live_bytes 9306.50 MB
, Fragmentation 4.897GC live:    9299.284 MiB
JIT:          19.680 MiB
Max. RSS:   3675.008 MiB
Free mem:   1472.082 MiB
Free pmem:  1472.082 MiB
GC live:    9299.346 MiB
JIT:          19.680 MiB
Max. RSS:   3675.008 MiB
Free mem:   1472.082 MiB
Free pmem:  1472.082 MiB

In some cases, like above, there is also the weird huge heap_target value, often just once.

But disabling these tests did not seem to help; the new job also failed for all 16 runs: https://github.com/oscar-system/Oscar.jl/actions/runs/5690757887

gbaraldi commented 1 year ago

The GC uses some allocation heuristics and GC heuristics to decide the next target and they are usually pretty good, but it turns out sometimes something happens that makes them go crazy.

fingolfin commented 1 year ago

I've now submitted https://github.com/JuliaLang/julia/issues/50705 to ensure upstream is aware and can add this to the 1.10 milestone

fingolfin commented 1 year ago

Supposedly https://github.com/JuliaLang/julia/pull/50682 may help with this. We'll see!

thofma commented 1 year ago

I tried with https://github.com/JuliaLang/julia/pull/50682 and Hecke (which has the same problem). We still have the same problem, but only later. Right before the OOM killer did its job, I saw:

Heap stats: bytes_mapped 11330.77 MB, bytes_resident 4593.22 MB,
heap_size 17854.11 MB, heap_target 23454.11 MB, live_bytes 92905.01 MB

This is on a machine with 32 GB of memory (of which 21 were available). Not sure about those numbers, but how can 92 GB of live bytes be possible?

lgoettgens commented 1 year ago

For the record, this happens on julia 1.6 as well. See https://github.com/oscar-system/Oscar.jl/actions/runs/5714074722/job/15480670050?pr=2609

benlorenz commented 1 year ago

> For the record, this happens on julia 1.6 as well. See oscar-system/Oscar.jl/actions/runs/5714074722/job/15480670050?pr=2609

That test is disabled for now (https://github.com/oscar-system/Oscar.jl/pull/2579); it also happened on 1.8. But that was rare, in contrast to the constant failures on nightly and 1.10, which occur much earlier in our testsuite.

thofma commented 1 year ago

@benlorenz Can you reproduce it locally (w/o https://github.com/JuliaLang/julia/pull/50682)?

benlorenz commented 1 year ago

I haven't really tested without that PR. I have run a bunch of tests with that PR merged, and it reliably crashes when running in a memory-constrained cgroup. A problem with that is that malloc behaves slightly differently in such a cgroup than when the memory is really full (it doesn't return NULL; instead the process is killed once the limit is exceeded). This causes it to crash even when setting a very low --heap-size-hint value.

I have run various tests in the GitHub CI (7GB RAM on Ubuntu) to find a value for the heap size hint that works, maybe around 3.5GB (this is with plain nightly / 1.10.0-beta1). Even with such a low value, the memory usage can grow too much for the runners, causing the process to be killed. See e.g. https://github.com/oscar-system/Oscar.jl/actions/runs/5723920046/attempts/2 and https://github.com/oscar-system/Oscar.jl/actions/runs/5718082420/attempts/1 (the extra number at the end of the job name is the heap size hint in GB).

For comparison: julia 1.9.2 runs through the whole testsuite (about 80 minutes) with 8GB heap-size-hint in a 8.5GB cgroup, while julia nightly does not succeed in a 10.5GB cgroup with a hint of 8GB.

I think there are several things contributing here:

[1]: The miscounting can be triggered with some GAP tests (this uses gcext, but Hecke seems to trigger this without gcext as well):

julia> using Test

julia> using Oscar
 -----    -----    -----      -      -----   
|     |  |     |  |     |    | |    |     |  
|     |  |        |         |   |   |     |  
|     |   -----   |        |     |  |-----   
|     |        |  |        |-----|  |   |    
|     |  |     |  |     |  |     |  |    |   
 -----    -----    -----   -     -  -     -  

...combining (and extending) ANTIC, GAP, Polymake and Singular
Version 0.13.0-DEV ... 
 ... which comes with absolutely no warranty whatsoever
Type: '?Oscar' for more information
(c) 2019-2023 by The OSCAR Development Team

julia> GC.enable_logging(true)

julia> GC.gc(false); GC.gc(true)
GC: pause 187.85ms. collected 281.680916MB. incr 
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.88 MB, heap_size 490.38 MB, heap_target 752.88 MB, live_bytes 341.70 MB
, Fragmentation 0.697GC: pause 86.89ms. collected 0.282654MB. full recollect
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.75 MB, heap_size 490.25 MB, heap_target 752.75 MB, live_bytes 382.11 MB
, Fragmentation 0.779GC: pause 386.24ms. collected 1.028664MB. incr 
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.62 MB, heap_size 489.22 MB, heap_target 751.72 MB, live_bytes 381.95 MB
, Fragmentation 0.781
julia> @test GAP.Globals.TestDirectory(GAP.Globals.DirectoriesPackageLibrary(GAP.Obj("OscarInterface"), GAP.Obj("tst")))
Architecture: x86_64-pc-linux-gnu-julia1.11-64-kv8

testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/ALNUTH.tst
# line 1 of 86 (1%)GC: pause 85.86ms. collected 0.753052MB. full recollect
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.62 MB, heap_size 489.23 MB, heap_target 751.73 MB, live_bytes 380.92 MB
, Fragmentation 0.779GC: pause 368.91ms. collected 0.018028MB. incr 
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.61 MB, heap_size 489.21 MB, heap_target 751.71 MB, live_bytes 380.97 MB
# line 18 of 86 (20%)GC: pause 99.35ms. collected 333.437424MB. incr 
Heap stats: bytes_mapped 512.12 MB, bytes_resident 500.62 MB, heap_size 676.11 MB, heap_target 982.36 MB, live_bytes 380.95 MB
, Fragmentation 0.563GC: pause 126.04ms. collected 522.456375MB. incr 
Heap stats: bytes_mapped 704.17 MB, bytes_resident 670.53 MB, heap_size 875.80 MB, heap_target 1313.30 MB, live_bytes 413.54 MB
, Fragmentation 0.472GC: pause 216.44ms. collected 754.326622MB. incr 
Heap stats: bytes_mapped 1024.25 MB, bytes_resident 964.05 MB, heap_size 1164.24 MB, heap_target 1689.24 MB, live_bytes 454.10 MB
# line 81 of 86 (94%)GC: pause 230.63ms. collected 1098.841599MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 1126.45 MB, heap_size 1333.83 MB, heap_target 1683.83 MB, live_bytes 551.80 MB
   44871 ms (1092 ms GC) and 11.2MB allocated for alnuth/ALNUTH.tst
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/examples.t\
st
# line 1 of 271 (0%)GC: pause 146.92ms. collected 81.539536MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 1102.11 MB, heap_size 1307.03 MB, heap_target 1744.53 MB, live_bytes 585.39 MB
, Fragmentation 0.448GC: pause 390.70ms. collected 155.567131MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 803.80 MB, heap_size 994.14 MB, heap_target 1387.89 MB, live_bytes 574.64 MB
    6042 ms (561 ms GC) and 15.3MB allocated for alnuth/examples.tst
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/manual.tst
# line 1 of 64 (1%)GC: pause 160.84ms. collected 188.842827MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1011.43 MB, heap_target 1405.18 MB, live_bytes 419.07 MB
, Fragmentation 0.414GC: pause 356.13ms. collected 0.587646MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1010.73 MB, heap_target 1404.48 MB, live_bytes 479.72 MB
     818 ms (592 ms GC) and 3.63MB allocated for alnuth/manual.tst
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/polynome.t\
st
# line 2 of 47 (4%)GC: pause 122.30ms. collected 23.319710MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1008.29 MB, heap_target 1402.04 MB, live_bytes 479.13 MB
, Fragmentation 0.475GC: pause 334.29ms. collected 32.628418MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1002.85 MB, heap_target 1396.60 MB, live_bytes 499.14 MB
# line 16 of 47 (34%)GC: pause 193.51ms. collected 294.139137MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 791.02 MB, heap_size 1299.46 MB, heap_target 1736.96 MB, live_bytes 466.51 MB
   40842 ms (925 ms GC) and 392MB allocated for alnuth/polynome.tst
-----------------------------------
total     92573 ms (3170 ms GC) and 423MB allocated
              0 failures in 4 files

#I  No errors detected while testing

GC: pause 109.34ms. collected 542.099785MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 804.59 MB, heap_size 1525.13 MB, heap_target 1787.63 MB, live_bytes 1611.86 MB
, Fragmentation 1.057Test Passed

julia> GC.gc(false); GC.gc(true)
GC: pause 25.73ms. collected 29.697510MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 785.30 MB, heap_size 1092.44 MB, heap_target 1486.19 MB, live_bytes 8130.76 MB
, Fragmentation 7.443GC: pause 71.46ms. collected 76.772232MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 785.30 MB, heap_size 1092.44 MB, heap_target 1486.19 MB, live_bytes 8120.73 MB
, Fragmentation 7.434GC: pause 282.94ms. collected 7.942421MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 784.48 MB, heap_size 1091.55 MB, heap_target 1485.30 MB, live_bytes 8043.96 MB
, Fragmentation 7.369

Note that during the GC call after the tests were done the live_bytes size jumps to 8GB.

gbaraldi commented 1 year ago

I can't reproduce that, which is really odd.

benlorenz commented 1 year ago

> I can't reproduce that, which is really odd.

The live_bytes issue? Maybe try it like this:

$ julia-nightly
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-DEV.203 (2023-07-31)
 _/ |\__'_|_|_|\__'_|  |  Commit ec8df3da359 (1 day old master)
|__/                   |

(@v1.11) pkg> activate --temp
  Activating new project at `/tmp/jl_GThsJo`

(jl_GThsJo) pkg> add Oscar#master
<....>
julia> using Oscar; using Test;
<....>
julia> GC.enable_logging(true)

julia> @test GAP.Globals.TestDirectory(GAP.Globals.DirectoriesPackageLibrary(GAP.Obj("OscarInterface"), GAP.Obj("tst")))
<....>

Test Passed

julia> GC.gc(false); GC.gc(true)
GC: pause 75.32ms. collected 554.220963MB. incr
Heap stats: bytes_mapped 2176.53 MB, bytes_resident 886.66 MB, heap_size 1578.59 MB, heap_target 2103.59 MB, live_bytes 1634.71 MB
, Fragmentation 1.036GC: pause 69.74ms. collected 93.646736MB. full recollect
Heap stats: bytes_mapped 2176.53 MB, bytes_resident 875.67 MB, heap_size 1162.12 MB, heap_target 1599.62 MB, live_bytes 8104.16 MB
, Fragmentation 6.974GC: pause 221.34ms. collected 7.917084MB. incr
Heap stats: bytes_mapped 2176.53 MB, bytes_resident 874.69 MB, heap_size 1161.03 MB, heap_target 1598.53 MB, live_bytes 8010.67 MB
, Fragmentation 6.900

This happens reliably on two (very) different Linux machines with an official nightly binary (and also with a completely fresh julia depot).

I will try some debugging.

gbaraldi commented 1 year ago

So it seems there are a couple of issues. I fixed a bug on the PR, but it still seems that our live_bytes tracking is no longer accurate (it doesn't actually affect any of the GC decisions on 1.10). live_bytes should never be greater than heap_size, because it is bounded by it. At the same time I also realized that heap-size-hint was broken; hopefully I fixed it in the last commit. It seems to get more and more wrong over time: we are potentially either double counting allocations or missing frees in the accounting. There are also some weird things, e.g. it potentially doesn't apply to child processes: https://github.com/JuliaLang/julia/issues/50673
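
The broken invariant can be illustrated with a toy accounting model: live_bytes and heap_size must move together, and a single double-counted allocation (or missed free) already produces the impossible stats seen in the CI logs. This is only a sketch of the bookkeeping idea, not Julia's actual GC code:

```python
class HeapStats:
    """Toy model of the GC byte accounting discussed above."""
    def __init__(self):
        self.heap_size = 0   # bytes currently backing the heap
        self.live_bytes = 0  # bytes believed to be live

    def alloc(self, n):
        self.heap_size += n
        self.live_bytes += n

    def free(self, n):
        self.heap_size -= n
        self.live_bytes -= n

stats = HeapStats()
stats.alloc(100); stats.alloc(50); stats.free(50)
assert stats.live_bytes <= stats.heap_size  # holds while counting is correct

# double counting one allocation (or missing a free) breaks the invariant,
# yielding "live_bytes > heap_size" states like those in the logs
stats.live_bytes += 100
assert stats.live_bytes > stats.heap_size
```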

lgoettgens commented 1 year ago

Unfortunately, https://github.com/JuliaLang/julia/pull/50682 being present in julia nightly didn't fix this issue. Any idea on how to proceed?

thofma commented 1 year ago

We should mention this over at https://github.com/JuliaLang/julia/issues/50705

gbaraldi commented 1 year ago

I'll play with it a bit more, see if I can figure something out.

gbaraldi commented 1 year ago

Did you run on GitHub Actions or with a local cgroup? It does seem that the testset is very close to needing 7GB, at least when running locally. And from looking at it, the GC does seem to be running, at least locally; it's just not able to free stuff. @benlorenz could you run the tests with your cgroups again?

fingolfin commented 6 months ago

What's the status of this? Have the changes in Julia fixed this? Or did we end up disabling some tests (which?) as a workaround?

In other words: does this need to stay open? If so, it would be good to have an explicit pointer to whatever disabled tests can be used to repro the issue.

benlorenz commented 6 months ago

There are no disabled tests. I think several things improved this:

I don't think this needs to stay open.