nix-community / nix-eval-jobs

Parallel nix evaluator with a streamable json output [maintainers @Mic92, @adisbladis]
GNU General Public License v3.0

Enable Boehm memory garbage collection #310

Open GrahamDennis opened 2 months ago

GrahamDennis commented 2 months ago

Nix uses the conservative Boehm garbage collector; however, it was previously explicitly disabled in nix-eval-jobs:

/* We are doing the garbage collection by killing forks */
setenv("GC_DONT_GC", "1", 1);

The commit message for this change suggests that disabling GC wasn't an intentional decision, but rather a workaround for Boehm not knowing about the threads nix-eval-jobs was using, which caused it to print "Collecting from unknown thread" messages.

This PR explicitly registers the worker threads with Boehm, which resolves the original issue.

By enabling memory garbage collection, workers will use less peak memory during evaluation, allowing them to live longer and re-use a larger number of prior evaluations before being reaped.

GrahamDennis commented 2 months ago

@Mic92 : I can't see many contributed PRs, so I don't know what the expected workflow is. I'm pinging you so you know there's something to review when you have time :-)

Mic92 commented 2 months ago

The operating system is actually a fast garbage collector itself, albeit a very coarse-grained one in this case. Have you done benchmarks showing the difference in performance for real-world projects, i.e. nixpkgs?

GrahamDennis commented 2 months ago

@Mic92 : If you prefer, I'm happy to put this behind an option. My motivation was that when evaluating NixOS systems, memory usage would grow substantially, particularly for systems with specialisations. Before this change, evaluating NixOS modules consumed around 20GB of RAM; after it, a single evaluation used around 8GB. This meant that using nix-eval-jobs no longer caused evaluation to use more RAM, and if we set the worker maximum memory usage above the default, we still get evaluation caching.

Mic92 commented 2 months ago

> @Mic92 : If you prefer, I'm happy to put this behind an option. My motivation was that when evaluating NixOS systems, memory usage would grow substantially, particularly for systems with specialisations. Before this change, evaluating NixOS modules consumed around 20GB of RAM; after it, a single evaluation used around 8GB. This meant that using nix-eval-jobs no longer caused evaluation to use more RAM, and if we set the worker maximum memory usage above the default, we still get evaluation caching.

Don't get me wrong: if it's a better strategy, we can also have this by default, and I would rather not have too many tuneables that users have to adapt to. But it would be good to see the impact of this option in any case, even if it were just a runtime flag.

Mic92 commented 2 months ago

Also, nix-eval-jobs doesn't do any evaluation caching. It wouldn't be very useful anyway, since in most cases it gets run on new commits, which would bust the cache.

Mic92 commented 2 months ago

This is how it can be benchmarked later, once the GC is fixed:

git clone https://github.com/TUM-DSE/doctor-cluster-config
cd doctor-cluster-config
hyperfine --warmup 1 'nix run --refresh github:nix-community/nix-eval-jobs  -- --flake ".#checks" --force-recurse' 'nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --flake ".#checks" --force-recurse'
GrahamDennis commented 2 months ago

> @Mic92 : If you prefer, I'm happy to put this behind an option. My motivation was that when evaluating NixOS systems, memory usage would grow substantially, particularly for systems with specialisations. Before this change, evaluating NixOS modules consumed around 20GB of RAM; after it, a single evaluation used around 8GB. This meant that using nix-eval-jobs no longer caused evaluation to use more RAM, and if we set the worker maximum memory usage above the default, we still get evaluation caching.
>
> Don't get me wrong: if it's a better strategy, we can also have this by default, and I would rather not have too many tuneables that users have to adapt to. But it would be good to see the impact of this option in any case, even if it were just a runtime flag.

Some anecdata based on a different build of largely NixOS configurations:

GrahamDennis commented 2 months ago

> This is how it can be benchmarked later, once the GC is fixed:
>
> git clone https://github.com/TUM-DSE/doctor-cluster-config
> cd doctor-cluster-config
> hyperfine --warmup 1 'nix run --refresh github:nix-community/nix-eval-jobs  -- --flake ".#checks" --force-recurse' 'nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --flake ".#checks" --force-recurse'

Ahh, yep. I missed the GC_allow_register_threads() call. Added here: https://github.com/nix-community/nix-eval-jobs/pull/310/commits/7ec5ac62f630f768cfacefa681399c13a7dbfe8e#diff-a79ded172fd76747492a417a39848b6c25c14238e65971e6a05fe81706d5048fR303

GrahamDennis commented 2 months ago

@Mic92 : Here are the requested benchmark results (thanks for explicitly describing how to do this!)

[ec2-user@ip-172-31-37-87 doctor-cluster-config]$ hyperfine --warmup 1 'nix run --refresh github:nix-community/nix-eval-jobs  -- --flake ".#checks" --force-recurse' 'nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --flake ".#checks" --force-recurse'
Benchmark 1: nix run --refresh github:nix-community/nix-eval-jobs  -- --flake ".#checks" --force-recurse
  Time (mean ± σ):     412.362 s ±  0.351 s    [User: 214.559 s, System: 73.428 s]
  Range (min … max):   411.671 s … 413.020 s    10 runs

Benchmark 2: nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --flake ".#checks" --force-recurse
  Time (mean ± σ):     360.886 s ±  1.010 s    [User: 233.327 s, System: 44.110 s]
  Range (min … max):   359.497 s … 362.201 s    10 runs

Summary
  nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --flake ".#checks" --force-recurse ran
    1.14 ± 0.00 times faster than nix run --refresh github:nix-community/nix-eval-jobs  -- --flake ".#checks" --force-recurse

I must admit I wasn't expecting a performance improvement from this change. This test was run on an AWS m5a.xlarge, which has 4 vCPUs and 16GB of RAM. The job only appeared to consume a few GB of RAM, so I believe that was more than enough.

GrahamDennis commented 2 months ago

Taking a closer look at the breakdown, my change definitely increased user time (which makes sense), but that was more than compensated for by a reduction in system time.

If you have the ability, would you mind independently reproducing these results on different hardware?

Mic92 commented 2 months ago
hyperfine --warmup 1 'nix run --refresh github:nix-community/nix-eval-jobs  -- --workers 64 --flake ".#checks" --force-recurse' 'nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --workers 64 --flake ".#checks" --force-recurse'
Benchmark 1: nix run --refresh github:nix-community/nix-eval-jobs  -- --workers 64 --flake ".#checks" --force-recurse
  Time (mean ± σ):     40.305 s ± 14.477 s    [User: 239.541 s, System: 81.986 s]
  Range (min … max):   14.090 s … 52.695 s    10 runs

Benchmark 2: nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --workers 64 --flake ".#checks" --force-recurse
  Time (mean ± σ):     55.999 s ±  2.288 s    [User: 584.544 s, System: 87.736 s]
  Range (min … max):   51.103 s … 58.996 s    10 runs

Summary
  nix run --refresh github:nix-community/nix-eval-jobs  -- --workers 64 --flake ".#checks" --force-recurse ran
    1.39 ± 0.50 times faster than nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --workers 64 --flake ".#checks" --force-recurse
hyperfine --warmup 1    9115.97s user 1886.25s system 1006% cpu 18:13.17 total

This is for a dual-socket Xeon(R) Gold 6326 CPU @ 2.90GHz (128GB DDR4) - the full hardware spec is here: https://github.com/TUM-DSE/doctor-cluster-config/blob/master/docs/hosts/jack.md

Mic92 commented 2 months ago

I am now going for nixpkgs instead. There, strangely, it seems that I am actually I/O bound rather than CPU bound, so garbage collection might make things faster. Let's see.

GrahamDennis commented 2 months ago
> hyperfine --warmup 1 'nix run --refresh github:nix-community/nix-eval-jobs  -- --workers 64 --flake ".#checks" --force-recurse' 'nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --workers 64 --flake ".#checks" --force-recurse'
> Benchmark 1: nix run --refresh github:nix-community/nix-eval-jobs  -- --workers 64 --flake ".#checks" --force-recurse
>   Time (mean ± σ):     40.305 s ± 14.477 s    [User: 239.541 s, System: 81.986 s]
>   Range (min … max):   14.090 s … 52.695 s    10 runs
>
> Benchmark 2: nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --workers 64 --flake ".#checks" --force-recurse
>   Time (mean ± σ):     55.999 s ±  2.288 s    [User: 584.544 s, System: 87.736 s]
>   Range (min … max):   51.103 s … 58.996 s    10 runs
>
> Summary
>   nix run --refresh github:nix-community/nix-eval-jobs  -- --workers 64 --flake ".#checks" --force-recurse ran
>     1.39 ± 0.50 times faster than nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --workers 64 --flake ".#checks" --force-recurse
> hyperfine --warmup 1    9115.97s user 1886.25s system 1006% cpu 18:13.17 total
>
> This is for a dual-socket Xeon(R) Gold 6326 CPU @ 2.90GHz (128GB DDR4) - the full hardware spec is here: https://github.com/TUM-DSE/doctor-cluster-config/blob/master/docs/hosts/jack.md

This outcome is closer to what I expected, however there's substantial variance in the first benchmark:

> Time (mean ± σ):     40.305 s ± 14.477 s    [User: 239.541 s, System: 81.986 s]
> Range (min … max):   14.090 s … 52.695 s    10 runs

On a 128GB machine with 64 workers, I could imagine you might be hitting the limits of available memory (although if that was the case I might have expected the maximum times to be much longer).

Do you think it would make sense to re-run with 32 workers? Although that should really only make benchmark 1 look even better relative to benchmark 2, so it's not looking too good for this PR.

Mic92 commented 2 months ago

Yeah, I can do another run with fewer workers. I also have some 2TB RAM machines otherwise, if we just want to compare pure computational overhead.

Mic92 commented 2 months ago

Same machine, fewer threads:

% hyperfine --warmup 1 'nix run --refresh github:nix-community/nix-eval-jobs  -- --workers 16 --flake ".#checks" --force-recurse' 'nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --workers 16 --flake ".#checks" --force-recurse'
Benchmark 1: nix run --refresh github:nix-community/nix-eval-jobs  -- --workers 16 --flake ".#checks" --force-recurse
  Time (mean ± σ):     38.454 s ±  2.140 s    [User: 171.552 s, System: 55.075 s]
  Range (min … max):   35.049 s … 41.918 s    10 runs

Benchmark 2: nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --workers 16 --flake ".#checks" --force-recurse
  Time (mean ± σ):     49.515 s ±  2.527 s    [User: 329.075 s, System: 47.395 s]
  Range (min … max):   45.880 s … 55.076 s    10 runs

Summary
  nix run --refresh github:nix-community/nix-eval-jobs  -- --workers 16 --flake ".#checks" --force-recurse ran
    1.29 ± 0.10 times faster than nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --workers 16 --flake ".#checks" --force-recurse
 hyperfine --warmup 1    5511.43s user 1127.80s system 691% cpu 15:59.77 total

I don't think I even had 64 jobs to evaluate to begin with, to be honest.

GrahamDennis commented 2 months ago

> Same machine, fewer threads:
>
> % hyperfine --warmup 1 'nix run --refresh github:nix-community/nix-eval-jobs  -- --workers 16 --flake ".#checks" --force-recurse' 'nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --workers 16 --flake ".#checks" --force-recurse'
> Benchmark 1: nix run --refresh github:nix-community/nix-eval-jobs  -- --workers 16 --flake ".#checks" --force-recurse
>   Time (mean ± σ):     38.454 s ±  2.140 s    [User: 171.552 s, System: 55.075 s]
>   Range (min … max):   35.049 s … 41.918 s    10 runs
>
> Benchmark 2: nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --workers 16 --flake ".#checks" --force-recurse
>   Time (mean ± σ):     49.515 s ±  2.527 s    [User: 329.075 s, System: 47.395 s]
>   Range (min … max):   45.880 s … 55.076 s    10 runs
>
> Summary
>   nix run --refresh github:nix-community/nix-eval-jobs  -- --workers 16 --flake ".#checks" --force-recurse ran
>     1.29 ± 0.10 times faster than nix run --refresh github:GrahamDennis/nix-eval-jobs/gdennis/enable-memory-garbage-collection -- --workers 16 --flake ".#checks" --force-recurse
> hyperfine --warmup 1    5511.43s user 1127.80s system 691% cpu 15:59.77 total
>
> I don't think I even had 64 jobs to evaluate to begin with, to be honest.

OK, that's pretty clear. I'll need to take a different approach to see if I can minimise the GC overhead.