martinvonz / jj

A Git-compatible VCS that is both simple and powerful
https://martinvonz.github.io/jj/
Apache License 2.0

On a machine with many cores, snapshotting large repos is very CPU-intensive #4508

Open mindajar opened 2 months ago

mindajar commented 2 months ago

Description

On a machine with many cores, snapshotting large repos is very CPU-intensive

Steps to Reproduce the Problem

  1. Check out a very large repo (here, n = ~150,000 files) on a machine with many cores (here, n = 24)
  2. Run jj st (or just jj) and measure the time it takes for the command to complete.
  3. export RAYON_NUM_THREADS=4
  4. Repeat step 2

Expected Behavior

Similar performance in both cases.

Actual Behavior

jj's default behavior:        2.28 real   0.34 user  32.26 sys
jj limited to four threads:   1.13 real   0.16 user   0.49 sys

This one took some doing and some profiling to figure out, as it didn't immediately make sense that the same working copy would be so much faster to work with on a much smaller machine.

Specifications

arxanas commented 2 months ago

Interesting. I can't look now, but it's possible we serialize all tree updates into a single channel, and that channel becomes a contention point when there are more threads. It would be in the working-copy snapshot code somewhere.
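
For illustration, here's a minimal sketch of the pattern I mean (toy code, not jj's actual snapshot implementation): many Rayon workers funneling results into one shared channel, where the single collector becomes the choke point as the pool grows:

use std::sync::mpsc;
use std::thread;

fn main() {
    // Stand-in for ~150k working-copy paths to snapshot.
    let paths: Vec<u32> = (0..150_000).collect();

    // A single channel collects every tree update.
    let (tx, rx) = mpsc::channel();

    // One collector drains the channel and applies updates sequentially.
    let collector = thread::spawn(move || rx.iter().count());

    rayon::scope(|s| {
        for chunk in paths.chunks(1_000) {
            let tx = tx.clone();
            s.spawn(move |_| {
                for p in chunk {
                    // "Snapshot" the file, then push the result through
                    // the one shared channel; with many worker threads,
                    // this is where they all pile up.
                    tx.send(*p).unwrap();
                }
            });
        }
    });
    drop(tx); // close the channel so the collector can finish

    println!("collected {} updates", collector.join().unwrap());
}

If that is what's happening, per-thread buffering or a sharded collector would presumably relieve it.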

To fix your immediate issue, you can also try enabling the Watchman fsmonitor.
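
If you have Watchman installed, that's a one-line config change (core.fsmonitor is the documented setting):

jj config set --user core.fsmonitor "watchman"

With the fsmonitor enabled, jj should only need to re-examine the paths Watchman reports as changed, instead of walking the whole tree on every command.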

yuja commented 2 months ago

I heard (IIRC, while I was working on Mercurial) that it's sometimes faster to scan directory entries sequentially than to split the work across workers, since the latter tends to lead to random access. I don't know if that's the case here, though.
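
As a rough illustration of the two access patterns (a toy sketch, not Mercurial's or jj's code):

use rayon::prelude::*;
use std::fs;
use std::path::{Path, PathBuf};

// Sequential scan: visits entries in directory order, which plays well
// with the kernel's dentry cache and readahead.
fn scan_seq(dir: &Path, out: &mut Vec<PathBuf>) -> std::io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            scan_seq(&path, out)?;
        } else {
            out.push(path);
        }
    }
    Ok(())
}

// Parallel scan: every subdirectory becomes a Rayon task, so with many
// threads the stat calls interleave across the whole tree -- effectively
// random access from the filesystem's point of view.
fn scan_par(dir: &Path) -> Vec<PathBuf> {
    let entries: Vec<PathBuf> = match fs::read_dir(dir) {
        Ok(iter) => iter.filter_map(Result::ok).map(|e| e.path()).collect(),
        Err(_) => return Vec::new(),
    };
    entries
        .into_par_iter()
        .flat_map(|path| {
            if path.is_dir() {
                scan_par(&path)
            } else {
                vec![path]
            }
        })
        .collect()
}

fn main() {
    let mut seq = Vec::new();
    scan_seq(Path::new("."), &mut seq).unwrap();
    let par = scan_par(Path::new("."));
    println!("seq: {} files, par: {} files", seq.len(), par.len());
}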

mindajar commented 2 months ago

Not even a little bit urgent -- I was mostly bewildered at what I could possibly have broken on the big machine to make jj status peg all CPUs, and I couldn't stop poking at it until I figured it out :)

(watchman is currently broken in MacPorts, which is how I ended up here)

thoughtpolice commented 2 months ago

I can't reproduce this on my 32-core Zen 1 machine (Linux 6.10) with gecko-dev, which has ~1 million commits and ~380k files in the working set. It never gets slower with more cores, but it's nowhere close to linear speedup; 32 cores is only ~2x faster than 4 cores (1.20s at 4 cores vs 0.7s at 32). Would you be willing to try this with a repository like gecko-dev and report back? It would at least make baseline comparisons easier.

I suspect two things: that this is macOS-specific, and that the asymmetric P/E core layout plays a part.

I don't have a Studio, but I do have an M2 Air, which coincidentally dual-boots Fedora. So if I get a chance I can see how it all shakes out on both systems, Linux vs macOS; it's only 4P+4E, though, so I suspect the effect won't be as pronounced.

If it turns out that some particular core configuration gives big improvements, we can probably adjust the scheduling policy before handing work to Rayon so that jj sticks to the right settings; a blunt-hammer patch would be enough to achieve that.
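
Concretely, the blunt hammer could be as small as initializing Rayon's global pool ourselves at startup. A sketch, with the 8-thread cap as an arbitrary placeholder rather than a measured optimum:

use rayon::ThreadPoolBuilder;

fn init_rayon_pool() {
    // Honor an explicit RAYON_NUM_THREADS (Rayon would normally read it
    // itself, but num_threads() overrides that), otherwise cap the pool
    // below the machine's logical core count.
    let default_cap = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
        .min(8); // placeholder cap, not a measured optimum
    let threads = std::env::var("RAYON_NUM_THREADS")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(default_cap);
    ThreadPoolBuilder::new()
        .num_threads(threads)
        .build_global()
        .expect("Rayon global pool already initialized");
}

fn main() {
    init_rayon_pool();
    println!("Rayon threads: {}", rayon::current_num_threads());
}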


Note that I couldn't reliably clone gecko-dev from GitHub in one go due to network errors, so I had to make a depth-1 shallow clone and then 'saturate' it by unshallowing:

git clone https://github.com/mozilla/gecko-dev --depth 1
cd gecko-dev
git fetch --unshallow
jj git init --colocate
mindajar commented 2 months ago

Yes, this is a 16P+8E Mac Studio.

I noticed while testing this that OS caches seem to get evicted pretty quickly; after not that many seconds, a re-run is noticeably slower. I don't understand why, but thought it was interesting.

I've not figured out how to control QoS to the degree you describe, but taskpolicy(8) offers some coarse-grained control. A (lightly edited for readability) transcript:

gecko-dev % jj version
jj 0.21.0-ac605d2e7bc71e462515f8c423fbc0437f18b363
gecko-dev % jj st
The working copy is clean
Working copy : snzmnpzp 24928d98 (empty) (no description set)
Parent commit: srvlssxw 50498861 master | Backed out changeset 58983adca2f1 (bug 1916328) for causing dt failures @ browser_parsable_css.js
gecko-dev % jj file list | wc -l
  373973
gecko-dev % echo $RAYON_NUM_THREADS

gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      4.099 s ±  0.517 s    [User: 2.834 s, System: 55.348 s]
  Range (min … max):    3.697 s …  5.237 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
  Time (mean ± σ):      6.803 s ±  0.418 s    [User: 5.987 s, System: 38.212 s]
  Range (min … max):    6.267 s …  7.599 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
  Time (mean ± σ):      6.938 s ±  0.431 s    [User: 6.578 s, System: 49.789 s]
  Range (min … max):    6.014 s …  7.399 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
  Time (mean ± σ):      4.249 s ±  0.371 s    [User: 2.839 s, System: 58.087 s]
  Range (min … max):    3.853 s …  5.065 s    10 runs

gecko-dev % export RAYON_NUM_THREADS=8
gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      2.341 s ±  0.018 s    [User: 1.710 s, System: 9.140 s]
  Range (min … max):    2.319 s …  2.376 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
  Time (mean ± σ):      6.951 s ±  0.447 s    [User: 5.700 s, System: 27.704 s]
  Range (min … max):    6.319 s …  7.838 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
  Time (mean ± σ):      7.003 s ±  0.786 s    [User: 5.561 s, System: 27.330 s]
  Range (min … max):    5.456 s …  8.334 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
  Time (mean ± σ):      2.567 s ±  0.110 s    [User: 1.731 s, System: 9.194 s]
  Range (min … max):    2.366 s …  2.692 s    10 runs

gecko-dev % export RAYON_NUM_THREADS=4
gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      3.232 s ±  0.279 s    [User: 1.427 s, System: 5.208 s]
  Range (min … max):    2.951 s …  3.898 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
  Time (mean ± σ):      9.691 s ±  0.729 s    [User: 5.024 s, System: 21.260 s]
  Range (min … max):    7.840 s … 10.256 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
  Time (mean ± σ):      9.670 s ±  0.735 s    [User: 4.990 s, System: 21.110 s]
  Range (min … max):    8.341 s … 10.393 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
  Time (mean ± σ):      3.784 s ±  0.211 s    [User: 1.476 s, System: 5.713 s]
  Range (min … max):    3.454 s …  4.170 s    10 runs

jfchevrette commented 6 days ago

I'm seeing the same issue on my M3 Pro Mac.

The repo I'm testing with is https://github.com/NixOS/nixpkgs/

$ jj version
jj 0.23.0

$ sw_vers
ProductName:            macOS
ProductVersion:         15.1
BuildVersion:           24B83

$ sysctl -n machdep.cpu.brand_string
Apple M3 Pro

$ echo $RAYON_NUM_THREADS

$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      1.799 s ±  0.519 s    [User: 0.564 s, System: 17.117 s]
  Range (min … max):    1.233 s …  2.600 s    10 runs

$ export RAYON_NUM_THREADS=8
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):     776.6 ms ±  33.5 ms    [User: 453.1 ms, System: 4901.4 ms]
  Range (min … max):   748.0 ms … 856.7 ms    10 runs

$ export RAYON_NUM_THREADS=4
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):     454.3 ms ±   9.3 ms    [User: 357.5 ms, System: 1099.9 ms]
  Range (min … max):   445.3 ms … 475.3 ms    10 runs

$ export RAYON_NUM_THREADS=2
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):     679.7 ms ±  14.9 ms    [User: 332.4 ms, System: 908.6 ms]
  Range (min … max):   649.3 ms … 704.0 ms    10 runs