On a machine with many cores, snapshotting large repos is very CPU-intensive

mindajar commented 2 months ago

Description

Steps to Reproduce the Problem

Check out a very large repo (here, n = ~150,000 files) on a machine with many cores (here, n = 24)
Run jj st (or just jj) and measure the time it takes for the command to complete.
export RAYON_NUM_THREADS=4
Repeat step 2

Expected Behavior

Similar performance in both cases.

Actual Behavior

jj's default behavior: 2.28 real 0.34 user 32.26 sys jj limited to four threads: 1.13 real 0.16 user 0.49 sys

This one took some doing and profiling to figure out, as it didn't immediately make sense that the same working copy is so much faster to work with on a much smaller machine.

Specifications

Platform: macOS
Version: 14, 15

arxanas commented 2 months ago

Interesting. Can't look now, but it's possible we serialize all tree updates into a single channel and it's contending when there are more threads. It would be in the working copy snapshot code somewhere.

One fix could be to batch the channel sends.
Another idea could be to use Rayon's parallel iterator bridge to consume the channel contents (which would do batching/work stealing).
It might work to do an explicit map/reduce across worker threads, although I don't recall if Rayon easily supports this pattern.

To fix your immediate issue, you can also try enabling the Watchman fsmonitor.

yuja commented 2 months ago

I heard (iirc when I was working on Mercurial) that it's sometimes faster to scan directory entries sequentially than splitting jobs to worker processes, which tends to lead to random access. I don't know this is the case, though.

mindajar commented 2 months ago

Not even a little bit urgent -- I was mostly bewildered at what I could possibly have broken on the big machine to make jj status peg all CPUs, and I couldn't stop poking at it until I figured it out :)

(watchman is currently broken in MacPorts, which is how I ended up here)

thoughtpolice commented 2 months ago

I can't reproduce this on my 32 core Zen 1 machine (Linux 6.10) with gecko-dev, which is ~1mil commits and 380k working set files. In fact it never gets slower, but it's nowhere close to linear speedup; 32 cores is only ~2x faster than 4 cores (1.20s vs 0.7s). Would you be willing to try this with a repository like gecko-dev and report back to see what it says? It would make it easier for baseline comparisons, at least.

I suspect two things:

macOS. As we all know, it's an operating system that is different from Linux.
Hardware. You're probably using one of those Mac Studios? I'm guessing it's the 16P+8E configuration. Trying to do this across every core probably has some consequence. I don't remember how to do this but there are ways to force apps to use specific subsets of cores. Would you be willing to test a few combinations of threads on different cores with various RAYON_NUM_THREADS? e.g. 16 P cores versus 8 P cores versus 8 E cores?

I don't have a Studio but I do have a M2 Air, which coincidentally dual boots Fedora. So, if I get a chance I can see how it all shakes out on both systems, Linux vs macOS, but it's only 4P+4E, so it's not going to be as big a deal I suspect.

If it turns out that some other core configuration gives big improvements we can probably make a change to the scheduling policy somehow before we use Rayon so jj sticks to the right settings; a blunt hammer can then be applied with some patch to achieve that.

Note that I couldn't reliably clone gecko-dev from GitHub in one-go due to network errors, so I had to clone a 1-height shallow repo and then 'saturate it' by unshallowing for it to work:

git clone https://github.com/mozilla/gecko-dev --depth 1
cd gecko-dev
git fetch --unshallow
jj git init --colocate

mindajar commented 2 months ago

Yes, this is a 16P+8E Mac Studio.

I noticed while testing this that OS caches seem to get evicted pretty quickly; after not that many seconds, a re-run is noticeably slower. I don't understand why, but thought it was interesting.

I've not figured out how to control QoS to the degree you describe, but taskpolicy(8) offers some coarse-grained control. A (lightly edited for readability) transcript:

gecko-dev % jj version
jj 0.21.0-ac605d2e7bc71e462515f8c423fbc0437f18b363
gecko-dev % jj st
The working copy is clean
Working copy : snzmnpzp 24928d98 (empty) (no description set)
Parent commit: srvlssxw 50498861 master | Backed out changeset 58983adca2f1 (bug 1916328) for causing dt failures @ browser_parsable_css.js
gecko-dev % jj file list | wc -l
  373973
gecko-dev % echo $RAYON_NUM_THREADS

gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      4.099 s ±  0.517 s    [User: 2.834 s, System: 55.348 s]
  Range (min … max):    3.697 s …  5.237 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
  Time (mean ± σ):      6.803 s ±  0.418 s    [User: 5.987 s, System: 38.212 s]
  Range (min … max):    6.267 s …  7.599 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
  Time (mean ± σ):      6.938 s ±  0.431 s    [User: 6.578 s, System: 49.789 s]
  Range (min … max):    6.014 s …  7.399 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
  Time (mean ± σ):      4.249 s ±  0.371 s    [User: 2.839 s, System: 58.087 s]
  Range (min … max):    3.853 s …  5.065 s    10 runs

gecko-dev % export RAYON_NUM_THREADS=8
gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      2.341 s ±  0.018 s    [User: 1.710 s, System: 9.140 s]
  Range (min … max):    2.319 s …  2.376 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
  Time (mean ± σ):      6.951 s ±  0.447 s    [User: 5.700 s, System: 27.704 s]
  Range (min … max):    6.319 s …  7.838 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
  Time (mean ± σ):      7.003 s ±  0.786 s    [User: 5.561 s, System: 27.330 s]
  Range (min … max):    5.456 s …  8.334 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
  Time (mean ± σ):      2.567 s ±  0.110 s    [User: 1.731 s, System: 9.194 s]
  Range (min … max):    2.366 s …  2.692 s    10 runs

gecko-dev % export RAYON_NUM_THREADS=4
gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      3.232 s ±  0.279 s    [User: 1.427 s, System: 5.208 s]
  Range (min … max):    2.951 s …  3.898 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
  Time (mean ± σ):      9.691 s ±  0.729 s    [User: 5.024 s, System: 21.260 s]
  Range (min … max):    7.840 s … 10.256 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
  Time (mean ± σ):      9.670 s ±  0.735 s    [User: 4.990 s, System: 21.110 s]
  Range (min … max):    8.341 s … 10.393 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
  Time (mean ± σ):      3.784 s ±  0.211 s    [User: 1.476 s, System: 5.713 s]
  Range (min … max):    3.454 s …  4.170 s    10 runs

jfchevrette commented 6 days ago

I'm seeing the same issue on my M3 Pro mac

The repo I'm testing with is https://github.com/NixOS/nixpkgs/

$ jj version
jj 0.23.0

$ sw_vers
ProductName:            macOS
ProductVersion:         15.1
BuildVersion:           24B83

$ sysctl -n machdep.cpu.brand_string
Apple M3 Pro

$ echo $RAYON_NUM_THREADS

$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      1.799 s ±  0.519 s    [User: 0.564 s, System: 17.117 s]
  Range (min … max):    1.233 s …  2.600 s    10 runs

$ export RAYON_NUM_THREADS=8
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):     776.6 ms ±  33.5 ms    [User: 453.1 ms, System: 4901.4 ms]
  Range (min … max):   748.0 ms … 856.7 ms    10 runs

$ export RAYON_NUM_THREADS=4
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):     454.3 ms ±   9.3 ms    [User: 357.5 ms, System: 1099.9 ms]
  Range (min … max):   445.3 ms … 475.3 ms    10 runs

$ export RAYON_NUM_THREADS=2
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):     679.7 ms ±  14.9 ms    [User: 332.4 ms, System: 908.6 ms]
  Range (min … max):   649.3 ms … 704.0 ms    10 runs

martinvonz / jj