mindajar opened this issue 2 months ago
Interesting. Can't look now, but it's possible we serialize all tree updates into a single channel and it's contending when there are more threads. It would be in the working copy snapshot code somewhere.
To fix your immediate issue, you can also try enabling the Watchman fsmonitor.
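If the single-channel theory is right, the shape of the problem would look roughly like the sketch below (purely illustrative, not jj's actual code; all names are made up): many workers feed one channel, and the lone receiver serializes every update, so extra threads mostly add send-side contention rather than throughput.

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative only: workers scan disjoint chunks of the working copy,
// but every tree update funnels through one channel. The single receiver
// applies updates serially, so adding workers mainly adds contention.
fn snapshot(paths: Vec<String>, workers: usize) {
    let (tx, rx) = mpsc::channel::<String>();
    let chunk = (paths.len() / workers.max(1)).max(1);
    let handles: Vec<_> = paths
        .chunks(chunk)
        .map(|slice| {
            let tx = tx.clone();
            let slice = slice.to_vec();
            thread::spawn(move || {
                for path in slice {
                    tx.send(path).unwrap(); // stand-in for "send one tree update"
                }
            })
        })
        .collect();
    drop(tx); // close our handle; the channel ends when all workers finish
    for _update in rx {
        // Every update lands here, on a single thread, in arrival order.
    }
    for h in handles {
        h.join().unwrap();
    }
}
```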
I heard (IIRC from when I was working on Mercurial) that it's sometimes faster to scan directory entries sequentially than to split the jobs across worker processes, which tends to lead to random access. I don't know whether that's the case here, though.
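To make that concrete, here is a minimal sketch of the two traversal strategies (my own illustration, not Mercurial's or jj's code): the sequential walk visits entries in whatever order the directory yields them, which plays well with kernel readahead, while the Rayon-style walk interleaves stat calls from many threads, which looks like random access from the disk's point of view.

```rust
use rayon::prelude::*;
use std::fs;
use std::path::Path;

// Sequential walk: entries are visited in on-disk order.
fn scan_sequential(dir: &Path) -> usize {
    let Ok(entries) = fs::read_dir(dir) else { return 0 };
    let mut count = 0;
    for entry in entries.flatten() {
        match entry.file_type() {
            Ok(ft) if ft.is_dir() => count += scan_sequential(&entry.path()),
            Ok(_) => count += 1,
            Err(_) => {}
        }
    }
    count
}

// Parallel walk: subdirectories fan out to Rayon workers, so stat calls
// from many threads interleave in effectively random order.
fn scan_parallel(dir: &Path) -> usize {
    let Ok(entries) = fs::read_dir(dir) else { return 0 };
    entries
        .flatten()
        .collect::<Vec<_>>()
        .into_par_iter()
        .map(|entry| match entry.file_type() {
            Ok(ft) if ft.is_dir() => scan_parallel(&entry.path()),
            Ok(_) => 1,
            Err(_) => 0,
        })
        .sum()
}
```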
Not even a little bit urgent -- I was mostly bewildered at what I could possibly have broken on the big machine to make jj status peg all CPUs, and I couldn't stop poking at it until I figured it out :)
(Watchman is currently broken in MacPorts, which is how I ended up here)
I can't reproduce this on my 32-core Zen 1 machine (Linux 6.10) with gecko-dev, which has ~1 million commits and ~380k files in the working set. In fact it never gets slower, but it's nowhere close to linear speedup; 32 cores is only ~2x faster than 4 cores (1.20 s at 4 cores vs 0.7 s at 32). Would you be willing to try this with a repository like gecko-dev and report back with what it says? That would at least make baseline comparisons easier.
I suspect two things: the number of threads, and the P versus E core split. Have you tried different values of RAYON_NUM_THREADS? e.g. 16 P cores versus 8 P cores versus 8 E cores? I don't have a Studio, but I do have an M2 Air, which coincidentally dual-boots Fedora. So, if I get a chance, I can see how it all shakes out on both systems, Linux vs macOS, but it's only 4P+4E, so I suspect it's not going to be as big a deal.
If it turns out that some other core configuration gives big improvements, we can probably change the scheduling policy somehow before we use Rayon so that jj sticks to the right settings; a blunt hammer applied via some patch could then achieve that.
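Rayon does let a program pin its global pool before first use, and an explicit num_threads() takes precedence over the RAYON_NUM_THREADS environment variable. A minimal sketch of that kind of blunt hammer (the halve-the-logical-cores heuristic is purely illustrative, not a proposed policy):

```rust
// Sketch: cap Rayon's global pool before any parallel work runs.
// An explicit num_threads() overrides RAYON_NUM_THREADS.
// "Half the logical cores, minimum one" is illustrative only.
fn init_thread_pool() {
    let threads = std::thread::available_parallelism()
        .map(|n| (n.get() / 2).max(1))
        .unwrap_or(4);
    rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .build_global()
        .expect("Rayon global pool was already initialized");
}
```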
Note that I couldn't reliably clone gecko-dev from GitHub in one go due to network errors, so I had to clone a depth-1 shallow repo and then 'saturate' it by unshallowing:
git clone https://github.com/mozilla/gecko-dev --depth 1
cd gecko-dev
git fetch --unshallow
jj git init --colocate
Yes, this is a 16P+8E Mac Studio.
I noticed while testing this that OS caches seem to get evicted pretty quickly; after not that many seconds, a re-run is noticeably slower. I don't understand why, but thought it was interesting.
I've not figured out how to control QoS to the degree you describe, but taskpolicy(8) offers some coarse-grained control (an untested per-thread sketch follows the transcript below). A (lightly edited for readability) transcript:
gecko-dev % jj version
jj 0.21.0-ac605d2e7bc71e462515f8c423fbc0437f18b363
gecko-dev % jj st
The working copy is clean
Working copy : snzmnpzp 24928d98 (empty) (no description set)
Parent commit: srvlssxw 50498861 master | Backed out changeset 58983adca2f1 (bug 1916328) for causing dt failures @ browser_parsable_css.js
gecko-dev % jj file list | wc -l
373973
gecko-dev % echo $RAYON_NUM_THREADS
gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
Time (mean ± σ): 4.099 s ± 0.517 s [User: 2.834 s, System: 55.348 s]
Range (min … max): 3.697 s … 5.237 s 10 runs
gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
Time (mean ± σ): 6.803 s ± 0.418 s [User: 5.987 s, System: 38.212 s]
Range (min … max): 6.267 s … 7.599 s 10 runs
gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
Time (mean ± σ): 6.938 s ± 0.431 s [User: 6.578 s, System: 49.789 s]
Range (min … max): 6.014 s … 7.399 s 10 runs
gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
Time (mean ± σ): 4.249 s ± 0.371 s [User: 2.839 s, System: 58.087 s]
Range (min … max): 3.853 s … 5.065 s 10 runs
gecko-dev % export RAYON_NUM_THREADS=8
gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
Time (mean ± σ): 2.341 s ± 0.018 s [User: 1.710 s, System: 9.140 s]
Range (min … max): 2.319 s … 2.376 s 10 runs
gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
Time (mean ± σ): 6.951 s ± 0.447 s [User: 5.700 s, System: 27.704 s]
Range (min … max): 6.319 s … 7.838 s 10 runs
gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
Time (mean ± σ): 7.003 s ± 0.786 s [User: 5.561 s, System: 27.330 s]
Range (min … max): 5.456 s … 8.334 s 10 runs
gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
Time (mean ± σ): 2.567 s ± 0.110 s [User: 1.731 s, System: 9.194 s]
Range (min … max): 2.366 s … 2.692 s 10 runs
gecko-dev % export RAYON_NUM_THREADS=4
gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
Time (mean ± σ): 3.232 s ± 0.279 s [User: 1.427 s, System: 5.208 s]
Range (min … max): 2.951 s … 3.898 s 10 runs
gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
Time (mean ± σ): 9.691 s ± 0.729 s [User: 5.024 s, System: 21.260 s]
Range (min … max): 7.840 s … 10.256 s 10 runs
gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
Time (mean ± σ): 9.670 s ± 0.735 s [User: 4.990 s, System: 21.110 s]
Range (min … max): 8.341 s … 10.393 s 10 runs
gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
Time (mean ± σ): 3.784 s ± 0.211 s [User: 1.476 s, System: 5.713 s]
Range (min … max): 3.454 s … 4.170 s 10 runs
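For per-thread rather than process-wide control, something like the following might work, assuming the libc crate's Apple bindings for pthread_set_qos_class_self_np and Rayon's start_handler hook; this is an untested sketch, and QOS_CLASS_UTILITY is just one plausible choice.

```rust
// macOS-only sketch: give every Rayon worker the Utility QoS class at
// startup, roughly the per-thread analogue of `taskpolicy -c utility`.
#[cfg(target_os = "macos")]
fn init_pool_with_qos() {
    rayon::ThreadPoolBuilder::new()
        .start_handler(|_thread_index| unsafe {
            // The second argument is the relative priority within the class.
            libc::pthread_set_qos_class_self_np(libc::qos_class_t::QOS_CLASS_UTILITY, 0);
        })
        .build_global()
        .expect("Rayon global pool was already initialized");
}
```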
I'm seeing the same issue on my M3 Pro Mac.
The repo I'm testing with is https://github.com/NixOS/nixpkgs/
$ jj version
jj 0.23.0
$ sw_vers
ProductName: macOS
ProductVersion: 15.1
BuildVersion: 24B83
$ sysctl -n machdep.cpu.brand_string
Apple M3 Pro
$ echo $RAYON_NUM_THREADS
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
Time (mean ± σ): 1.799 s ± 0.519 s [User: 0.564 s, System: 17.117 s]
Range (min … max): 1.233 s … 2.600 s 10 runs
$ export RAYON_NUM_THREADS=8
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
Time (mean ± σ): 776.6 ms ± 33.5 ms [User: 453.1 ms, System: 4901.4 ms]
Range (min … max): 748.0 ms … 856.7 ms 10 runs
$ export RAYON_NUM_THREADS=4
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
Time (mean ± σ): 454.3 ms ± 9.3 ms [User: 357.5 ms, System: 1099.9 ms]
Range (min … max): 445.3 ms … 475.3 ms 10 runs
$ export RAYON_NUM_THREADS=2
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
Time (mean ± σ): 679.7 ms ± 14.9 ms [User: 332.4 ms, System: 908.6 ms]
Range (min … max): 649.3 ms … 704.0 ms 10 runs
Description
On a machine with many cores, snapshotting large repos is very CPU-intensive.
Steps to Reproduce the Problem
1. Run jj st (or just jj) and measure the time it takes for the command to complete.
2. export RAYON_NUM_THREADS=4 and repeat.
Expected Behavior
Similar performance in both cases.
Actual Behavior
jj's default behavior: 2.28 real  0.34 user  32.26 sys
jj limited to four threads: 1.13 real  0.16 user  0.49 sys
This one took some doing and profiling to figure out, as it didn't immediately make sense that the same working copy is so much faster to work with on a much smaller machine.
Specifications