Just dropping a back link to https://github.com/pubgrub-rs/pubgrub/pull/59
The new version is slightly slower but passes the tests:
```console
$ hyperfine --warmup 1 --runs 3 "target/profiling/main bio_embeddings.in" "target/profiling/branch bio_embeddings.in"
Benchmark 1: target/profiling/main bio_embeddings.in
  Time (mean ± σ):     12.007 s ±  0.035 s    [User: 11.654 s, System: 0.301 s]
  Range (min … max):   11.977 s … 12.045 s    3 runs

Benchmark 2: target/profiling/branch bio_embeddings.in
  Time (mean ± σ):      6.961 s ±  0.017 s    [User: 6.565 s, System: 0.323 s]
  Range (min … max):    6.942 s …  6.974 s    3 runs

Summary
  target/profiling/branch bio_embeddings.in ran
    1.72 ± 0.01 times faster than target/profiling/main bio_embeddings.in
```
It spends 45% in `union`, mostly `Arc` drop/add and `Vec` pushes.
I had trouble thinking useful thoughts looking at the full diff all at once, so I pulled it into my editor and started a new branch, adding each new piece of functionality one at a time in separate commits, then adding the tests and cleaning up the nits I saw before moving on to the next piece of functionality. The result is https://github.com/pubgrub-rs/pubgrub/tree/jf/177-review; consider it a "yes, and" to the wonderful work in this PR. But I'm not sure, on a practical level, how to take the next step. Do you want to pull those changes into this PR? Or should I start a new one?
Anyway, I'm gonna go run some benchmarks.
Thanks! I force-pushed your commits to this branch (there's a backup branch too, just in case) and added a clippy commit on top.
| time | Dev | PR | % |
|---|---|---|---|
| sudoku | 44s | 30s | 32% |
| elm | 274ms | 277ms | --- |
| large | 11ms | 10ms | 9% |
| slow_135 | 19ms | 6.3ms | 66% |
| zuse | 311ms | 297ms | 4.5% |
| hobo* | 1045s | 851s | 18% |
@mpizenberg does the "real world" improvement of 12s to 3s on the bio_embeddings requirements in uv make this feel more worthwhile? Does that suggest that there is a missing niche in the benchmarks for the project (i.e. that Jacob ran)?
> @mpizenberg does the "real world" improvement of 12s to 3s on the bio_embeddings requirements in uv make this feel more worthwhile? Does that suggest that there is a missing niche in the benchmarks for the project (i.e. that Jacob ran)?
Indeed that's a substantial difference!
Do you know if some of the inlinings produce most of the improvements, or if they all make small improvements that total to the big difference?
> Do you know if some of the inlinings produce most of the improvements, or if they all make small improvements that total to the big difference?
This work is entirely profiling- and benchmark-driven. I started off with two smaller problems: `Range::contains` being slower than needed, and excessive clone/drop in `prior_cause`. After the first two commits, you can see in the flamegraphs in the PR description that it's almost entirely `union` and `intersection` calls, slow due to excessive clone/drop (cor... in the flamegraphs). We had previously seen the slow drop/clone and put our version in an `Arc`, so all you see now are clone/drop on `Arc`s.
`satisfier` was calling `intersection` a lot, but it wasn't interested in the intersection itself, it just wanted to know whether the terms are disjoint, so I special-cased this into `is_disjoint`. The `union` calls were calling `intersection` once and `complement` three times (union via De Morgan: `a ∪ b = ¬(¬a ∩ ¬b)`), so I unrolled `union` into a version that clones as little as possible. I don't have numbers on hand for the individual changes, but each made a noticeable difference. A sketch of the `is_disjoint` idea is below.
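To make the shape of that optimization concrete, here is a minimal, hypothetical sketch, not pubgrub's actual code: version sets are modeled as sorted, disjoint, half-open `u32` intervals, and `is_disjoint` runs the same two-pointer sweep as `intersection` but allocates nothing and returns as soon as any overlap is found.

```rust
/// Illustrative stand-in for a version set: sorted, disjoint,
/// half-open intervals [start, end). Not pubgrub's representation.
#[derive(Clone, Debug, PartialEq)]
struct Ranges(Vec<(u32, u32)>);

impl Ranges {
    /// Full intersection: a two-pointer sweep that allocates a new
    /// segment vector.
    fn intersection(&self, other: &Ranges) -> Ranges {
        let (mut i, mut j, mut out) = (0, 0, Vec::new());
        while i < self.0.len() && j < other.0.len() {
            let (a, b) = (self.0[i], other.0[j]);
            let (lo, hi) = (a.0.max(b.0), a.1.min(b.1));
            if lo < hi {
                out.push((lo, hi));
            }
            // Advance whichever segment ends first.
            if a.1 <= b.1 { i += 1 } else { j += 1 }
        }
        Ranges(out)
    }

    /// Special case: the same sweep, but with no allocation and an
    /// early return on the first overlap.
    fn is_disjoint(&self, other: &Ranges) -> bool {
        let (mut i, mut j) = (0, 0);
        while i < self.0.len() && j < other.0.len() {
            let (a, b) = (self.0[i], other.0[j]);
            if a.0.max(b.0) < a.1.min(b.1) {
                return false; // overlapping segment found
            }
            if a.1 <= b.1 { i += 1 } else { j += 1 }
        }
        true
    }
}

fn main() {
    let a = Ranges(vec![(0, 2), (5, 8)]);
    let b = Ranges(vec![(2, 5)]);
    assert_eq!(a.intersection(&b), Ranges(vec![]));
    assert!(a.is_disjoint(&b));
}
```

The unrolled `union` follows the same spirit: instead of materializing three complements per the De Morgan definition, a direct merge over the two segment lists clones as little as possible.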
> I was previously skeptical of this kind of inlining degrading the expressiveness of the computations.
Yesterday I went back and looked at my code from when I originally proposed this. You are right to reject that code. It was inscrutable. The code Konstin has now is much better.
> does the "real world" improvement of 12s to 3s on the bio_embeddings requirements in uv make this feel more worthwhile? Does that suggest that there is a missing niche in the benchmarks for the project (i.e. that Jacob ran)?
There is definitely a gap in the benchmarking infrastructure I have available: I can't seem to get a benchmark that has the performance characteristics of your real-world data. Help on this would definitely be appreciated. The closest I have is `slow_135`, which is not nearly as slow as your original report but has a similarly shaped profile. The 66% improvement for that benchmark is nicely similar to the 75% you're reporting.
> We had previously seen the slow drop/clone and put our version in an `Arc`, so all you see now are clone/drop on `Arc`s.
I should be able to run all benchmarks with the version type wrapped in `Arc`s. I wonder if that will give me graphs that look more like yours.
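For concreteness, a minimal sketch of what such a wrapper could look like; the `ArcVersion` name and `String` payload are illustrative, not the project's actual types. Cloning becomes an atomic refcount bump instead of a deep copy, which is what turns the clone/drop cost into `Arc` traffic, as in the `arc_*` rows of the table further down.

```rust
use std::sync::Arc;

/// Hypothetical version type for `arc_*`-style benchmark variants:
/// the payload lives behind an `Arc`, so `clone` is an atomic
/// refcount increment rather than a deep copy.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct ArcVersion(Arc<String>);

fn main() {
    let v = ArcVersion(Arc::new("1.2.3".to_owned()));
    let w = v.clone(); // cheap: bumps the reference count
    assert_eq!(v, w);
    assert_eq!(Arc::strong_count(&v.0), 2);
}
```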
I added two more small optimizations on top for your review: https://github.com/pubgrub-rs/pubgrub/tree/jf/177-review
No difference in performance when cherry-picking the three commits on top:
```console
$ taskset -c 0 hyperfine --warmup 1 --runs 3 "./main-uv pip compile scripts/requirements/bio_embeddings.in" "./branch-uv pip compile scripts/requirements/bio_embeddings.in"
Benchmark 1: ./main-uv pip compile scripts/requirements/bio_embeddings.in
  Time (mean ± σ):      2.821 s ±  0.007 s    [User: 2.510 s, System: 0.303 s]
  Range (min … max):    2.813 s …  2.827 s    3 runs

Benchmark 2: ./branch-uv pip compile scripts/requirements/bio_embeddings.in
  Time (mean ± σ):      2.826 s ±  0.001 s    [User: 2.527 s, System: 0.291 s]
  Range (min … max):    2.825 s …  2.827 s    3 runs

Summary
  ./main-uv pip compile scripts/requirements/bio_embeddings.in ran
    1.00 ± 0.00 times faster than ./branch-uv pip compile scripts/requirements/bio_embeddings.in
```
| time | Dev | PR | % |
|---|---|---|---|
| sudoku | 44s | 22s | -50% |
| elm | 274ms | 249.07ms | -9% |
| large | 11ms | 9.8821ms | -7% |
| slow_135 | 19ms | 6.6884ms | -63% |
| zuse | 311ms | 285.94ms | -6% |
| hobo* | 985.2s | 803.6s | -18% |
| arc_elm | 381.19ms | 318.25ms | -16% |
| arc_large | 15.674ms | 14.214ms | -9% |
| arc_slow_135 | 36.262ms | 14.885ms | -59% |
| arc_zuse | 387.93ms | 340.78ms | -12% |
In uv, we have a very slow case with bio_embeddings, which includes boto3 and botocore. In the fully cached case, the great majority of the time is spent in range operations with large ranges, due to the many versions. I replaced some range operations implemented by their mathematical definitions with optimized versions. This PR is based on profiling my test case, `bio_embeddings[all]` on Python 3.12.

To reproduce, `git clone https://github.com/astral-sh/uv` and replace the pubgrub dependency with https://github.com/astral-sh/pubgrub/pull/25, build with `cargo build --profile profiling`, and run with `samply record`; I ran with `--rate 5000`.
Flamegraph before: 12s
Flamegraph after: 3s
The atomic/drop operations are clone and drop for our version type, which is an `Arc`. This scenario is slow because we try a lot of versions:

(taskset decreases the variance for a minimal slowdown; we're bound on single-threaded pubgrub in this benchmark and not on multithreaded IO/parsing.)
I'd say this closes #135.
The combination of the lack of `Ord` on `Bound`, and `Unbounded` meaning `-inf` or `inf` depending on the (untyped) context, makes implementing these manually tedious; I wonder if we could add some abstraction? (See the sketch at the end of this description.)

The first two commits of this PR can be split out into separate PRs. From main over each of the three commits:
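As promised above, a minimal sketch of that tedium; the helper is hypothetical and not part of this PR. Even the simplest question, "does this start bound lie before this end bound?", has to spell out the context by hand, because `Bound` has no `Ord` and `Unbounded` means `-inf` as a start bound but `+inf` as an end bound.

```rust
use std::ops::Bound::{self, Excluded, Included, Unbounded};

/// Returns whether the interval (start, end) is non-empty, assuming a dense
/// version order (between any two distinct versions lies a third). Every arm
/// must encode the start/end context manually.
fn start_before_end<V: Ord>(start: &Bound<V>, end: &Bound<V>) -> bool {
    match (start, end) {
        // An unbounded start (-inf) or end (+inf) can't invert the interval.
        (Unbounded, _) | (_, Unbounded) => true,
        // Both inclusive: [s, e] is non-empty iff s <= e.
        (Included(s), Included(e)) => s <= e,
        // Any exclusive side needs a strict inequality.
        (Included(s), Excluded(e))
        | (Excluded(s), Included(e))
        | (Excluded(s), Excluded(e)) => s < e,
    }
}

fn main() {
    assert!(start_before_end(&Included(1), &Excluded(2)));
    assert!(!start_before_end(&Excluded(2), &Excluded(2)));
    assert!(start_before_end::<u32>(&Unbounded, &Excluded(0)));
}
```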