pubgrub-rs / pubgrub

PubGrub version solving algorithm implemented in Rust
https://pubgrub-rs.github.io/pubgrub/pubgrub/
Mozilla Public License 2.0

perf: specialize range operations for better performance #177

Closed · konstin closed this 3 months ago

konstin commented 5 months ago

In uv, we have a very slow case with bio_embeddings, which includes boto3 and botocore. In the fully cached case, the great majority of time is spent in range operations on large ranges, due to the many versions. I replaced some range operations that were implemented by their mathematical definitions with optimized versions. This PR is based on profiling my test case, bio_embeddings[all] on Python 3.12.

To reproduce: git clone https://github.com/astral-sh/uv, replace the pubgrub dependency with https://github.com/astral-sh/pubgrub/pull/25, build with cargo build --profile profiling, and run with samply record; I ran with --rate 5000.

before: 12s

[flamegraph screenshot]

after: 3s

[flamegraph screenshot]

The atomic increment/drop operations are the clones and drops of our version type, which is wrapped in an Arc.
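For context, a minimal sketch of that setup, with a simplified stand-in for uv's actual version type (the Arc<str> payload is illustrative):

```rust
use std::sync::Arc;

// Simplified stand-in for uv's version type: cloning bumps a reference
// count (one atomic increment) instead of deep-copying, and dropping a
// handle is an atomic decrement. Those atomics are what show up in the
// profile as the clone/drop cost.
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Hash, Debug)]
struct Version(Arc<str>);

fn main() {
    let v = Version(Arc::from("1.2.3"));
    let w = v.clone(); // no allocation, just an atomic increment
    assert_eq!(v, w);
} // both handles dropped here: two atomic decrements, one deallocation
```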

This scenario is slow because we try a lot of versions:

Tried 101329 versions: pytest 2628, botocore 2627, blis 1314, boto3 1314, certifi 1314, charset-normalizer 1314, click 1314, conllu 1314, contourpy 1314, cycler 1314, cymem 1314, editdistance 1314, flaky 1314, flask 1314, flask-cors 1314, fonttools 1314, fsspec 1314, ftfy 1314, gevent 1314, huggingface-hub 1314, idna 1314, jinja2 1314, joblib 1314, jsonnet 1314, jsonpickle 1314, kiwisolver 1314, murmurhash 1314, networkx 1314, nltk 1314, numba 1314, numpydoc 1314, nvidia-cublas-cu12 1314, nvidia-cuda-cupti-cu12 1314, nvidia-cuda-nvrtc-cu12 1314, nvidia-cuda-runtime-cu12 1314, nvidia-cudnn-cu12 1314, nvidia-cufft-cu12 1314, nvidia-curand-cu12 1314, nvidia-cusolver-cu12 1314, nvidia-cusparse-cu12 1314, nvidia-nccl-cu12 1314, nvidia-nvtx-cu12 1314, overrides 1314, packaging 1314, parsimonious 1314, plac 1314, preshed 1314, protobuf 1314, pynndescent 1314, pyparsing 1314, python-dateutil 1314, pytorch-pretrained-bert 1314, pytorch-transformers 1314, pyyaml 1314, regex 1314, responses 1314, ruamel-yaml-clib 1314, safetensors 1314, sentencepiece 1314, six 1314, spacy 1314, sqlparse 1314, srsly 1314, sympy 1314, tenacity 1314, tensorboardx 1314, thinc 1314, threadpoolctl 1314, tokenizers 1314, typing-extensions 1314, tzdata 1314, unidecode 1314, wasabi 1314, word2number 1314, wrapt 1314, allennlp 27, torch 20, bio-embeddings[all] 10, bio-embeddings 4, biopython 4, gensim 4, h5py 4, matplotlib 3, numpy 3, pandas 3, plotly 3, ruamel-yaml 3, scikit-learn 3, scipy 3, Python 2, appdirs 2, importlib-metadata 2, jax-unirep 2, jaxlib 2, lock 2, alabaster 1, babel 1, bio-embeddings-allennlp 1, bio-embeddings-bepler 1, bio-embeddings-cpcprot 1, bio-embeddings-esm 1, bio-embeddings-plus 1, bio-embeddings-tape-proteins 1, blinker 1, docutils 1, filelock 1, greenlet 1, imagesize 1, iniconfig 1, itsdangerous 1, jmespath 1, llvmlite 1, markupsafe 1, mpmath 1, nvidia-nvjitlink-cu12 1, pillow 1, pluggy 1, pygments 1, pytz 1, requests 1, root 1, s3transfer 1, setuptools 1, smart-open 1, snowballstemmer 1, sphinx 1, sphinxcontrib-applehelp 1, sphinxcontrib-devhelp 1, sphinxcontrib-htmlhelp 1, sphinxcontrib-jsmath 1, sphinxcontrib-qthelp 1, sphinxcontrib-serializinghtml 1, tabulate 1, tqdm 1, transformers 1, umap-learn 1, urllib3 1, wcwidth 1, werkzeug 1, zope-event 1, zope-interface 1

(taskset decreases the variance at the cost of a minimal slowdown; in this benchmark we're bound on single-threaded pubgrub, not on multithreaded io/parsing.)

$  taskset -c 0 hyperfine --warmup 1 "../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in"  "../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in" 
Benchmark 1: ../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in
  Time (mean ± σ):     12.321 s ±  0.064 s    [User: 12.014 s, System: 0.300 s]
  Range (min … max):   12.224 s … 12.406 s    10 runs

Benchmark 2: ../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in
  Time (mean ± σ):      3.109 s ±  0.004 s    [User: 2.782 s, System: 0.321 s]
  Range (min … max):    3.103 s …  3.116 s    10 runs

Summary
  ../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in ran
    3.96 ± 0.02 times faster than ../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in

I'd say this closes #135.


The combination of Bound not implementing Ord and of Unbounded meaning -inf or +inf depending on the (untyped) context makes implementing these manually tedious; I wonder if we could add some abstraction?
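To illustrate the tedium, here is a hypothetical helper (not part of this PR) for comparing two lower bounds, where Unbounded means -inf; a mirror-image helper would be needed for upper bounds, where Unbounded means +inf:

```rust
use std::cmp::Ordering;
use std::ops::Bound::{self, Excluded, Included, Unbounded};

/// Compare two *lower* bounds, treating `Unbounded` as negative infinity.
/// `Bound` itself has no `Ord` impl, because the right order depends on
/// whether it is used as a lower or an upper bound.
fn cmp_lower<V: Ord>(a: &Bound<V>, b: &Bound<V>) -> Ordering {
    match (a, b) {
        (Unbounded, Unbounded) => Ordering::Equal,
        (Unbounded, _) => Ordering::Less,
        (_, Unbounded) => Ordering::Greater,
        (Included(x), Included(y)) | (Excluded(x), Excluded(y)) => x.cmp(y),
        // At the same version, an inclusive lower bound starts earlier
        // than an exclusive one.
        (Included(x), Excluded(y)) => x.cmp(y).then(Ordering::Less),
        (Excluded(x), Included(y)) => x.cmp(y).then(Ordering::Greater),
    }
}

fn main() {
    let unbounded: Bound<u32> = Unbounded;
    assert_eq!(cmp_lower(&unbounded, &Included(0)), Ordering::Less);
    assert_eq!(cmp_lower(&Included(1), &Excluded(1)), Ordering::Less);
}
```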


The first two commits of this PR can be split out into separate PRs. Flamegraphs going from main over each of the three commits:

[flamegraph screenshots: main, then after each of the three commits]

Eh2406 commented 5 months ago

Just dropping a back link to https://github.com/pubgrub-rs/pubgrub/pull/59

konstin commented 5 months ago

The new version is slightly slower but passes the tests:

$ hyperfine --warmup 1 --runs 3 "target/profiling/main bio_embeddings.in"  "target/profiling/branch bio_embeddings.in"
Benchmark 1: target/profiling/main bio_embeddings.in
  Time (mean ± σ):     12.007 s ±  0.035 s    [User: 11.654 s, System: 0.301 s]
  Range (min … max):   11.977 s … 12.045 s    3 runs

Benchmark 2: target/profiling/branch bio_embeddings.in
  Time (mean ± σ):      6.961 s ±  0.017 s    [User: 6.565 s, System: 0.323 s]
  Range (min … max):    6.942 s …  6.974 s    3 runs

Summary
  target/profiling/branch bio_embeddings.in ran
    1.72 ± 0.01 times faster than target/profiling/main bio_embeddings.in

It spends 45% of the time in union, mostly Arc drops, adds, and Vec pushes.

Eh2406 commented 3 months ago

I had trouble thinking useful thoughts looking at the full diff all at once, so I pulled it into my editor and started a new branch, adding each new piece of functionality one at a time in separate commits, then adding the tests and cleaning up the nits I saw before moving on to the next piece. The result is https://github.com/pubgrub-rs/pubgrub/tree/jf/177-review; consider it a "yes and" to the wonderful work in this PR. But I'm not sure, on a practical level, how to take the next step. Do you want to pull those changes into this PR? Or should I start a new one?

Anyway, I'm gonna go run some benchmarks.

konstin commented 3 months ago

Thanks! I force-pushed your commits to this branch (there's a backup branch too, just in case) and added a clippy commit on top.

Eh2406 commented 3 months ago

| benchmark | Dev      | PR     | %    |
|-----------|----------|--------|------|
| sudoku    | 44 s     | 30 s   | 32%  |
| elm       | 274 ms   | 277 ms | ---  |
| large     | 11 ms    | 10 ms  | 9%   |
| slow_135  | 19 ms    | 6.3 ms | 66%  |
| zuse      | 311 ms   | 297 ms | 4.5% |
| hobo*     | 1045 s   | 851 s  | 18%  |
zanieb commented 3 months ago

@mpizenberg does the "real world" improvement of 12s to 3s on the bio_embeddings requirements in uv make this feel more worthwhile? Does that suggest that there is a missing niche in the benchmarks for the project (i.e. that Jacob ran)?

mpizenberg commented 3 months ago

> @mpizenberg does the "real world" improvement of 12s to 3s on the bio_embeddings requirements in uv make this feel more worthwhile? Does that suggest that there is a missing niche in the benchmarks for the project (i.e. that Jacob ran)?

Indeed that's a substantial difference!

mpizenberg commented 3 months ago

> @mpizenberg does the "real world" improvement of 12s to 3s on the bio_embeddings

Do you know if some of the inlinings produce most of the improvements, or if they all make small improvements that total to the big difference?

konstin commented 3 months ago

> Do you know if some of the inlinings produce most of the improvements, or if they all make small improvements that total to the big difference?

This work is entirely profiling- and benchmark-driven. I started off with two smaller problems: Range::contains being slower than needed, and excessive clone/drop in prior_cause. After the first two commits, you can see in the flamegraphs in the PR description that it's almost entirely union and intersection calls, slow due to excessive clone/drop (cor... in the flamegraphs). We had previously seen the slow drop/clone and put our version in an Arc, so all you see now are clones/drops on Arcs.
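To make the Range::contains point concrete, here is a hedged sketch of the kind of specialization involved; the segment representation and names are illustrative, not pubgrub's actual internals:

```rust
use std::ops::Bound::{self, Excluded, Included, Unbounded};

// Illustrative range type: a sorted list of disjoint segments, each a
// (start, end) pair of bounds, similar in spirit to pubgrub's `Range`.
struct Range<V> {
    segments: Vec<(Bound<V>, Bound<V>)>,
}

impl<V: Ord> Range<V> {
    // Direct scan instead of the "mathematical definition" (build a
    // singleton range and intersect): no allocation, and an early exit
    // as soon as a segment starts past `v`.
    fn contains(&self, v: &V) -> bool {
        for (start, end) in &self.segments {
            let after_start = match start {
                Included(s) => v >= s,
                Excluded(s) => v > s,
                Unbounded => true,
            };
            if !after_start {
                // Segments are sorted, so no later segment can match.
                return false;
            }
            let before_end = match end {
                Included(e) => v <= e,
                Excluded(e) => v < e,
                Unbounded => true,
            };
            if before_end {
                return true;
            }
        }
        false
    }
}

fn main() {
    let r = Range {
        segments: vec![(Included(1u32), Excluded(3)), (Included(5), Unbounded)],
    };
    assert!(r.contains(&2));
    assert!(!r.contains(&4));
    assert!(r.contains(&100));
}
```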

satisfier was calling intersection a lot, but it wasn't interested in the intersection itself; it just wanted to know whether the terms are disjoint, so I special-cased this into is_disjoint. The union calls were computing intersection once and complement three times, so I unrolled union into a version that clones as little as possible. I don't have numbers on hand for the individual changes, but each made a noticeable difference.
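As a sketch of the shape of those two changes (a hypothetical trait, not pubgrub's actual API): the mathematical definition of union goes through De Morgan's law, allocating a fresh set at every step, while a dedicated is_disjoint can answer its question without materializing any intermediate set.

```rust
// Hypothetical set abstraction, for illustration only.
trait VersionSet: Sized {
    fn complement(&self) -> Self;
    fn intersection(&self, other: &Self) -> Self;

    /// The "mathematical definition": union via De Morgan's law.
    /// Three `complement`s and one `intersection`, each allocating a
    /// fresh set. A specialized `union` instead walks both segment
    /// lists once and clones as little as possible.
    fn union(&self, other: &Self) -> Self {
        self.complement()
            .intersection(&other.complement())
            .complement()
    }

    /// Specialized query: "is the intersection empty?" can be answered
    /// by scanning both sets without building the intersection at all.
    fn is_disjoint(&self, other: &Self) -> bool;
}
```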

Eh2406 commented 3 months ago

> I was previously skeptical of this kind of inlining degrading the expressiveness of the computations.

Yesterday I went back and looked at my code from when I originally proposed this. You are right to reject that code. It was inscrutable. The code Konstin has now is much better.

> does the "real world" improvement of 12s to 3s on the bio_embeddings requirements in uv make this feel more worthwhile? Does that suggest that there is a missing niche in the benchmarks for the project (i.e. that Jacob ran)?

There is definitely a gap in the benchmarking infrastructure I have available: I can't seem to get a benchmark that has the performance characteristics of your real-world data. Help on this would definitely be appreciated. The closest I have is slow_135, which is not nearly as slow as your original report but has a similarly shaped profile. The 66% improvement for that benchmark is nicely similar to the 75% you're reporting.

> We had previously seen the slow drop/clone and put our version in an Arc, so all you see now are clone/drop on Arcs.

I should be able to run all the benchmarks with the version type wrapped in an Arc. I wonder if that will give me graphs that look more like yours.

Eh2406 commented 3 months ago

I added two more small optimizations on top for your review: https://github.com/pubgrub-rs/pubgrub/tree/jf/177-review

konstin commented 3 months ago

No difference in performance when cherry-picking the three commits on top:

$ taskset -c 0 hyperfine --warmup 1 --runs 3 "./main-uv pip compile scripts/requirements/bio_embeddings.in"  "./branch-uv pip compile scripts/requirements/bio_embeddings.in" 
Benchmark 1: ./main-uv pip compile scripts/requirements/bio_embeddings.in
  Time (mean ± σ):      2.821 s ±  0.007 s    [User: 2.510 s, System: 0.303 s]
  Range (min … max):    2.813 s …  2.827 s    3 runs

Benchmark 2: ./branch-uv pip compile scripts/requirements/bio_embeddings.in
  Time (mean ± σ):      2.826 s ±  0.001 s    [User: 2.527 s, System: 0.291 s]
  Range (min … max):    2.825 s …  2.827 s    3 runs

Summary
  ./main-uv pip compile scripts/requirements/bio_embeddings.in ran
    1.00 ± 0.00 times faster than ./branch-uv pip compile scripts/requirements/bio_embeddings.in

Eh2406 commented 3 months ago

| benchmark    | Dev       | PR        | %    |
|--------------|-----------|-----------|------|
| sudoku       | 44 s      | 22 s      | -50% |
| elm          | 274 ms    | 249.07 ms | -9%  |
| large        | 11 ms     | 9.8821 ms | -7%  |
| slow_135     | 19 ms     | 6.6884 ms | -63% |
| zuse         | 311 ms    | 285.94 ms | -6%  |
| hobo*        | 985.2 s   | 803.6 s   | -18% |
| arc_elm      | 381.19 ms | 318.25 ms | -16% |
| arc_large    | 15.674 ms | 14.214 ms | -9%  |
| arc_slow_135 | 36.262 ms | 14.885 ms | -59% |
| arc_zuse     | 387.93 ms | 340.78 ms | -12% |