mhogrefe / malachite

An arbitrary-precision arithmetic library for Rust.
GNU Lesser General Public License v3.0
453 stars 18 forks source link

Potential optimization for `limbs_mul_greater_to_out_basecase` #49

Closed florian1345 closed 5 months ago

florian1345 commented 5 months ago

This is another potential optimization I found while profiling an application focused on large GCDs. For me, it yields a performance improvement of approximately 6% for GCDs on the order of ~1 MiB operands. Interestingly, multiplication at a similar size profits somewhat less (only 3-4% IIRC). Numbers are on stable-x86_64-pc-windows-msvc compiled for target-CPU x86-64-v3. This time I benched again using WSL with stable-x86_64-unknown-linux-gnu and got reasonably similar results - still running on the same Windows host machine, though.

The optimization is done in two parts: First, I made limbs_slice_add_mul_limb_same_length_in_place_left more similar to the corresponding GMP implementation, which yielded a performance improvement of approx. 2.6%. The rest was obtained by processing two limbs from the ys slice in limbs_mul_greater_to_out_basecase at the same time. GMP has platform-specific assembly for this, but even just implementing it regularly seems to improve performance (perhaps better locality?).

mhogrefe commented 4 months ago

Released as part of v0.4.13. This time I was able to see an improvement with my own benchmarks! Thank you!

mhogrefe commented 4 months ago

@florian1345, you may be interested to know that after I merging your pull requests, I asked the maintainer of bigint-benchmark-rs to rerun their benchmarks. Malachite performed noticeably better than before. Before, the ratio of Malachite times to rug times (rug is a wrapper over GMP) was

e 100k e 1m e 10m fib 10m fib 100m fib_hex 100m
3.07 3.61 3.69 4.15 4.33 3.82

and after your improvements it's

e 100k e 1m e 10m fib 10m fib 100m fib_hex 100m
1.33 1.3 1.323 1.789 1.768 1.623

In other words, Malachite used to take 3 or 4 times as much time as rug/GMP, and now takes less than twice as much time. Thanks again for your changes, and please submit more if you think of any!