Closed florian1345 closed 5 months ago
Released as part of v0.4.13. This time I was able to see an improvement with my own benchmarks! Thank you!
@florian1345, you may be interested to know that after I merging your pull requests, I asked the maintainer of bigint-benchmark-rs to rerun their benchmarks. Malachite performed noticeably better than before. Before, the ratio of Malachite times to rug
times (rug
is a wrapper over GMP) was
e 100k | e 1m | e 10m | fib 10m | fib 100m | fib_hex 100m |
---|---|---|---|---|---|
3.07 | 3.61 | 3.69 | 4.15 | 4.33 | 3.82 |
and after your improvements it's
e 100k | e 1m | e 10m | fib 10m | fib 100m | fib_hex 100m |
---|---|---|---|---|---|
1.33 | 1.3 | 1.323 | 1.789 | 1.768 | 1.623 |
In other words, Malachite used to take 3 or 4 times as much time as rug/GMP, and now takes less than twice as much time. Thanks again for your changes, and please submit more if you think of any!
This is another potential optimization I found while profiling an application focused on large GCDs. For me, it yields a performance improvement of approximately 6% for GCDs on the order of ~1 MiB operands. Interestingly, multiplication at a similar size profits somewhat less (only 3-4% IIRC). Numbers are on
stable-x86_64-pc-windows-msvc
compiled for target-CPUx86-64-v3
. This time I benched again using WSL withstable-x86_64-unknown-linux-gnu
and got reasonably similar results - still running on the same Windows host machine, though.The optimization is done in two parts: First, I made
limbs_slice_add_mul_limb_same_length_in_place_left
more similar to the corresponding GMP implementation, which yielded a performance improvement of approx. 2.6%. The rest was obtained by processing two limbs from theys
slice inlimbs_mul_greater_to_out_basecase
at the same time. GMP has platform-specific assembly for this, but even just implementing it regularly seems to improve performance (perhaps better locality?).