Potential optimization for `limbs_mul_greater_to_out_basecase`

florian1345 commented 5 months ago

This is another potential optimization I found while profiling an application focused on large GCDs. For me, it yields a performance improvement of approximately 6% for GCDs on the order of ~1 MiB operands. Interestingly, multiplication at a similar size profits somewhat less (only 3-4% IIRC). Numbers are on stable-x86_64-pc-windows-msvc compiled for target-CPU x86-64-v3. This time I benched again using WSL with stable-x86_64-unknown-linux-gnu and got reasonably similar results - still running on the same Windows host machine, though.

The optimization is done in two parts: First, I made limbs_slice_add_mul_limb_same_length_in_place_left more similar to the corresponding GMP implementation, which yielded a performance improvement of approx. 2.6%. The rest was obtained by processing two limbs from the ys slice in limbs_mul_greater_to_out_basecase at the same time. GMP has platform-specific assembly for this, but even just implementing it regularly seems to improve performance (perhaps better locality?).

mhogrefe commented 4 months ago

Released as part of v0.4.13. This time I was able to see an improvement with my own benchmarks! Thank you!

mhogrefe commented 4 months ago

@florian1345, you may be interested to know that after I merging your pull requests, I asked the maintainer of bigint-benchmark-rs to rerun their benchmarks. Malachite performed noticeably better than before. Before, the ratio of Malachite times to rug times (rug is a wrapper over GMP) was

e 100k	e 1m	e 10m	fib 10m	fib 100m	fib_hex 100m
3.07	3.61	3.69	4.15	4.33	3.82

and after your improvements it's

e 100k	e 1m	e 10m	fib 10m	fib 100m	fib_hex 100m
1.33	1.3	1.323	1.789	1.768	1.623

In other words, Malachite used to take 3 or 4 times as much time as rug/GMP, and now takes less than twice as much time. Thanks again for your changes, and please submit more if you think of any!

mhogrefe / malachite

Potential optimization for `limbs_mul_greater_to_out_basecase` #49