NOTE: CI is broken, because I temporarily added a parameter `useZeroWindows` to the MSM procs.

This implements the same optimization as done in https://github.com/privacy-scaling-explorations/halo2curves/pull/168 for the single-threaded MSM implementations.
Essentially, if all coefficients `a_i` are smaller than some value `x < p` (with `p` the prime field order), their binary representations will have leading zero bits. In this case, we can skip processing the windows corresponding to these leading zeros in the bucket calculation, since they would not contribute to the final result.
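To make the window-skipping idea concrete, here is a minimal, hedged sketch of a bucket-based MSM in Python. It is not taken from the codebase: plain integer addition stands in for elliptic-curve point addition, and the `skip_zero_windows` flag only plays the role that the `useZeroWindows` parameter plays in the actual procs. The whole optimization boils down to deriving the window count from the largest coefficient's bit length instead of from the field size.

```python
# Hedged sketch of a windowed (Pippenger-style) bucket MSM with the
# "skip leading zero windows" optimization. Plain integer addition stands
# in for group addition, so the result can be checked against sum(a_i * P_i).

def msm(scalars, points, window_bits=4, skip_zero_windows=True, field_bits=255):
    # Window count: derived from the largest coefficient when the
    # optimization is enabled, otherwise from the full field size.
    if skip_zero_windows:
        max_bits = max((s.bit_length() for s in scalars), default=1)
    else:
        max_bits = field_bits
    num_windows = max(1, (max_bits + window_bits - 1) // window_bits)

    acc = 0
    for w in reversed(range(num_windows)):  # most significant window first
        # "Double" the accumulator window_bits times (a shift in this integer model).
        acc <<= window_bits

        # Bucket accumulation: bucket d collects every point whose current
        # window digit equals d; digit 0 contributes nothing and is skipped.
        buckets = [0] * (1 << window_bits)
        for s, p in zip(scalars, points):
            digit = (s >> (w * window_bits)) & ((1 << window_bits) - 1)
            if digit != 0:
                buckets[digit] += p

        # Running-sum trick: adds bucket d exactly d times in linear time.
        running = window_sum = 0
        for b in reversed(buckets[1:]):
            running += b
            window_sum += running
        acc += window_sum
    return acc


if __name__ == "__main__":
    scalars = [3, 17, 42, 2**20 + 5]  # all far below a 255-bit modulus
    points = [7, 11, 13, 19]
    assert msm(scalars, points) == sum(s * p for s, p in zip(scalars, points))
```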
After writing a benchmark comparing the old implementation using all windows against the optimization (and hence adding a new argument, which currently breaks the CI), I noticed that the performance improvements are pretty minor. In fact, even without this optimization our improvements are already massive for inputs with only small coefficients; skipping the leading zero windows only adds a marginal improvement on top of that.

Here is a plot showing the time for an MSM with varying numbers of points and numbers of set bits. It compares the reference implementation (using the `b` prefix for baseline) with the optimized implementation, each split by either using all windows or skipping the leading zero windows. What is actually quite interesting is that for the reference implementation the improvements are generally much larger, pushing it in line with the optimized implementation.

(I ran the bench on my laptop with an i7-8750H.)
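For context on the benchmark inputs: the "number of set bits" axis presumably refers to how many low-order bits the random coefficients are allowed to occupy, which is what produces the leading zero windows in the first place. A tiny sketch of how such inputs could be generated; the names here are illustrative and not taken from the benchmark code:

```python
import random

def bounded_scalars(n, num_bits):
    # Coefficients restricted to their low `num_bits` bits; every window
    # above that is guaranteed to be zero, which is exactly the case the
    # optimization targets.
    return [random.getrandbits(num_bits) for _ in range(n)]

# e.g. one benchmark case: 2**12 points with 32-bit coefficients
scalars = bounded_scalars(2**12, 32)
```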