Towards state-of-the-art multi-scalar-muls

This is a mirror to the plan I laid out in the Discord #collaborate channel.

Goal:

Be within 15% of the fastest CPU MSMs.

Out-of-scope:

Rayon performance/multithreading performance, I have yet to look at rayon multithreading implementation. For efficient reduction, we would in particular need futures support in Rayon to enable latency-hiding techniques. The final solution will expose parallelism at 3 levels to benefit from CPUs with very high core count.
verification, it's not the main bottleneck, with very involved steps reworking the towering, frobenius for final exponentiation, Miller loop, you can get a 12x perf improvement on verification between the old zcash/bn implementation and an optimized one (see https://github.com/metacraft-labs/DendrETH/issues/115#issuecomment-1537197443, nim-bncurve is a 1-1 port of zcash/bn). I can detail those if wanted.
solve some inherent low-level inefficiencies (account for those potential 15% missed perf):
- field operations return a field element instead of in-place modification. This might be very efficient if the compiler can do RVO (return value optimization) and copy elision, or can be problematic (unsure the compiler can do this with assembly)
- There is significant efficiency gains if we can avoid heap allocation, in particular in batch affine inversion, to avoid stressing the allocator. unsure if alloca(+noinline) is possible in Rust

Reference PRs and benchmark of the changes:

Single-threaded, vs Gnark and BLST: https://github.com/mratsim/constantine/pull/220
Multi-threaded, 8 cores vs Arkworks, Barretenberg, Bellman, Bellman CE, BLST, Gnark: https://github.com/mratsim/constantine/pull/226#issuecomment-1502307328
Multi-threaded, 18 cores vs Barretenberg, Bellman CE, blstrs, Gnark, note that the linked benchmark perf was improved by 50% by changing the multithreaded barrier algorithm so multithreaded performance is highly dependent on the threadpool/rayon implementation https://github.com/mratsim/constantine/pull/227#issuecomment-1508157990
Multithreaded, pasta curves vs https://github.com/supranational/pasta-msm: \ https://github.com/mratsim/constantine/pull/243#issuecomment-1575614261

The steps

[x] (mandatory +35% perf) Accelerate field multiplication.
- Done in https://github.com/privacy-scaling-explorations/halo2curves/pull/49
- Do we need the same in https://github.com/privacy-scaling-explorations/pairing?
[ ] (optional +10% perf) implement extended Jacobian coordinates. Their mixed-addition formula is 10 field mul instead of 11 field mul (see also this march paper from Gnark, Table 3, https://tches.iacr.org/index.php/TCHES/article/download/10972/10279 and BLST codebase). used as a fallback:
- When the MSM is too small for affine addition
- When too many points need to be accumulated in the same bucket and our temp buffer overflows.
[x] (recommended) implement fast inversion. Note that the algorithms initially recommended in https://github.com/privacy-scaling-explorations/halo2curves/issues/28 are asymptotically 4x slower than state-of-the-art (Bernstein-Yang and Pornin's bingcd) and in practice 5x slower. This allows aggregating just ~32 points instead of ~160 points minimum to amortize the cost of inversions when doing batch affine.
[x] (mandatory, 1/2 buckets memory requirement) implement signed digit-recoding. wNAF that everyone uses is problematic, it requires precomputation and allocation of all the recoded scalar ahead of times. This is extra memory usage, extra latency, hard to port to GPU. wNAF has 2 draws: being signed recoding so using 2x less space than unsigned, and optimally maximizing the chance of adding zero, i.e. "no cost". But in MSM that second part basically never happens because the number of bits we look at at once grows up to c=16 (or k=16 in Halo2 usual denomiation). Instead Booth encoding can be used for to provide 1, with no precomputation, drastically reducing memory usage.
[ ] (mandatory, 70% speedup) Architecture MSM to use affine coordinate (6M asymptotic cost with Montgomery's inversion trick). I saw that @Brechtpd in https://github.com/privacy-scaling-explorations/halo2/pull/40 and @kilic in https://github.com/privacy-scaling-explorations/halo2curves/pull/29 are using Barretenberg approach with a radix sort. I think the better high-level approach is:
- the scheduler approach from\ FPGA Acceleration of Multi-Scalar Multiplication: CycloneMSM Kaveh Aasaraai, Don Beaver, Emanuele Cesena, Rahul Maganti, Nicolas Stalder and Javier Varela, 2022 https://eprint.iacr.org/2022/1396.pdf Takeaways: Scheduler idea + collision probability analysis
- combined with parallelism over buckets from Barretenberg's TurboPlonk (without radix sort) or \ cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs\ Tao Lu, Chengkun Wei, Ruijing Yu, Chaochao Chen, Wenjing Fang, Lei Wang, Zeke Wang and Wenzhi Chen, 2022\ https://eprint.iacr.org/2022/1321.pdf Takeaway: Parallelism over buckets by considering MSM as a sparse Matrix-Vector operation.
- and Constantine specific memory and memory-bandwidth optimizations:
  - on-the-fly signed digit recoding with zero allocation
  - no allocation/sorting and very limited temp memory to map points to buckets: 8 bytes per point, max ~600-700 points, per thread.
The main issues of Barrentenberg approach is lots of precomputation, memory usage scales linearly (2x or 3x) with input size, a complex program flow, hard to port to GPU, a lot of cache misses which will necessarily cause bandwidth problems. See detailed analysis at https://gist.github.com/mratsim/27c78c71fd423f731615a91d237162c3#file-multi-scalar-mul-md

see writeup on an alternative: https://github.com/mratsim/constantine/blob/master/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim Main takeaway is memory usage scales with the number of threads, like 4kB per extra thread, instead of number of input points.
[ ] (optional, up to 40% speedup below 1024 points, 0% above) Endomorphism acceleration. In my benchmarks, the overhead of endomorphism recoding is not worth it after 2^10 = 1024 points, compared to directly processing those. see my MSMs strategies: https://github.com/mratsim/constantine/blob/151f284/constantine/math/elliptic/ec_multi_scalar_mul_parallel.nim#L532-L565

on-the-fly signed digit recoding with zero allocation

How do you achieve recording on the fly while we need to iterate bucket indexes (ie slices of scalar) in reverse order?

Using Booth encoding, you only need to see the window and the previous bit. https://github.com/mratsim/constantine/blob/c97036d1df09b25afaddc040aed468a80df0c8d7/constantine/math/arithmetic/bigints.nim#L792-L829


func signedWindowEncoding(digit: SecretWord, bitsize: static int): tuple[val: SecretWord, neg: SecretBool] {.inline.} =
  ## Get the signed window encoding for `digit`
  ##
  ## This uses the fact that 999 = 100 - 1
  ## It replaces string of binary 1 with 1...-1
  ## i.e. 0111 becomes 1 0 0 -1
  ##
  ## This looks at [bitᵢ₊ₙ..bitᵢ | bitᵢ₋₁]
  ## and encodes   [bitᵢ₊ₙ..bitᵢ]
  ##
  ## Notes:
  ##   - This is not a minimum weight encoding unlike NAF
  ##   - Due to constant-time requirement in scalar multiplication
  ##     or bucketing large window in multi-scalar-multiplication
  ##     minimum weight encoding might not lead to saving operations
  ##   - Unlike NAF and wNAF encoding, there is no carry to propagate
  ##     hence this is suitable for parallelization without encoding precomputation
  ##     and for GPUs
  ##   - Implementation uses Booth encoding
  result.neg = SecretBool(digit shr bitsize)

  let negMask = -SecretWord(result.neg)
  const valMask = SecretWord((1 shl bitsize) - 1)

  let encode = (digit + One) shr 1            # Lookup bitᵢ₋₁, flip series of 1's
  result.val = (encode + negMask) xor negMask # absolute value
  result.val = result.val and valMask

func getSignedFullWindowAt*(a: BigInt, bitIndex: int, windowSize: static int): tuple[val: SecretWord, neg: SecretBool] {.inline.} =
  ## Access a signed window of `a` of size bitsize
  ## Returns a signed encoding.
  ##
  ## The result is `windowSize` bits at a time.
  ##
  ## bitIndex != 0 and bitIndex mod windowSize == 0
  debug: doAssert (bitIndex != 0) and (bitIndex mod windowSize) == 0
  let digit = a.getWindowAt(bitIndex-1, windowSize+1) # get the bit on the right of the window for Booth encoding
  return digit.signedWindowEncoding(windowSize)

Usually NAF has the following advantages:

The main appeal of windowed NAF is reducing the average Hamming Weight of random scalar from 1/2 (random) to 1/3 (NAF) or even less depending on window size. Meaning there are more zeros so we skip additions.
Negation is cheap for elliptic curves, so with a signed digit representation you don't need to store the additive inverse. Hence you can divide by 2 the precomputed table requirement. Hence a window of size 5 would need 2⁴ = 16 elements instead of 32 if unsigned.

However the main appeal of window NAF is not applicable to MSM because as your window grow, it’s exponentially more likely that at least a bit is set and you will have an addition anyway.

With on-the-fly recoding, you avoid allocating millions of recoded scalar. It's also GPU friendly.

privacy-scaling-explorations / halo2curves

Towards state-of-the-art multi-scalar-muls #163

Reference PRs and benchmark of the changes:

The steps