Implement CIOS for ARM F::mul

sragss commented 10 months ago

Implements CIOS for Montgomery 256-bit field multiplication. Specifically the fast variant (algorithm 2). These changes are particularly relevant on ARM where we do not have x86 / BMI2 / AVX512 and the associated assembly backend. There does not appear to be a NEON equivalent for MULX / ADOX / ADCX.

Accelerates 7-15%, roughly reaching parity with Arkworks on ARM. See sragss/speedy-fields to benchmark. cargo run --release is more consistent than cargo bench due to matching elements.

Currently bn256::test::test_consistent_hash_to_curve fails due to an attempt to multiply a number larger than both the montgomery radix and the field modulus. This results in a 1-bit difference between implementations. Specifically:

let lhs: [u64; 4] = [
        0x338ea436d0dbaca2,
        0x4439f510abdd1b04,
        0x5ee775ab94064231,
        0xf04dfd6e270d4842
    ];

I'm not sure if this was intended to be supported. Other libraries (such as Arkworks) do not handle numbers outside of the range of the modulus. Any elements created via F::new() undergo field multiplication (by R2) where they're brought into the proper range.

This can be handled by checking if the inputs are greater than the field modulus and subtracting in advance but there's some non-zero cost (2-5%) to performing the check.

                    let mut lhs = self.0;
                    if bigint_geq(&lhs, &$modulus.0) {
                        let mut borrow = 0;
                        (lhs[0], borrow) = sbb(lhs[0], $modulus.0[0], borrow);
                        (lhs[1], borrow) = sbb(lhs[1], $modulus.0[1], borrow);
                        (lhs[2], borrow) = sbb(lhs[2], $modulus.0[2], borrow);
                        (lhs[3], borrow) = sbb(lhs[3], $modulus.0[3], borrow);
                    }

duguorong009 commented 9 months ago

Thanks for the PR! @sragss

I think we should add the input range check in the mul function as you suggested. Even though it comes with some cost, I believe it is the only way to introduce the CIOS algo to the repo, at the moment.

(Normally, I use the F::from_raw function for input range check.)

What do you think? @CPerezz @davidnevadoc

sragss commented 9 months ago

Fixed comments @davidnevadoc – Let me know what you think about the operands outside the normal range.

We can add the reduction before, or can update the bn256::test::test_consistent_hash_to_curve test which uses the hash_to_curve outside the modulus range. I suspect this test has bad data due to random generation by a script (~25% chance of generating out of the range) rather than intentional usage.

davidnevadoc commented 9 months ago

In regards to the outside of range operands, the approach I like is controlling the ways in which we create field elements and then assuming they are in the appropriate ranges in all operations. For this concrete case, the problem was that the multiplication in from_u512 non-asm version was performing a multiplication on an unchecked field element that could be any 256 bits. https://github.com/privacy-scaling-explorations/halo2curves/blob/3c43d3c0c3b697c411cfd07f026798998f55073a/src/derive/field.rs#L159

The asm version on the other hand, was using Self::montgomery_reduction to handle this conversion from bytes to a valid field element. I think this should be the approach for the non-asm version as well. I propose a solution along this lines: https://github.com/privacy-scaling-explorations/halo2curves/commit/359619aabd967a820ebda9a3aee3db8aeda12139

I have modified the montgomery_reduction function to have a non-asm version that performs the check @sragss suggested and then multiplies by the appropriate R.

Let me know what you think and feel free to add the change if you like it.

sragss commented 9 months ago

Agree with your approach – added those changes. All tests are passing now and we don't have the slowdown from adding the check to mul.

privacy-scaling-explorations / halo2curves

Implement CIOS for ARM F::mul #134