mratsim / constantine

Constantine: modular, high-performance, zero-dependency cryptography stack for verifiable computation, proof systems and blockchain protocols.

Endomorphism-accelerated scalar multiplication #32

Closed · mratsim closed this issue 4 years ago

mratsim commented 4 years ago

PR #28 introduced an efficient constant-time scalar multiplication scheme: even though it is constant-time, it is over 10% faster than naive double-and-add (with a scratchspace of 16 curve points).
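
A scratchspace of 16 points corresponds to a fixed 4-bit window. As a rough illustration (a minimal sketch, not the code from PR #28, with the group law abstracted as integer addition modulo a prime so the control flow stays visible), the windowed method trades 2^w precomputed points for one addition per w bits of the scalar:

    def window_scalar_mul(k, P, w, add, dbl, zero):
        """Compute [k]P with a fixed 2^w-entry table (w = 4 gives 16 points)."""
        # Precompute [0]P, [1]P, ..., [2^w - 1]P once.
        table = [zero]
        for _ in range(2 ** w - 1):
            table.append(add(table[-1], P))
        R = zero
        nwindows = max(1, (k.bit_length() + w - 1) // w)
        for i in reversed(range(nwindows)):
            for _ in range(w):                     # w doublings per window
                R = dbl(R)
            digit = (k >> (w * i)) & (2 ** w - 1)  # next w-bit digit of k
            R = add(R, table[digit])               # one table addition per window
        return R

    # Toy stand-in group: integers under addition modulo p (replace with EC point ops).
    p = 2 ** 61 - 1
    add = lambda a, b: (a + b) % p
    dbl = lambda a: (2 * a) % p
    assert window_scalar_mul(123456789, 7, 4, add, dbl, 0) == (123456789 * 7) % p

A constant-time version would additionally scan the whole table with branchless selects instead of indexing it with the secret digit, and would process a fixed number of windows.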

The BN and BLS curve families (among many others) allow using the GLV (Gallant-Lambert-Vanstone) and GLS (Galbraith-Lin-Scott) decompositions to speed up scalar multiplication even further, on G1 (via GLV with a 2-dimensional decomposition) and on G2 (via a combined GLV+GLS decomposition).

This works by decomposing the scalar over a lattice basis derived from the curve's endomorphism and then doing a multi-scalar multiplication.

The main cost of scalar multiplication comes from iterating over each bit of the scalar:

A 2-dimensional GLV decomposition can halve the number of doublings by splitting the scalar into two half-width sub-scalars, as sketched below.
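
To make the halving concrete, here is a minimal sketch (toy code, not Constantine's implementation) of the Straus-Shamir trick, which evaluates [k1]P + [k2]ψ(P) with a single shared doubling chain of length max(|k1|, |k2|) bits. The group is again stood in by integer addition modulo a small r, and the endomorphism ψ by multiplication by a cube root of unity λ:

    def shamir_trick(k1, P, k2, Q, add, dbl, zero):
        """[k1]P + [k2]Q with one shared doubling chain (Straus-Shamir):
        max(k1, k2).bit_length() doublings instead of k.bit_length()."""
        R = zero
        for i in reversed(range(max(k1.bit_length(), k2.bit_length()))):
            R = dbl(R)                 # doublings are shared by both scalars
            if (k1 >> i) & 1:
                R = add(R, P)
            if (k2 >> i) & 1:
                R = add(R, Q)
        return R

    # Toy check in the additive group mod r, with lam playing the endomorphism:
    # k = k1 + k2*lam and P2 = lam*P, so [k]P == [k1]P + [k2]P2.
    r, lam = 13, 3             # toy parameters: lam**3 % r == 1
    P = 5
    k1, k2 = 2, 3
    k = (k1 + k2 * lam) % r
    add = lambda a, b: (a + b) % r
    dbl = lambda a: (2 * a) % r
    P2 = lam * P % r
    assert shamir_trick(k1, P, k2, P2, add, dbl, 0) == k * P % r

The secret-dependent branches here are for readability only; a constant-time implementation would replace them with branchless conditional additions.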


mratsim commented 4 years ago

On scalar decomposition, see also the implementation in Barretenberg: https://github.com/AztecProtocol/barretenberg/blob/091115fd5ef22c56565454ee962312b8516296f6/barretenberg/src/aztec/ecc/fields/field.hpp#L249-L319

    /**
     * For short Weierstrass curves y^2 = x^3 + b over fq, if there exists a cube root of unity mod r,
     * we can take advantage of an endomorphism to decompose a 254-bit scalar into 2 128-bit scalars.
     * \beta = cube root of 1, mod q (q = order of fq)
     * \lambda = cube root of 1, mod r (r = order of fr)
     *
     * For a point P1 = (X, Y), where Y^2 = X^3 + b, we know that
     * the point P2 = (X * \beta, Y) is also a point on the curve.
     * We can represent P2 as a scalar multiplication of P1, where P2 = \lambda * P1
     *
     * For a generic multiplication of P1 by a 254-bit scalar k, we can decompose k
     * into 2 127-bit scalars (k1, k2), such that k = k1 - (k2 * \lambda)
     *
     * We can now represent (k * P1) as (k1 * P1) - (k2 * P2), where P2 = (X * \beta, Y).
     * As k1, k2 have half the bit length of k, we have halved the number of loop iterations
     * of our scalar multiplication algorithm.
     *
     * To find k1, k2, we use the extended Euclidean algorithm to find 4 short scalars [a1, a2], [b1, b2] such that
     * modulus = (a1 * b2) - (b1 * a2)
     * We then compute scalars c1 = round(b2 * k / r), c2 = round(b1 * k / r), where
     * k1 = (c1 * a1) + (c2 * a2), k2 = -((c1 * b1) + (c2 * b2))
     * We pre-compute scalars g1 = (2^256 * b1) / r, g2 = (2^256 * b2) / r, to avoid having to perform long division
     * on 512-bit scalars
     **/
    static void split_into_endomorphism_scalars(const field& k, field& k1, field& k2)
    {
        field input = k.reduce_once();
        // uint64_t lambda_reduction[4] = { 0 };
        // __to_montgomery_form(lambda, lambda_reduction);

        // TODO: these parameters only work for the bn254 coordinate field.
        // Need to shift into Params and calculate correct constants for the subgroup field
        constexpr field endo_g1 = { Params::endo_g1_lo, Params::endo_g1_mid, Params::endo_g1_hi, 0 };

        constexpr field endo_g2 = { Params::endo_g2_lo, Params::endo_g2_mid, 0, 0 };

        constexpr field endo_minus_b1 = { Params::endo_minus_b1_lo, Params::endo_minus_b1_mid, 0, 0 };

        constexpr field endo_b2 = { Params::endo_b2_lo, Params::endo_b2_mid, 0, 0 };

        // compute c1 = (g2 * k) >> 256
        wide_array c1 = endo_g2.mul_512(input);
        // compute c2 = (g1 * k) >> 256
        wide_array c2 = endo_g1.mul_512(input);

        // (the bit shifts are implicit, as we only utilize the high limbs of c1, c2)

        // TODO remove data duplication
        field c1_hi = {
            c1.data[4], c1.data[5], c1.data[6], c1.data[7]
        }; // *(field*)((uintptr_t)(&c1) + (4 * sizeof(uint64_t)));
        field c2_hi = {
            c2.data[4], c2.data[5], c2.data[6], c2.data[7]
        }; // *(field*)((uintptr_t)(&c2) + (4 * sizeof(uint64_t)));

        // compute q1 = c1 * -b1
        wide_array q1 = c1_hi.mul_512(endo_minus_b1);
        // compute q2 = c2 * b2
        wide_array q2 = c2_hi.mul_512(endo_b2);

        // TODO: this doesn't have to be a 512-bit multiply...
        field q1_lo{ q1.data[0], q1.data[1], q1.data[2], q1.data[3] };
        field q2_lo{ q2.data[0], q2.data[1], q2.data[2], q2.data[3] };

        field t1 = (q2_lo - q1_lo).reduce_once();
        field t2 = (t1 * beta() + input).reduce_once();
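        // k1 and k2 are at most 128 bits wide, so only the two low 64-bit limbs are written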
        k2.data[0] = t1.data[0];
        k2.data[1] = t1.data[1];
        k1.data[0] = t2.data[0];
        k1.data[1] = t2.data[1];
    }
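
For reference, here is a runnable sketch of the decomposition the comment above describes: the extended Euclidean algorithm yields the short lattice vectors (a1, b1), (a2, b2), and the rounded products c1, c2 recover half-width k1, k2. This follows the textbook formulation (Guide to Elliptic Curve Cryptography, Algorithm 3.74), which uses the sign convention k = k1 + k2*λ rather than Barretenberg's k = k1 - k2*λ, and the toy parameters r = 13, λ = 3 are illustrative, not BN254's:

    def glv_lattice_basis(r, lam):
        """Short vectors (a1, b1), (a2, b2) with a + b*lam ≡ 0 (mod r),
        found by running the extended Euclidean algorithm on (r, lam)."""
        rem, t = [r, lam], [0, 1]
        while rem[-1] != 0:
            q = rem[-2] // rem[-1]
            rem.append(rem[-2] - q * rem[-1])
            t.append(t[-2] - q * t[-1])
        # greatest index m such that rem[m]^2 >= r
        m = max(i for i, v in enumerate(rem) if v * v >= r)
        a1, b1 = rem[m + 1], -t[m + 1]
        # second vector: the shorter of the two neighbouring candidates
        if rem[m] ** 2 + t[m] ** 2 <= rem[m + 2] ** 2 + t[m + 2] ** 2:
            a2, b2 = rem[m], -t[m]
        else:
            a2, b2 = rem[m + 2], -t[m + 2]
        return (a1, b1), (a2, b2)

    def glv_decompose(k, r, lam):
        """Split k into (k1, k2), each ~half the bits of r, with k ≡ k1 + k2*lam (mod r)."""
        (a1, b1), (a2, b2) = glv_lattice_basis(r, lam)
        c1 = (b2 * k + r // 2) // r    # ≈ round(b2 * k / r), integer arithmetic only
        c2 = (-b1 * k + r // 2) // r   # ≈ round(-b1 * k / r)
        k1 = k - c1 * a1 - c2 * a2
        k2 = -c1 * b1 - c2 * b2
        return k1, k2

    # Toy parameters (NOT BN254): lam = 3 satisfies lam^2 + lam + 1 ≡ 0 (mod 13).
    r, lam = 13, 3
    for k in range(r):
        k1, k2 = glv_decompose(k, r, lam)
        assert (k1 + k2 * lam - k) % r == 0   # decomposition is congruent to k

The congruence k ≡ k1 + k2*λ holds for any choice of c1, c2; the rounding only serves to make k1, k2 short, which is why production code (as in the snippet above) can replace the exact division by a precomputed multiplication by 2^256/b_i followed by a shift.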
mratsim commented 4 years ago

Closed by #44