benchmark P256.pointAdd and ECC.pointAdd

3for commented 5 years ago

I'm benching the P256.hs and Ecc.hs pointAdd function. It's shown that pointAdd in P256 ( 1.131 ms) is much less efficient than in Ecc (3.961 μs), while pointMul in P256 is more efficient than in Ecc. Is there anything wrong with my bench code?

benchECC =
    [ bench "pointAddTwoMuls-baseline"  $ nf run_b (n1, p1, n2, p2)
    , bench "pointAddTwoMuls-optimized" $ nf run_o (n1, p1, n2, p2)
    , bench "pointAdd-ECC" $ nf run_c (p1, p2)
    , bench "pointMul-ECC" $ nf run_d (n1, p2)
    ]
  where run_b (n, p, k, q) = ECC.pointAdd c (ECC.pointMul c n p)
                                            (ECC.pointMul c k q)

        run_o (n, p, k, q) = ECC.pointAddTwoMuls c n p k q
        run_c (p, q) = ECC.pointAdd c p q
        run_d (n, p) = ECC.pointMul c n p

        c  = ECC.getCurveByName ECC.SEC_p256r1
        r1 = 7
        r2 = 11
        -- p1 = ECC.pointBaseMul c r1
        -- p2 = ECC.pointBaseMul c r2
        p1 = ECC.pointBaseMul c n1
        p2 = ECC.pointBaseMul c n2
        n1 = 0x2ba9daf2363b2819e69b34a39cf496c2458a9b2a21505ea9e7b7cbca42dc7435
        n2 = 0xf054a7f60d10b8c2cf847ee90e9e029f8b0e971b09ca5f55c4d49921a11fadc1

benchP256 =
    [ bench "pointAddTwoMuls-P256"  $ nf run_p (n1, s, n2, t)
    , bench "pointAdd-P256"  $ nf run_q (s, t) 
    , bench "pointMul-P256"  $ nf run_t (n1, s) 
    ]
  where run_p (n1, s, n2, t) = P256.pointAdd (P256.pointMul n1 s) (P256.pointMul n2 t)
        run_q (s, t) = P256.pointAdd s t
        run_t (n1, s) = P256.pointMul n1 s

        xS = 0xde2444bebc8d36e682edd27e0f271508617519b3221a8fa0b77cab3989da97c9
        yS = 0xc093ae7ff36e5380fc01a5aad1e66659702de80f53cec576b6350b243042a256
        xT = 0x55a8b00f8da1d44e62f6b3b25316212e39540dc861c89575bb8cf92e35e0986b
        yT = 0x5421c3209c2d6c704835d82ac4c3dd90f61a8a52598b9e7ab656e9d8c8b24316
        s = P256.pointFromIntegers (xS, yS)
        t = P256.pointFromIntegers (xT, yT)
        r1 = 
            case P256.scalarFromInteger 7 of 
                CryptoFailed err    -> error ("cannot convert scalar: " ++ show err)
                CryptoPassed scalar -> scalar

        r2 = 
            case P256.scalarFromInteger 11 of 
                CryptoFailed err    -> error ("cannot convert scalar: " ++ show err)
                CryptoPassed scalar -> scalar

        -- s = P256.pointMul r1 P256.pointBase
        -- t = P256.pointMul r2 P256.pointBase
        n1 = 
            let a = 0x2ba9daf2363b2819e69b34a39cf496c2458a9b2a21505ea9e7b7cbca42dc7435
             in case P256.scalarFromInteger a of
                            CryptoFailed err    -> error ("cannot convert scalar: " ++ show err)
                            CryptoPassed scalar -> scalar
        n2 = 
            let b = 0xf054a7f60d10b8c2cf847ee90e9e029f8b0e971b09ca5f55c4d49921a11fadc1
             in case P256.scalarFromInteger b of
                            CryptoFailed err    -> error ("cannot convert scalar: " ++ show err)
                            CryptoPassed scalar -> scalar

bench result is:

benchmarked ECC/pointAddTwoMuls-baseline
time                 5.404 ms   (4.558 ms .. 6.660 ms)
                     0.826 R²   (0.757 R² .. 0.992 R²)
mean                 5.183 ms   (4.837 ms .. 5.757 ms)
std dev              1.291 ms   (822.2 μs .. 1.849 ms)
variance introduced by outliers: 91% (severely inflated)

benchmarked ECC/pointAddTwoMuls-optimized
time                 2.543 ms   (2.432 ms .. 2.654 ms)
                     0.985 R²   (0.972 R² .. 0.995 R²)
mean                 3.422 ms   (2.941 ms .. 4.578 ms)
std dev              2.547 ms   (983.8 μs .. 5.538 ms)
variance introduced by outliers: 98% (severely inflated)

benchmarked ECC/pointAdd-ECC
time                 3.961 μs   (3.943 μs .. 3.979 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 3.988 μs   (3.974 μs .. 4.013 μs)
std dev              61.50 ns   (40.65 ns .. 100.1 ns)

benchmarked ECC/pointMul-ECC
time                 2.227 ms   (2.148 ms .. 2.295 ms)
                     0.990 R²   (0.979 R² .. 0.998 R²)
mean                 2.578 ms   (2.404 ms .. 2.878 ms)
std dev              823.7 μs   (461.4 μs .. 1.175 ms)
variance introduced by outliers: 96% (severely inflated)

benchmarked P256/pointAddTwoMuls-P256
time                 2.903 ms   (1.650 ms .. 3.975 ms)
                     0.675 R²   (0.595 R² .. 0.979 R²)
mean                 2.349 ms   (2.020 ms .. 3.197 ms)
std dev              1.698 ms   (762.4 μs .. 3.235 ms)
variance introduced by outliers: 98% (severely inflated)

benchmarked P256/pointAdd-P256
time                 1.131 ms   (798.2 μs .. 1.462 ms)
                     0.742 R²   (0.680 R² .. 0.964 R²)
mean                 845.9 μs   (778.8 μs .. 974.9 μs)
std dev              300.5 μs   (154.5 μs .. 482.7 μs)
variance introduced by outliers: 97% (severely inflated)

benchmarked P256/pointMul-P256
time                 796.3 μs   (541.4 μs .. 1.078 ms)
                     0.633 R²   (0.472 R² .. 0.772 R²)
mean                 730.9 μs   (634.6 μs .. 837.6 μs)
std dev              317.4 μs   (254.0 μs .. 402.4 μs)
variance introduced by outliers: 97% (severely inflated)

ocheron commented 5 years ago

Nothing wrong, P256.pointAdd is simply not good.

Contributions to improve are welcome, including benchmark code showing the issue (but can be simplified with throwCryptoError).

3for commented 5 years ago

P256 benchmark code added in Pull Requst 288.

ocheron commented 5 years ago

Regarding execution time, the current P256.pointAdd is especially slow because it uses scalar multiplication calls to convert both inputs to projective coordinates. Instead Z coordinate could just be set to kOne.

And the call to point_add_or_double_vartime could be changed to a function specialized for Z=1 like http://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-3.html#addition-mmadd-2007-bl. This will remove some more field operations.

In the end I don't think this can beat ECC performance because P256 is only 32 bits. But performance will be much better than today.

3for commented 5 years ago

@ocheron Thanks so much for your link. It's great. And I've removed the warnings following your advice in the Pull Requst 288.

ocheron commented 5 years ago

Fixed in #291: P256.pointAdd does not use scalar multiplication anymore and is faster than P256.pointMul.

P256/pointAddTwoMuls-P256                mean 898.9 μs  ( +- 865.8 ns  )
P256/pointAdd-P256                       mean 27.98 μs  ( +- 45.26 ns  )
P256/pointMul-P256                       mean 435.6 μs  ( +- 744.6 ns  )

P256.pointAdd is still slower than ECC.pointAdd, mostly due to doing field inversion in constant time (whereas ECC.pointAdd uses variable-time Euclid).

haskell-crypto / cryptonite

benchmark P256.pointAdd and ECC.pointAdd #286