haskell-crypto / cryptonite

lowlevel set of cryptographic primitives for haskell
Other
226 stars 139 forks source link

Improve Ed448 implementation #84

Closed centromere closed 7 years ago

centromere commented 8 years ago

The current Ed448 implementation was taken from here, however an improved version with better performance is located here.

Unfortunately, due to the way libdecaf is structured, extracting only the x448 operation is not trivial. It is not neatly contained within two files.

vincenthz commented 8 years ago

I had a bit of a look, but it is definitely non-trivial to do

ocheron commented 7 years ago

Has anyone continued working on this?

With some part of this new implementation I managed to replace all of curve25519/448: ocheron:ed448

This is interesting because it gives all EdDSA variants that are currently missing.

ocheron commented 7 years ago

Additional notes:

ocheron commented 7 years ago

Compilation issues with gcc should be fixed now (thanks to a change upstream). I used cacophony benchmarks to see how implementations perform. The improvement for 25519 is modest, but indeed is really good for 448.

Current code:

benchmarking Noise_NN_25519_ChaChaPoly_SHA256
time                 297.2 μs   (296.9 μs .. 297.6 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 297.3 μs   (296.9 μs .. 297.6 μs)
std dev              1.116 μs   (912.5 ns .. 1.384 μs)

benchmarking Noise_NN_448_ChaChaPoly_SHA256
time                 2.047 ms   (2.043 ms .. 2.050 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.053 ms   (2.049 ms .. 2.057 ms)
std dev              12.51 μs   (10.10 μs .. 15.81 μs)

New code:

benchmarking Noise_NN_25519_ChaChaPoly_SHA256
time                 272.2 μs   (271.9 μs .. 272.5 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 272.5 μs   (272.2 μs .. 272.8 μs)
std dev              959.8 ns   (800.6 ns .. 1.207 μs)

benchmarking Noise_NN_448_ChaChaPoly_SHA256
time                 622.1 μs   (621.4 μs .. 622.7 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 622.3 μs   (621.8 μs .. 623.4 μs)
std dev              2.397 μs   (1.827 μs .. 3.969 μs)

New code, with flags support_avx2 and support_bmi2:

benchmarking Noise_NN_25519_ChaChaPoly_SHA256
time                 258.5 μs   (258.2 μs .. 258.7 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 258.8 μs   (258.5 μs .. 259.0 μs)
std dev              840.9 ns   (669.3 ns .. 1.076 μs)

benchmarking Noise_NN_448_ChaChaPoly_SHA256
time                 601.1 μs   (600.7 μs .. 601.6 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 601.8 μs   (601.1 μs .. 602.8 μs)
std dev              2.589 μs   (2.048 μs .. 3.620 μs)

Tests are also successful on Linux 32 bits and Windows 8.1 64 bits.

Code size is large but about the same of what we already have with ed25519-donna:

 48K   cbits/curve25519
336K   cbits/ed25519
 12K   cbits/ed448

104K   cbits/decaf/curve25519
136K   cbits/decaf/ed448goldilocks
140K   cbits/decaf/include
 48K   cbits/decaf/p25519
 48K   cbits/decaf/p448
vincenthz commented 7 years ago

This is really nice numbers; I'm worried about environment, compiler and cabal issues related to using advanced flags like avx, etc.

ocheron commented 7 years ago

Yes I'm worried too. Since my previous comment I tried building on OpenBSD 5.9. Compilation fails, because of a missing Intel intrinsics header file. I'll investigate with a lower priority. This probably can wait a little. Maybe there will be some simplifications upstream or progress elsewhere. There is also a ref64 architecture that could help.

ocheron commented 7 years ago

Giving it another try I have better understanding of of what happens on OpenBSD. I don't think we can get arch_x86_64 to work there, however performance of arch_ref64, with no inline assembly, is already a good improvement:

benchmarking Noise_NN_448_ChaChaPoly_SHA256
time                 828.8 μs   (827.8 μs .. 829.8 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 828.5 μs   (827.5 μs .. 829.8 μs)
std dev              4.043 μs   (3.272 μs .. 5.134 μs)

With arch_ref64 performance of 25519 would be must worse than what we have today, but we can very well keep curve25519-donna. Missing variants like Ed25519ctx and Ed25519ph don't get much success for TLS and certificates anyway.

So I should be able to finalize a PR which is 448-only and with no build feature more advanced that what we already have.

ocheron commented 7 years ago

PR merged with 32/64-bit reference implementations.

@centromere Did you test it too? Speed improvement is supposed to be at least x2.

centromere commented 7 years ago

At first glance it looks good, but unfortunately I am finding it very difficult to get low-variance results.

Before your changes:

benchmarking Noise_NN_448_ChaChaPoly_SHA256
time                 2.628 ms   (2.602 ms .. 2.645 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 2.667 ms   (2.639 ms .. 2.770 ms)
std dev              156.5 μs   (40.14 μs .. 320.1 μs)
variance introduced by outliers: 41% (moderately inflated)

After your changes:

benchmarking Noise_NN_448_ChaChaPoly_SHA256
time                 914.5 μs   (905.9 μs .. 919.4 μs)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 930.6 μs   (919.0 μs .. 975.6 μs)
std dev              74.87 μs   (3.439 μs .. 144.5 μs)
variance introduced by outliers: 64% (severely inflated)
ocheron commented 7 years ago

Thanks for looking. The variance reminds me what I saw on a laptop. I guess some platforms are not good for benchmarks.