Closed centromere closed 7 years ago
I had a bit of a look, but it is definitely non-trivial to do
Has anyone continued working on this?
With some part of this new implementation I managed to replace all of curve25519/448: ocheron:ed448
This is interesting because it gives all EdDSA variants that are currently missing.
Additional notes:
-mbmi2
is required)Compilation issues with gcc should be fixed now (thanks to a change upstream). I used cacophony benchmarks to see how implementations perform. The improvement for 25519 is modest, but indeed is really good for 448.
Current code:
benchmarking Noise_NN_25519_ChaChaPoly_SHA256
time 297.2 μs (296.9 μs .. 297.6 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 297.3 μs (296.9 μs .. 297.6 μs)
std dev 1.116 μs (912.5 ns .. 1.384 μs)
benchmarking Noise_NN_448_ChaChaPoly_SHA256
time 2.047 ms (2.043 ms .. 2.050 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 2.053 ms (2.049 ms .. 2.057 ms)
std dev 12.51 μs (10.10 μs .. 15.81 μs)
New code:
benchmarking Noise_NN_25519_ChaChaPoly_SHA256
time 272.2 μs (271.9 μs .. 272.5 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 272.5 μs (272.2 μs .. 272.8 μs)
std dev 959.8 ns (800.6 ns .. 1.207 μs)
benchmarking Noise_NN_448_ChaChaPoly_SHA256
time 622.1 μs (621.4 μs .. 622.7 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 622.3 μs (621.8 μs .. 623.4 μs)
std dev 2.397 μs (1.827 μs .. 3.969 μs)
New code, with flags support_avx2 and support_bmi2:
benchmarking Noise_NN_25519_ChaChaPoly_SHA256
time 258.5 μs (258.2 μs .. 258.7 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 258.8 μs (258.5 μs .. 259.0 μs)
std dev 840.9 ns (669.3 ns .. 1.076 μs)
benchmarking Noise_NN_448_ChaChaPoly_SHA256
time 601.1 μs (600.7 μs .. 601.6 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 601.8 μs (601.1 μs .. 602.8 μs)
std dev 2.589 μs (2.048 μs .. 3.620 μs)
Tests are also successful on Linux 32 bits and Windows 8.1 64 bits.
Code size is large but about the same of what we already have with ed25519-donna:
48K cbits/curve25519
336K cbits/ed25519
12K cbits/ed448
104K cbits/decaf/curve25519
136K cbits/decaf/ed448goldilocks
140K cbits/decaf/include
48K cbits/decaf/p25519
48K cbits/decaf/p448
This is really nice numbers; I'm worried about environment, compiler and cabal issues related to using advanced flags like avx, etc.
Yes I'm worried too. Since my previous comment I tried building on OpenBSD 5.9. Compilation fails, because of a missing Intel intrinsics header file. I'll investigate with a lower priority. This probably can wait a little. Maybe there will be some simplifications upstream or progress elsewhere. There is also a ref64 architecture that could help.
Giving it another try I have better understanding of of what happens on OpenBSD. I don't think we can get arch_x86_64 to work there, however performance of arch_ref64, with no inline assembly, is already a good improvement:
benchmarking Noise_NN_448_ChaChaPoly_SHA256
time 828.8 μs (827.8 μs .. 829.8 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 828.5 μs (827.5 μs .. 829.8 μs)
std dev 4.043 μs (3.272 μs .. 5.134 μs)
With arch_ref64 performance of 25519 would be must worse than what we have today, but we can very well keep curve25519-donna. Missing variants like Ed25519ctx and Ed25519ph don't get much success for TLS and certificates anyway.
So I should be able to finalize a PR which is 448-only and with no build feature more advanced that what we already have.
PR merged with 32/64-bit reference implementations.
@centromere Did you test it too? Speed improvement is supposed to be at least x2.
At first glance it looks good, but unfortunately I am finding it very difficult to get low-variance results.
Before your changes:
benchmarking Noise_NN_448_ChaChaPoly_SHA256
time 2.628 ms (2.602 ms .. 2.645 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 2.667 ms (2.639 ms .. 2.770 ms)
std dev 156.5 μs (40.14 μs .. 320.1 μs)
variance introduced by outliers: 41% (moderately inflated)
After your changes:
benchmarking Noise_NN_448_ChaChaPoly_SHA256
time 914.5 μs (905.9 μs .. 919.4 μs)
0.999 R² (0.996 R² .. 1.000 R²)
mean 930.6 μs (919.0 μs .. 975.6 μs)
std dev 74.87 μs (3.439 μs .. 144.5 μs)
variance introduced by outliers: 64% (severely inflated)
Thanks for looking. The variance reminds me what I saw on a laptop. I guess some platforms are not good for benchmarks.
The current Ed448 implementation was taken from here, however an improved version with better performance is located here.
Unfortunately, due to the way libdecaf is structured, extracting only the x448 operation is not trivial. It is not neatly contained within two files.