Closed TomMD closed 11 years ago
It seems a bit noisy, but my previous AES work, based on an older version of your NI and gladmans AES with a trampoline, does seem to show an improvement so perhaps I'll port a patch.
AES128_tmd: 120.4667 ns 126.5365 ns 174.2186 ns 349.0955 ns 588.7610 ns 2.029187 us 7.677496 us
126.7 M/s 241.2 M/s 700.7 M/s 1398.7 M/s 1658.7 M/s 1925.0 M/s 2035.2 M/s
AES128: 169.6261 ns 179.7410 ns 207.4656 ns 358.6883 ns 541.5221 ns 1.690435 us 6.132310 us
90.0 M/s 169.8 M/s 588.4 M/s 1361.3 M/s 1803.4 M/s 2310.8 M/s 2548.0 M/s
i'm not sure what your benchmark is benchmarking exactly, but 2 things on my list are:
is showing the performance impact of my main difference from your cipher-aes pacakage is the trampolines. By 'trampolines' I am referring to the C routines and function pointers that first detect if NI is available then set the function pointer to always use NI (or not) without any further checks. This is in contrast to dynamically checking for NI on every single AES call.
void *(*aes_encrypt_ecb_ptr(key *k, uint8_t* out, uint8_t* in, uint32_t nrBlocks)
= &detect_ni_and_encrypt;
detect_ni_and_encrypt(...)
{
if(has_ni()) aes_encrypt_ecb_ptr = &aes_ni_encrypt_ecb;
(*aes_encrypt_ecb_ptr)(...);
}
aes_encrypt_ecb(...)
{
(*aes_encrypt_ecb_ptr)(...);
}
After updating to use your latest NI code, the trampoline method compares favorably to dynamically checking at every single call. In the below the most sensitive benchmark is the 16byte case (because the time to check for NI is the only difference in the computation, the actual cipher routine is identical):
cipher| 16 bytes 32 bytes 64 bytes 512 bytes 1024 bytes 4096 bytes 16384 bytes
==================================================================================
AES128_tmd: 123.2306 ns 125.3117 ns 161.5419 ns 312.6124 ns 510.0473 ns 1.688639 us 6.348553 us
123.8 M/s 243.5 M/s 755.7 M/s 1561.9 M/s 1914.7 M/s 2313.3 M/s 2461.2 M/s
AES128: 179.7151 ns 180.8886 ns 215.2212 ns 367.9218 ns 569.0574 ns 1.758038 us 6.448365 us
84.9 M/s 168.7 M/s 567.2 M/s 1327.1 M/s 1716.1 M/s 2221.9 M/s 2423.1 M/s
If we wish to update cipher-aes, we need to change the way it calls the NI routines. Right now it performs a dynamic check on the size of the AES key then dispatch based on that check. This is needed because we have one type for all routines (Key could mean 128,192, or 256 bits). If we had separate keys in Haskell, calling separate (but mostly identical) routines in C, then this dynamic check wouldn't be needed and we could reap our benefits.
An alternate, short-term, option is that I flesh-out and release the current code as "cipher-aes128". I'd have to add useful modes of operation, such as GCM, but I already have (and the benchmark used) the crypto-api interface.
I've implemented the code I would use and made a repo: https://github.com/TomMD/cipher-aes128
This is still painfully slow for GCM, but at least the basic framework is there, which saves almost 50ns (33%) when dealing with small payloads.
I've been looking at your trampoline code, and will definitely add something similar.
Althought I can't reasonably export more C functions (for support reason) that do almost the same thing just to differentiate the key size, and I can't remove the support for 192/256 bits support; so i need something slightly different. One thing that's preventing a simpler trampoline is the lack of support for 192 and 256 bits with aesni; maybe this is where i'll carry on.
For GCM, cipher-aes itself is slow too, and i think this is mainly due to my crappy GF implementation (and no pcmulqdq implementation yet).
i think it's mostly fixed in 0.2.x now. Using more table, with a much shorter initialization phase.
Yes, it is done and is a nice solution. Good work. I ran some benchmarks a little while ago and everything appeared to be competitive.
I see NI is enabled by a flag (and CPU test). I've previously made a trampoline for this type of case. If you are sure turning on NI and running on a non-NI CPU doesn't harm performance then we can delete this issue, if not then I will find time to make a trampoline patch.