vincenthz / hs-cipher-aes

DEPRECATED - use cryptonite - a comprehensive fast AES implementation for haskell that supports aesni and advanced cryptographic modes.
Other
22 stars 15 forks source link

Need a trampoline to pick between NI/gladman #8

Closed TomMD closed 11 years ago

TomMD commented 11 years ago

I see NI is enabled by a flag (and CPU test). I've previously made a trampoline for this type of case. If you are sure turning on NI and running on a non-NI CPU doesn't harm performance then we can delete this issue, if not then I will find time to make a trampoline patch.

TomMD commented 11 years ago

It seems a bit noisy, but my previous AES work, based on an older version of your NI and gladmans AES with a trampoline, does seem to show an improvement so perhaps I'll port a patch.

AES128_tmd:  120.4667 ns  126.5365 ns  174.2186 ns  349.0955 ns  588.7610 ns  2.029187 us  7.677496 us
               126.7 M/s    241.2 M/s    700.7 M/s   1398.7 M/s   1658.7 M/s   1925.0 M/s   2035.2 M/s
AES128:  169.6261 ns  179.7410 ns  207.4656 ns  358.6883 ns  541.5221 ns  1.690435 us  6.132310 us
                90.0 M/s    169.8 M/s    588.4 M/s   1361.3 M/s   1803.4 M/s   2310.8 M/s   2548.0 M/s
vincenthz commented 11 years ago

i'm not sure what your benchmark is benchmarking exactly, but 2 things on my list are:

TomMD commented 11 years ago

The benchmark

is showing the performance impact of my main difference from your cipher-aes pacakage is the trampolines. By 'trampolines' I am referring to the C routines and function pointers that first detect if NI is available then set the function pointer to always use NI (or not) without any further checks. This is in contrast to dynamically checking for NI on every single AES call.

In pseduo-C:

void *(*aes_encrypt_ecb_ptr(key *k, uint8_t* out, uint8_t* in, uint32_t nrBlocks)
                    = &detect_ni_and_encrypt;

detect_ni_and_encrypt(...)
{
    if(has_ni()) aes_encrypt_ecb_ptr = &aes_ni_encrypt_ecb;
    (*aes_encrypt_ecb_ptr)(...);
}

aes_encrypt_ecb(...)
{
    (*aes_encrypt_ecb_ptr)(...);
}

After updating to use your latest NI code, the trampoline method compares favorably to dynamically checking at every single call. In the below the most sensitive benchmark is the 16byte case (because the time to check for NI is the only difference in the computation, the actual cipher routine is identical):

  cipher|     16 bytes     32 bytes     64 bytes    512 bytes   1024 bytes   4096 bytes  16384 bytes
==================================================================================
AES128_tmd:  123.2306 ns  125.3117 ns  161.5419 ns  312.6124 ns  510.0473 ns  1.688639 us  6.348553 us
             123.8 M/s    243.5 M/s    755.7 M/s   1561.9 M/s   1914.7 M/s   2313.3 M/s   2461.2 M/s
AES128:  179.7151 ns  180.8886 ns  215.2212 ns  367.9218 ns  569.0574 ns  1.758038 us  6.448365 us
              84.9 M/s    168.7 M/s    567.2 M/s   1327.1 M/s   1716.1 M/s   2221.9 M/s   2423.1 M/s

Moving Forward

If we wish to update cipher-aes, we need to change the way it calls the NI routines. Right now it performs a dynamic check on the size of the AES key then dispatch based on that check. This is needed because we have one type for all routines (Key could mean 128,192, or 256 bits). If we had separate keys in Haskell, calling separate (but mostly identical) routines in C, then this dynamic check wouldn't be needed and we could reap our benefits.

An alternate, short-term, option is that I flesh-out and release the current code as "cipher-aes128". I'd have to add useful modes of operation, such as GCM, but I already have (and the benchmark used) the crypto-api interface.

TomMD commented 11 years ago

I've implemented the code I would use and made a repo: https://github.com/TomMD/cipher-aes128

This is still painfully slow for GCM, but at least the basic framework is there, which saves almost 50ns (33%) when dealing with small payloads.

vincenthz commented 11 years ago

I've been looking at your trampoline code, and will definitely add something similar.

Althought I can't reasonably export more C functions (for support reason) that do almost the same thing just to differentiate the key size, and I can't remove the support for 192/256 bits support; so i need something slightly different. One thing that's preventing a simpler trampoline is the lack of support for 192 and 256 bits with aesni; maybe this is where i'll carry on.

For GCM, cipher-aes itself is slow too, and i think this is mainly due to my crappy GF implementation (and no pcmulqdq implementation yet).

vincenthz commented 11 years ago

i think it's mostly fixed in 0.2.x now. Using more table, with a much shorter initialization phase.

TomMD commented 11 years ago

Yes, it is done and is a nice solution. Good work. I ran some benchmarks a little while ago and everything appeared to be competitive.