rnpgp / rnp

RNP: high performance C++ OpenPGP library used by Mozilla Thunderbird
https://www.rnpgp.org

Provide performance improvements to SM3 and SM4 #492

Open kriskwiatkowski opened 6 years ago

kriskwiatkowski commented 6 years ago

Description

In order to improve the performance of SM3 and SM4 (and thus also SM2 when used with SM3), rnp should take advantage of the hardware acceleration available on modern CPUs.

botan already uses hardware acceleration for some algorithms (e.g. AES), and rnp takes advantage of that fact.

Similar work was done in GmSSL. It could be an interesting first step for this PR to use GmSSL and check how much of a boost an ASM implementation can give us (for reference).

The work itself would obviously have to be done in botan.

@randombit Does it make sense?

ronaldtse commented 6 years ago

Certainly makes sense. However, before we go down the path of optimizing crypto, we should probably first get a full view of the performance of all our components, like what @ni4 did with AES/CAST5.

This way we'll know what kind of MB/s people should expect from a given combination, say OpenPGP encryption using RSA/SHA256/AES256 vs. SM2/SM3/SM4.

ni4 commented 6 years ago

@ronaldtse I will provide some more detailed information later. For instance, of the ~9 seconds it takes to encrypt a 500 MB file with AES, only ~1 second is spent in botan/AES.
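(Taking those rough figures at face value: 500 MB in ~9 s is ~55 MB/s end to end, while the AES step alone runs at roughly 500 MB/s, so about 8 of the 9 seconds are spent outside botan/AES.)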

kriskwiatkowski commented 6 years ago

Yes, I agree. We need a reference baseline (and tools to build it). That's kind of a prerequisite.

BTW1: We would need the same thing for CAST5. I think it is important, as it was the default algorithm for RFC 4880 (or am I wrong?), which means people out there are using it. So the question is: do you agree with this statement?

BTW2: Initially I was surprised by the comparison of rnp vs. gnupg. I thought gnupg could go much faster (like 10x), since libgcrypt implements most of the algorithms in ASM.

I did my own comparison and the results are:

#1. Small file symmetric encryption
ENCRYPT-SMALL-BINARY:RNP:47.22 runs/sec
ENCRYPT-SMALL-BINARY:GPG:209.57 runs/sec
ENCRYPT-SMALL-BINARY:RNP vs GPG:4.44
ENCRYPT-SMALL-ARMOUR:RNP:47.80 runs/sec
ENCRYPT-SMALL-ARMOUR:GPG:209.95 runs/sec
ENCRYPT-SMALL-ARMOUR:RNP vs GPG:4.39
#2. Large file symmetric encryption
ENCRYPT-AES128-BINARY:RNP:195.90 MB/sec
ENCRYPT-AES128-BINARY:GPG:280.26 MB/sec
ENCRYPT-AES128-BINARY:RNP vs GPG:1.43
ENCRYPT-AES192-BINARY:RNP:183.76 MB/sec
ENCRYPT-AES192-BINARY:GPG:256.56 MB/sec
ENCRYPT-AES192-BINARY:RNP vs GPG:1.40
ENCRYPT-AES256-BINARY:RNP:176.28 MB/sec
ENCRYPT-AES256-BINARY:GPG:234.73 MB/sec
ENCRYPT-AES256-BINARY:RNP vs GPG:1.33
ENCRYPT-TWOFISH-BINARY:RNP:103.01 MB/sec
ENCRYPT-TWOFISH-BINARY:GPG:123.11 MB/sec
ENCRYPT-TWOFISH-BINARY:RNP vs GPG:1.20
ENCRYPT-BLOWFISH-BINARY:RNP:74.54 MB/sec
ENCRYPT-BLOWFISH-BINARY:GPG:94.21 MB/sec
ENCRYPT-BLOWFISH-BINARY:RNP vs GPG:1.26
ENCRYPT-CAST5-BINARY:RNP:67.09 MB/sec
ENCRYPT-CAST5-BINARY:GPG:86.74 MB/sec
ENCRYPT-CAST5-BINARY:RNP vs GPG:1.29
ENCRYPT-CAMELLIA128-BINARY:RNP:81.17 MB/sec
ENCRYPT-CAMELLIA128-BINARY:GPG:107.52 MB/sec
ENCRYPT-CAMELLIA128-BINARY:RNP vs GPG:1.32
ENCRYPT-CAMELLIA192-BINARY:RNP:67.80 MB/sec
ENCRYPT-CAMELLIA192-BINARY:GPG:88.20 MB/sec
ENCRYPT-CAMELLIA192-BINARY:RNP vs GPG:1.30
ENCRYPT-CAMELLIA256-BINARY:RNP:69.12 MB/sec
ENCRYPT-CAMELLIA256-BINARY:GPG:89.22 MB/sec
ENCRYPT-CAMELLIA256-BINARY:RNP vs GPG:1.29
#3. Large file armoured encryption
ENCRYPT-LARGE-ARMOUR:RNP:107.00 MB/sec
ENCRYPT-LARGE-ARMOUR:GPG:83.71 MB/sec
ENCRYPT-LARGE-ARMOUR:RNP vs GPG:0.78

That's after applying @ni4's patch (which indeed is cool!).

So: same results for rnp, but different for gpg. I think that in the previous comparison gnupg was compiled without hardware acceleration (hence my question in the PR), while RNP (botan) is compiled with hardware acceleration by default. So indeed, there is more that could be done in terms of performance, and I think it can be improved both in the crypto and in rnp itself (independently).
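(A quick way to double-check that is to ask Botan what it actually detects at runtime. A minimal sketch, assuming a Botan 2 build that still installs cpuid.h as a public header; the header location and the exact set of has_*() helpers vary between Botan versions.)

```cpp
// check_cpuid.cpp -- sketch: report which CPU extensions this Botan build
// detects at runtime. Assumes Botan 2.x with <botan/cpuid.h> installed as a
// public header (in newer releases the CPUID class may be internal-only).
#include <botan/cpuid.h>
#include <iostream>

int main()
   {
   std::cout << "AES-NI: " << Botan::CPUID::has_aes_ni() << "\n";
   std::cout << "SSE2:   " << Botan::CPUID::has_sse2()   << "\n";
   std::cout << "AVX2:   " << Botan::CPUID::has_avx2()   << "\n";
   return 0;
   }
```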

randombit commented 6 years ago

Well, I agree that with crypto, more faster = more better, so I'm behind anything that improves the situation. But I am not sure pure asm is the right approach. For one, it can be a big maintenance hassle, and it only benefits one particular platform at a time. To be broadly useful the work has to be replicated for the x86-32 ELF ABI, the x86-32 Windows ABI, x86-64, aarch32, aarch64, etc. This works OK for OpenSSL, because they have someone who basically writes/maintains their fast asm code full time. But there is basically only one of me at the moment.

My general approach to Botan perf is to work with the compiler, writing C++ that's easier to optimize, and to use asm only as a last resort, just where it's truly required to work around something the compiler isn't getting right. As an example, I did some quick experiments with SM3, just restructuring the code, and improved performance by ~15% (see https://github.com/randombit/botan/pull/1248). I think even more is possible, but I don't have time to look at it further today.
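To give a concrete flavor of that kind of restructuring (illustrative only, and not necessarily what the linked PR changes): SM3's round functions FF_j and GG_j for rounds 16-63 are usually written as the textbook majority/choose expressions, and rewriting them into bitwise-equivalent forms with fewer operations is exactly the sort of strength reduction a compiler won't always find on its own. A minimal, self-checking sketch:

```cpp
// sm3_boolfn_sketch.cpp -- illustrative only: strength-reduce SM3's FF/GG
// boolean functions (rounds 16..63) from the textbook form to a cheaper,
// bitwise-equivalent form. Not taken from Botan; just the style of rewrite.
#include <cassert>
#include <cstdint>

// Textbook forms from the SM3 specification (rounds 16..63):
static uint32_t ff_ref(uint32_t x, uint32_t y, uint32_t z)
   { return (x & y) | (x & z) | (y & z); }          // majority: 5 ops
static uint32_t gg_ref(uint32_t x, uint32_t y, uint32_t z)
   { return (x & y) | (~x & z); }                   // choose: 4 ops

// Equivalent forms with fewer operations:
static uint32_t ff_opt(uint32_t x, uint32_t y, uint32_t z)
   { return ((x ^ y) & (x ^ z)) ^ x; }              // majority: 4 ops
static uint32_t gg_opt(uint32_t x, uint32_t y, uint32_t z)
   { return ((y ^ z) & x) ^ z; }                    // choose: 3 ops

int main()
   {
   // All operations are bitwise and independent per bit position, so
   // checking the 8 single-bit combinations proves equivalence for all
   // 32-bit inputs.
   for(uint32_t x = 0; x <= 1; ++x)
      for(uint32_t y = 0; y <= 1; ++y)
         for(uint32_t z = 0; z <= 1; ++z)
            {
            assert(ff_ref(x, y, z) == ff_opt(x, y, z));
            assert(gg_ref(x, y, z) == gg_opt(x, y, z));
            }
   return 0;
   }
```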

Similarly, for SM4 I think a lot can be done with unrolling, larger tables (plus appropriate side-channel countermeasures), a SIMD implementation, etc.
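As a sketch of the larger-tables idea: SM4's round transform is T(x) = L(tau(x)), where tau applies the S-box to each byte and L is an XOR of rotations, so the substitution and the linear layer can be folded into 256-entry 32-bit tables built at startup. The snippet below uses a stand-in identity S-box purely to stay self-contained and checkable; the real SM4 S-box values would be dropped in, and plain table lookups would of course still need the side-channel countermeasures mentioned above.

```cpp
// sm4_ttable_sketch.cpp -- illustrative only: fold SM4's byte substitution
// and linear layer L into 32-bit lookup tables ("T-tables"). The S-box here
// is a stand-in (identity) so the sketch is self-contained; the technique is
// unchanged when the real SM4 S-box is used.
#include <cassert>
#include <cstdint>

static uint32_t rotl(uint32_t x, unsigned n)
   { return (x << n) | (x >> (32 - n)); }

// SM4's linear layer for the data path.
static uint32_t L(uint32_t x)
   { return x ^ rotl(x, 2) ^ rotl(x, 10) ^ rotl(x, 18) ^ rotl(x, 24); }

// Stand-in S-box (identity); replace with the real SM4 S-box.
static uint8_t SBOX[256];

// Reference: apply the S-box to each byte (tau), then L.
static uint32_t T_ref(uint32_t x)
   {
   const uint32_t s = (uint32_t(SBOX[(x >> 24) & 0xFF]) << 24) |
                      (uint32_t(SBOX[(x >> 16) & 0xFF]) << 16) |
                      (uint32_t(SBOX[(x >>  8) & 0xFF]) <<  8) |
                      (uint32_t(SBOX[(x      ) & 0xFF]));
   return L(s);
   }

// Merged table: T0[b] = L(S[b] << 24). The other byte positions are just
// rotations of T0, because L is itself a sum of rotations.
static uint32_t T0[256];

static void build_tables()
   {
   for(unsigned b = 0; b != 256; ++b)
      T0[b] = L(uint32_t(SBOX[b]) << 24);
   }

static uint32_t T_tab(uint32_t x)
   {
   return T0[(x >> 24) & 0xFF] ^
          rotl(T0[(x >> 16) & 0xFF], 24) ^
          rotl(T0[(x >>  8) & 0xFF], 16) ^
          rotl(T0[(x      ) & 0xFF],  8);
   }

int main()
   {
   for(unsigned i = 0; i != 256; ++i)
      SBOX[i] = uint8_t(i);                 // stand-in S-box
   build_tables();

   uint32_t x = 0x01234567;
   for(int i = 0; i != 1000; ++i)
      {
      assert(T_ref(x) == T_tab(x));         // table path matches direct path
      x = x * 2654435761u + 1;              // cheap varying test inputs
      }
   return 0;
   }
```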

ronaldtse commented 6 years ago

@flowher's results are a bit disappointing for us 😭, but come to think of it, we're generally quite close, and we are still faster for armored output. GPG is about 20-40% faster on larger files, a gap we should be able to close with some profiling effort.

For smaller files, my hunch is that the performance hit is in our own setup, not in the crypto. But again, we will need a full picture and some profiling to find out.

I think we should take @randombit's approach to performance with Botan as well, since it would be difficult for us to maintain asm... we could probably get more juice out of optimizing at a higher level. Unless @flowher has the time to do otherwise 😉

To answer @flowher's BTWs:

kriskwiatkowski commented 6 years ago

@ronaldtse My guess is that the problem with small files will be fairly easy to fix (it seems we are doing something wrong).

I'm happy to write and support asm stuff in botan (or to support good crypto in general). But @randombit actually makes a good point here: optimizing the C code will be more beneficial. I kind of took it for granted that the C code is already optimized and doesn't need perf improvements, which obviously isn't the case, so I can start with that. Nevertheless, it would be interesting to see a comparison of pure C vs. pure ASM; if the difference is small then there is no sense in maintaining SM3 asm.
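For the Botan side of such a comparison, a rough harness along these lines would give MB/s numbers to compare a plain build against an ASM-accelerated one (a sketch using Botan's public HashFunction API; it assumes the hash is registered under the name "SM3"):

```cpp
// sm3_speed_sketch.cpp -- rough throughput measurement for SM3 via Botan's
// public HashFunction interface, to compare two builds of the same library.
#include <botan/hash.h>
#include <chrono>
#include <iostream>
#include <vector>

int main()
   {
   auto sm3 = Botan::HashFunction::create("SM3");
   if(!sm3)
      {
      std::cerr << "SM3 not available in this Botan build\n";
      return 1;
      }

   const size_t buf_size = 1 << 20;            // 1 MiB per update call
   const size_t iterations = 512;              // 512 MiB total
   std::vector<uint8_t> buf(buf_size, 0xAB);

   const auto start = std::chrono::steady_clock::now();
   for(size_t i = 0; i != iterations; ++i)
      sm3->update(buf.data(), buf.size());
   const auto digest = sm3->final();
   const auto stop = std::chrono::steady_clock::now();
   (void)digest;

   const double secs = std::chrono::duration<double>(stop - start).count();
   const double mib = double(iterations * buf_size) / (1024.0 * 1024.0);
   std::cout << "SM3: " << (mib / secs) << " MiB/s\n";
   return 0;
   }
```

The botan CLI's speed command should give comparable numbers without writing any code, if the CLI is built.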

@ni4 are you already looking at what causes the performance degradation in "ENCRYPT-SMALL-BINARY"? Do you plan to? (If not, I'll take a look.)

randombit commented 6 years ago

Yeah, to be clear: if pure asm can produce a large bump in perf that we can't get any other way (refactoring C++, using tiny bits of asm, intrinsics, etc.), then I'm fine with asm. I just normally find the gain isn't worth it. Botan used to have a bunch of handwritten asm that was faster than what GCC 2.95 could generate, but then GCC caught up and surpassed it on all of them.

kriskwiatkowski commented 6 years ago

I see. Indeed, after doing some tests earlier in the day, I'm quite impressed by gcc 7.1 on Intel (less so with Clang).

ni4 commented 6 years ago

@flowher Yeah, I'm still looking at performance things, but I will only be semi-available for the next few days. Regarding asm: I would follow the C side. Encryption itself is actually not 99% of the time, and we cannot write all the stuff in asm, counting every byte.