Padlock: SHA1/256/512 code is never used

ValdikSS commented 1 year ago

It seems that the SHA1 and SHA256 VIA Padlock functions were implemented in assembly and defined as C functions, but never exported by the engine. As a result, these functions are never hardware-accelerated and the assembly implementation is dead code.

https://github.com/openssl/openssl/blob/38fc02a7084438e384e152effa84d4bf085783c9/engines/e_padlock.c#L221-L224

Moreover, the assembly contains an implementation of SHA512, which is not even mentioned in the .c file. https://github.com/openssl/openssl/blob/38fc02a7084438e384e152effa84d4bf085783c9/engines/asm/e_padlock-x86.pl#L579

As seen with current master, both openssl speed -evp sha256 and openssl speed -evp sha256 -engine padlock produce the same unimpressive result:

$ openssl speed -evp sha256 -engine padlock
Engine "padlock" set.
Doing sha256 for 3s on 16 size blocks: 500166 sha256's in 3.00s
Doing sha256 for 3s on 64 size blocks: 317539 sha256's in 2.99s
Doing sha256 for 3s on 256 size blocks: 169713 sha256's in 3.00s
Doing sha256 for 3s on 1024 size blocks: 57075 sha256's in 3.00s
Doing sha256 for 3s on 8192 size blocks: 7926 sha256's in 3.00s
Doing sha256 for 3s on 16384 size blocks: 3997 sha256's in 3.00s
version: 3.0.7
built on: Thu Jan 19 20:31:42 2023 UTC
options: bn(64,32)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -fzero-call-used-regs=used-gpr -DOPENSSL_TLS_SECURITY_LEVEL=2 -Wa,--noexecstack -g -O2 -ffile-prefix-map=/home/user/openssl/openssl-3.0.7=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_ia32cap=0x4181a7c9bfff:0x0
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256            2667.55k     6796.82k    14482.18k    19481.60k    21643.26k    21828.95k

zzl360 commented 1 year ago

IMO, the padlock code is all C and assembly and doesn't need any external binary. It can be embedded into sha-586.pl/sha-x86_64.pl/aes-586.pl and aes-x86_64.pl.

t8m commented 1 year ago

> IMO, the padlock code is all C and assembly and doesn't need any external binary. It can be embedded into sha-586.pl/sha-x86_64.pl/aes-586.pl and aes-x86_64.pl.

Yeah, this is basically an RFE. And IMO the proper way would be to use the padlock routines in the existing assembly code based on the CPU detection and not require an engine. Also if there is a reason why these cannot be integrated but must be loadable separately, they should be implemented in a provider and not an engine because we aren't likely to take any RFEs in engines. They are deprecated.

bernd-edlinger commented 1 year ago

I'm still wondering whether the padlock hardware is still manufactured... We have big maintenance issues with hardware that is not widely used: what happens when we want to fix anything in this code but cannot test it ourselves? In a way the padlock engine is a good example, as it was obviously broken all the time, and never used for years, before you found some major issues in the implementation of that engine.

zzl360 commented 1 year ago

As https://github.com/openssl/openssl/pull/5145 mentions, Zhaoxin is probably still manufacturing hardware that supports Padlock.

ValdikSS commented 1 year ago

@bernd-edlinger, as @zzl360 mentioned, Zhaoxin CPUs, which are pretty recent and actively developed as far as I can tell, support Padlock, including SM3/SM4 algorithms which were proposed in #8706.

I'm re-purposing old thin clients based on the VIA Eden Esther ULV. The CPU is slow and manages only 7-8 MB/s of unaccelerated aes-128-gcm, for example, while the machine has a gigabit NIC, which is why hardware acceleration is very useful on this device. If anyone needs access to this device, I can set up SSH and provide it indefinitely, or I can donate several devices to you.

> it was obviously broken all the time, and never used for years

Could it be that the offending code with ifndef AES_ASM just wasn't activated by default until recently and everything eventually worked correctly? I can hardly believe such a bug hasn't been spotted by Zhaoxin developers.

https://github.com/openssl/openssl/blob/55ff8fb4ed4d48cb819ff5ae5d74cc08256e7ed1/engines/e_padlock.c#L649-L654

The current engine also does not support AES-GCM, which is one of the most widespread ciphers on the internet. This could be easily hacked up by reusing the SPARC T4 code. But for GCM there is at least simply no implementation, whereas for SHA the implementation exists and is just not used.

zzl360 commented 1 year ago

@ValdikSS I'm not familiar with engines in OpenSSL. Is it the case that you must manually enable the padlock engine to get hardware acceleration on a VIA machine? Most users don't know that. What https://github.com/openssl/openssl/issues/20167#issuecomment-1408253075 suggests would enable hardware acceleration by default on VIA machines.

ValdikSS commented 1 year ago

@zzl360, yes, the engine needs to be activated either manually or in the configuration file. Right now it works pretty well overall, but without SHA or AES-GCM support.

bernd-edlinger commented 1 year ago

> Could it be that the offending code with ifndef AES_ASM just wasn't activated by default until recently and everything eventually worked correctly? I can hardly believe such a bug hasn't been spotted by Zhaoxin developers.

That is a combination of several bugs. Most importantly starting with 1.1.1 the AES_ASM macro is NEVER defined when the engines are compiled, but only when the AES C files are compiled. That broke the engine first. Then we have 87bea6550ae0dda7c40937cff2e86cc2b0b09491 which was merged to 1.1.1 but not in master. That almost fixed the engine, except the issue in the padlock_key_bswap, modulo the AES_CONST_TIME issue.
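
A minimal sketch of why a macro that is defined for only some compilation units matters. It is a hypothetical stand-in, not the real e_padlock.c logic: the function name is invented for illustration, and pairing the guard with a byte-swap step is an assumption based on the padlock_key_bswap mention above.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in, not the real e_padlock.c code.
 * The branch is chosen when THIS file is compiled, so it only sees
 * AES_ASM if the build passes -DAES_ASM to this translation unit.
 * Defining the macro while compiling other files has no effect here.
 */
static int engine_applies_bswap(void)
{
#ifndef AES_ASM
    return 1;   /* compiled without AES_ASM: byte-swap path is built in */
#else
    return 0;   /* compiled with AES_ASM: byte-swap path is left out */
#endif
}

/* Human-readable form of the compile-time decision above. */
static const char *describe_build(void)
{
    return engine_applies_bswap() ? "bswap path compiled in"
                                  : "bswap path compiled out";
}
```

Compiling the same file once with and once without -DAES_ASM flips the result of engine_applies_bswap(), regardless of how the AES library itself was built.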

bernd-edlinger commented 1 year ago

> but without SHA or AES-GCM support.

And notably the RNG support is also disabled, probably for a reason. That might be worth fixing too.

ValdikSS commented 1 year ago

I searched the internet and found some information.

Let's start with the issues. REP XSHA1 / REP XSHA256 always finalize the hash, so they are not suitable for the common sha_init / sha_update / sha_update / … / sha_finish call sequence. This one-shot behaviour is what the padlock_sha1_oneshot / padlock_sha256_oneshot functions implement.
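
To illustrate the mismatch, here is a minimal sketch of one way a finalize-only primitive could sit behind an init / update / final interface: buffer everything and hash once at the end. All names are hypothetical, oneshot_digest is a toy checksum standing in for the hardware instruction (it is not SHA), and nothing here claims to be what OpenSSL or any existing patch does.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a finalize-only primitive such as REP XSHA256.
 * NOT a real hash: a toy rolling checksum so the sketch runs anywhere. */
static uint32_t oneshot_digest(const uint8_t *data, size_t len)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc = acc * 31u + data[i];
    return acc;
}

/* Hypothetical streaming context: all it can do is remember the input. */
typedef struct {
    uint8_t *buf;
    size_t len, cap;
} buffered_ctx;

static void buffered_init(buffered_ctx *c)
{
    c->buf = NULL;
    c->len = c->cap = 0;
}

/* Appends n bytes to the buffer. Returns 0 on success, -1 if out of memory. */
static int buffered_update(buffered_ctx *c, const void *data, size_t n)
{
    if (n == 0)
        return 0;
    if (n > c->cap - c->len) {
        size_t ncap = c->cap ? c->cap : 64;
        while (ncap - c->len < n)
            ncap *= 2;
        uint8_t *p = realloc(c->buf, ncap);
        if (p == NULL)
            return -1;
        c->buf = p;
        c->cap = ncap;
    }
    memcpy(c->buf + c->len, data, n);
    c->len += n;
    return 0;
}

/* Only now can the one-shot primitive run, over the whole message. */
static uint32_t buffered_final(buffered_ctx *c)
{
    uint32_t d = oneshot_digest(c->buf, c->len);
    free(c->buf);
    buffered_init(c);
    return d;
}
```

Calling buffered_update twice with "hello " and "world" and then buffered_final gives the same value as oneshot_digest over "hello world". The cost is that memory use grows with the message, which is why a non-finalizing instruction is the better fit for streaming.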

The VIA Nano family added a partial-hashing variant of the SHA instructions that does not finalize. It is selected by loading 0xFFFFFFFF into EAX before executing the instruction, and is implemented in OpenSSL as padlock_sha1_blocks / padlock_sha256_blocks (which, by the way, do not include CPUID checks). I could not find the documentation, but it is implemented in PadlockSDK_3.1_Release_20090121.zip (full source available). My device does not support partial hashing.
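
As a rough idea of what a CPUID check could look like, the sketch below decodes the hash-engine bits from the EDX value of Centaur extended leaf 0xC0000001. The bit positions follow the PHE / PHE_EN names in the Linux kernel's CPU feature list; the macro and function names are hypothetical. It takes EDX as a parameter instead of executing CPUID, so it runs on any machine. Note that it only reports whether the hash engine is present and enabled; it does not detect the partial-hashing extension, and how that is best detected is left open here.

```c
#include <stdint.h>

/* Centaur/VIA extended feature leaf and the EDX bits for the hash engine
 * (PHE = bit 10 "present", PHE_EN = bit 11 "enabled"). */
#define CENTAUR_FEATURE_LEAF 0xC0000001u
#define PADLOCK_PHE_PRESENT  (1u << 10)
#define PADLOCK_PHE_ENABLED  (1u << 11)

/* edx: the EDX value returned by CPUID leaf CENTAUR_FEATURE_LEAF.
 * Returns 1 only if the hash engine is both present and enabled. */
static int padlock_phe_usable(uint32_t edx)
{
    const uint32_t need = PADLOCK_PHE_PRESENT | PADLOCK_PHE_ENABLED;
    return (edx & need) == need;
}
```

A caller would obtain EDX from CPUID on the target CPU and fall back to the software SHA code whenever padlock_phe_usable returns 0.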

However, it's possible to do partial hashing on unsupported hardware with a page fault hack, proposed by Andy Polyakov and implemented by Michal Ludvig, but that would be difficult to implement in a multi-threaded environment and IMO not worth it.

> PHE saves its current state into a memory on every process switch and as well on any page fault that occurs during the run. This state includes number of bytes hashed and an intermediate result that could be used as an initial value for subsequent rounds. So far so good. The only remaining question is how to trigger a context switch or a page fault at the place we need. Solution: mmap(2) two or more pages, mprotect(2) the last one to deny all access (PROT_NONE). This creates an inaccessible piece of memory exactly at the place we need. Now we put all our input data just before this barrier and engage PHE. However we'll tell it to hash slightly more data than we put into the buffer. With these instructions PHE will crunch all our input and attempt to hash some more.
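
The memory layout described in that quote can be sketched as follows, assuming a POSIX system with anonymous mmap. Only the buffer setup is shown: the PadLock instruction itself and the fault handling are left out, and the guarded_buf type and function names are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
    uint8_t *base;    /* start of the mapping */
    size_t   map_len; /* total mapped bytes */
    uint8_t *data;    /* input, ending exactly at the barrier */
    uint8_t *barrier; /* first byte of the PROT_NONE page */
} guarded_buf;

/* Maps enough pages for the input plus one trailing guard page, then
 * copies the input so that it ends right before the guard page.
 * Returns 0 on success, -1 on failure. */
static int guarded_buf_init(guarded_buf *g, const void *input, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return -1;
    size_t page = (size_t)ps;
    size_t data_pages = len ? (len + page - 1) / page : 1;
    size_t map_len = (data_pages + 1) * page;

    uint8_t *base = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Deny all access to the last page: any read past the input faults. */
    uint8_t *barrier = base + data_pages * page;
    if (mprotect(barrier, page, PROT_NONE) != 0) {
        munmap(base, map_len);
        return -1;
    }

    g->base = base;
    g->map_len = map_len;
    g->barrier = barrier;
    g->data = barrier - len;
    if (len)
        memcpy(g->data, input, len);
    return 0;
}

static void guarded_buf_free(guarded_buf *g)
{
    munmap(g->base, g->map_len);
}
```

After guarded_buf_init, g.data + len equals g.barrier, so a reader that is told to consume more than len bytes hits the inaccessible page exactly at the end of the input. Doing this per hash context, with a fault handler shared across threads, is where the multi-threading difficulty mentioned above comes in.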

Now, about SHA512. I could not find any information about it and it is not present in the SDK. Wikipedia X86_instruction_listings article even mentions it in the list of undocumented instructions:

> Supported by OpenSSL as part of its VIA PadLock support, but not documented by the VIA PadLock Programming Guide

Now the patches. There were several patches in the mailing list for 1.0.x and 0.9.8 which haven't been merged:

ValdikSS commented 11 months ago

For the record, Zhaoxin maintains their own fork of OpenSSL.

sitb-urs commented 10 months ago

@bernd-edlinger Concerning the hardware: yes, it is very, very old, but there are these iBase stand-alone firewall appliances that use it as well (and some ITX mainboards), and making them even more performant would save us from a little bit of trash. I will be happy to provide some hardware.
