Open ValdikSS opened 1 year ago
IMO, the padlock code is all C and assembly and doesn't need any external binary. It could be embedded into sha-586.pl / sha-x86_64.pl / aes-586.pl and aes-x86_64.pl.
Yeah, this is basically an RFE. And IMO the proper way would be to use the padlock routines in the existing assembly code based on the CPU detection and not require an engine. Also if there is a reason why these cannot be integrated but must be loadable separately, they should be implemented in a provider and not an engine because we aren't likely to take any RFEs in engines. They are deprecated.
I'm still wondering whether the padlock hardware is still manufactured... We have big maintenance issues with hardware that is not widely used: what happens when we want to fix anything in this code but cannot test it ourselves? In a way the padlock engine is a good example, as it was obviously broken all the time, and never used for years, before you found some major issues in the implementation of that engine.
As https://github.com/openssl/openssl/pull/5145 mentioned, Zhaoxin is probably still manufacturing hardware that supports padlock.
@bernd-edlinger, as @zzl360 mentioned, Zhaoxin CPUs, which are pretty recent and actively developed as far as I can tell, support Padlock, including SM3/SM4 algorithms which were proposed in #8706.
I'm re-purposing old thin clients based on the VIA Eden Esther ULV. The CPU is slow and manages only 7-8 MB/s of unaccelerated aes-128-gcm, for example, while having a gigabit NIC, which is why hardware acceleration is very useful on this device. If anyone needs access to this device, I can set up SSH and provide it indefinitely, or I can donate several devices to you.
it was obviously broken all the time, and never used for years
Could it be that the offending code with ifndef AES_ASM just wasn't activated by default until recently and everything eventually worked correctly? I can hardly believe such a bug hasn't been spotted by Zhaoxin developers.
The current engine also does not support AES-GCM, which is one of the most widespread ciphers on the internet. This could easily be hacked up by reusing the SPARC T4 code. At least for this mode there is simply no implementation, whereas for SHA there is one and it's just not used.
@ValdikSS I'm not familiar with engines in OpenSSL. Is it the case that you must manually enable the padlock engine to get hardware acceleration on a VIA machine? Most users don't know that. The approach suggested in https://github.com/openssl/openssl/issues/20167#issuecomment-1408253075 would enable hardware acceleration by default on VIA machines.
@zzl360, yes, the engine needs to be activated either manually or in the configuration file. Right now it works pretty well overall, but without SHA or AES-GCM support.
Could it be that the offending code with ifndef AES_ASM just wasn't activated by default until recently and everything eventually worked correctly? I can hardly believe such a bug hasn't been spotted by Zhaoxin developers.
That is a combination of several bugs. Most importantly, starting with 1.1.1 the AES_ASM macro is NEVER defined when the engines are compiled, only when the AES C files are compiled. That broke the engine first. Then we have 87bea6550ae0dda7c40937cff2e86cc2b0b09491, which was merged to 1.1.1 but not to master. That almost fixed the engine, except for the issue in padlock_key_bswap, modulo the AES_CONST_TIME issue.
but without SHA or AES-GCM support.
And notably the RNG support is disabled as well, probably for a reason; that might be worth fixing too.
I searched the internet and found some information.
Let's start with the issues. REP XSHA1 / REP XSHA256 always finalize SHA hashing and are not suitable for the common sha_init / sha_update / sha_update / … / sha_finish call sequence. This is what is implemented in the padlock_sha1_oneshot / padlock_sha256_oneshot functions.
The VIA Nano family added a partial class of SHA functions which do not perform finalization. This is done by supplying 0xFFFFFFFF in EAX before executing the instruction, and is implemented in padlock_sha1_blocks / padlock_sha256_blocks in OpenSSL (which doesn't include CPUID checks, by the way). I could not find the documentation, but it is implemented in PadlockSDK_3.1_Release_20090121.zip (full source available). My device does not support partial hashing.
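To make the missing CPUID checks concrete, here is a minimal sketch of decoding the PadLock feature word. It assumes the EDX value was already read from Centaur extended CPUID leaf 0xC0000001 on a CPU with a suitable vendor string, and that each unit is reported as a "present" bit followed by an "enabled" bit (the layout the Linux kernel feature flags use). The struct and function names are hypothetical, and the sketch does not detect Nano partial hashing, since I don't know of a documented bit for it.

```c
#include <stdint.h>

/* Assumed layout of EDX from CPUID leaf 0xC0000001: each unit has a
 * "present" bit followed by an "enabled" bit; both must be set. */
#define PADLOCK_RNG_MASK  (0x3u << 2)   /* random number generator */
#define PADLOCK_ACE_MASK  (0x3u << 6)   /* AES engine */
#define PADLOCK_ACE2_MASK (0x3u << 8)   /* AES engine, version 2 */
#define PADLOCK_PHE_MASK  (0x3u << 10)  /* hash engine (SHA1/SHA256) */
#define PADLOCK_PMM_MASK  (0x3u << 12)  /* Montgomery multiplier */

/* Hypothetical capability summary; 1 = usable, 0 = not usable. */
struct padlock_caps {
    int rng, ace, ace2, phe, pmm;
};

/* A unit counts as usable only if it is both present and enabled. */
static int padlock_unit_usable(uint32_t edx, uint32_t mask)
{
    return (edx & mask) == mask;
}

static struct padlock_caps padlock_decode_caps(uint32_t edx)
{
    struct padlock_caps c;
    c.rng  = padlock_unit_usable(edx, PADLOCK_RNG_MASK);
    c.ace  = padlock_unit_usable(edx, PADLOCK_ACE_MASK);
    c.ace2 = padlock_unit_usable(edx, PADLOCK_ACE2_MASK);
    c.phe  = padlock_unit_usable(edx, PADLOCK_PHE_MASK);
    c.pmm  = padlock_unit_usable(edx, PADLOCK_PMM_MASK);
    return c;
}
```

A caller would gate any SHA path on `padlock_decode_caps(edx).phe` before touching the hash instructions, and would still need a separate family/model test to decide between the oneshot and blocks variants.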
However, it's possible to do partial hashing on unsupported hardware with a page fault hack, proposed by Andy Polyakov and implemented by Michal Ludvig, but that would be difficult to implement in a multi-threaded environment and IMO not worth it.
PHE saves its current state into a memory on every process switch and as well on any page fault that occurs during the run. This state includes number of bytes hashed and an intermediate result that could be used as an initial value for subsequent rounds. So far so good. The only remaining question is how to trigger a context switch or a page fault at the place we need. Solution: mmap(2) two or more pages, mprotect(2) the last one to deny all access (PROT_NONE). This creates an inaccessible piece of memory exactly at the place we need. Now we put all our input data just before this barrier and engage PHE. However we'll tell it to hash slightly more data than we put into the buffer. With these instructions PHE will crunch all our input and attempt to hash some more.
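The buffer layout described in that quote can be sketched as follows. This only covers the memory setup (mmap, a trailing PROT_NONE guard page, and input placed right before it); the actual PHE invocation with an oversized byte count and the fault handling are left out. The names guarded_buf, guarded_buf_init and guarded_buf_free are hypothetical and not taken from any existing patch.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper type: input data that ends exactly at a guard page. */
struct guarded_buf {
    unsigned char *base;   /* start of the whole mapping */
    size_t map_len;        /* mapping length, guard page included */
    unsigned char *data;   /* start of the input; ends at the guard page */
    size_t data_len;       /* number of input bytes */
};

/* Map enough pages for len bytes plus one inaccessible page, then copy
 * the input so that its last byte sits right before the guard page.
 * Returns 0 on success, -1 on failure. */
static int guarded_buf_init(struct guarded_buf *gb, const void *in, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return -1;
    size_t page = (size_t)ps;
    size_t data_pages = (len + page - 1) / page;
    if (data_pages == 0)
        data_pages = 1;
    size_t map_len = (data_pages + 1) * page;

    unsigned char *base = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Deny all access to the last page: reading past the input faults. */
    unsigned char *guard = base + data_pages * page;
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(base, map_len);
        return -1;
    }

    gb->base = base;
    gb->map_len = map_len;
    gb->data = guard - len;
    gb->data_len = len;
    memcpy(gb->data, in, len);
    return 0;
}

static void guarded_buf_free(struct guarded_buf *gb)
{
    munmap(gb->base, gb->map_len);
}
```

In a real implementation each thread would need its own mapping and a way to recover from the fault, which is where the multi-threading difficulty comes from.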
Now, about SHA512. I could not find any information about it, and it is not present in the SDK. The Wikipedia X86_instruction_listings article even mentions it in its list of undocumented instructions:
Supported by OpenSSL as part of its VIA PadLock support, but not documented by the VIA PadLock Programming Guide
Now the patches. There were several patches in the mailing list for 1.0.x and 0.9.8 which haven't been merged:
For the record, Zhaoxin maintains their own fork of OpenSSL.
@bernd-edlinger Concerning the hardware: yes, it is very, very old, but there are these iBase stand-alone firewall appliances that use them as well (and some ITX mainboards), and making them even more performant would save us from a little bit of trash. I will be happy to provide some hardware.
It seems that the SHA1 and SHA256 VIA Padlock functions were implemented in assembly and declared as C functions, but never exposed by the engine. That's why these functions are never hardware-accelerated and the assembly implementation is dead code.
https://github.com/openssl/openssl/blob/38fc02a7084438e384e152effa84d4bf085783c9/engines/e_padlock.c#L221-L224
Moreover, the assembly contains an implementation of sha512, which is not even mentioned in the .c file.
https://github.com/openssl/openssl/blob/38fc02a7084438e384e152effa84d4bf085783c9/engines/asm/e_padlock-x86.pl#L579
As seen with current master, both openssl speed -evp sha256 and openssl speed -evp sha256 -engine padlock produce the same unimpressive result: