Open magnumripper opened 6 years ago
EncFS and PGPdisk fixed (currently bot-checking). Not sure why the former didn't get it into #3216
Before (single host thread):
Device 6: GeForce GTX TITAN X
Benchmarking: EncFS-opencl [PBKDF2-SHA1 OpenCL AES]... DONE
Speed for cost 1 (iteration count) of 181474 and 181317
Raw: 6501 c/s real, 6501 c/s virtual, GPU util: 62%
Device 6: GeForce GTX TITAN X
Benchmarking: pgpdisk-opencl [SHA1 OpenCL]... DONE
Raw: 87771 c/s real, 87771 c/s virtual, GPU util: 95%
After:
Device 6: GeForce GTX TITAN X
Benchmarking: EncFS-opencl [PBKDF2-SHA1 AES OpenCL]... DONE
Speed for cost 1 (iteration count) of 181474 and 181317
Raw: 6597 c/s real, 6593 c/s virtual, GPU util: 99%
Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES) of 5
Raw: 53685 c/s real, 53685 c/s virtual, GPU util: 99%
This is a significant performance regression for pgpdisk even though we're only benchmarking AES here. Not sure why, I'll look into it.
I split the huge kernel into one per cipher for better and leaner builds. Also, in this case the nvidia liked the non-bitsliced AES better. Speed per cipher type, before all this (single-thread host):
Device 6: GeForce GTX TITAN X
Benchmarking: pgpdisk-opencl [SHA1 OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 3
Raw: 199804 c/s real, 199804 c/s virtual, GPU util: 100%
Benchmarking: pgpdisk-opencl [SHA1 OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 4
Raw: 70892 c/s real, 70892 c/s virtual, GPU util: 77%
Benchmarking: pgpdisk-opencl [SHA1 OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 5
Raw: 87771 c/s real, 86994 c/s virtual, GPU util: 95%
Now (currently in PR for bot check)
Device 6: GeForce GTX TITAN X
Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 3
Raw: 211862 c/s real, 206521 c/s virtual, GPU util: 98%
Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 4
Raw: 119300 c/s real, 119300 c/s virtual, GPU util: 99%
Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 5
Raw: 90187 c/s real, 90187 c/s virtual, GPU util: 99%
Lesson learned. Note how poor we were doing with a single kernel for all three cipher types:
Device 6: GeForce GTX TITAN X
Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 3
Raw: 131657 c/s real, 130492 c/s virtual, GPU util: 99%
Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 4
Raw: 53169 c/s real, 53169 c/s virtual, GPU util: 99%
Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 5
Raw: 53685 c/s real, 53685 c/s virtual, GPU util: 99%
After the separation (no other changes!), CAST and Twofish nearly doubled. AES speed was still poor (same!) until I switched to non-bitsliced, then it too nearly doubled.
Note to self: I tried reverting the host-code to just use a single kernel 'pgpdisk' and add the below "master kernel" which in turn calls the separate kernels (assuming each cipher would stay a separate kernel-function as opposed to everything inlined in a single real kernel):
__kernel void pgpdisk(__global const pgpdisk_password *inbuffer,
__global pgpdisk_hash *outbuffer,
__constant pgpdisk_salt *salt)
{
switch (salt->algorithm) {
case 3:
pgpdisk_cast(inbuffer, outbuffer, salt);
break;
case 4:
pgpdisk_twofish(inbuffer, outbuffer, salt);
break;
default:
pgpdisk_aes(inbuffer, outbuffer, salt);
}
}
On nvidia, this was just as bad as having all functions inlined in a single kernel. Apparently the solution in 2f0fb89ee is superior. We may have a few more kernels that would benefit from this "trick".
On nvidia, this was just as bad as having all functions inlined in a single kernel.
In fact, this makes sense.
Have you tried to use compiler macros? Think of -D AES
, -D CAST
, and -D TWO
. In this case, the compiler can remove the unused code (shrink the binary). Otherwise, it can't.
PGPSDA and Ethereum fixed now as well. The latter (with Keccak moved to GPU) actually got a teensy bit slower (for now) but I prefer that over having OpenCL formats that clobber the CPUs with trivial work. PGPSDA (with CAST moved to GPU) got a very slight boost.
I'll review all formats before I close this issue. I believe there's more of them.
I'll review all formats before I close this issue. I believe there's more of them.
Added RAR5 to the list. When that's fixed, this task is complete.
From here on, I wont accept any new OpenCL format PR if it has any CPU-side post-processing - unless it's really needed like the handfull of existing ones mentioned in OP.
We may have a few more kernels that would benefit from this "trick".
ODF kernel might benefit, for one. It has BF and AES in a single kernel. PEM might be another one, with 3DES and AES. And Keepass has AES and TwoFish. DMG seems to do just fine but it does have AES and 3DES in same kernel. KeePass has TwoFish and ChaCha.
Since we generally use OpenMP in the host-side code of OpenCL formats running significant pre-/post-processing on host, I think this gives us a current exhaustive list of those formats:
$ fgrep -l _OPENMP opencl*.c | fgrep -v DES_bs
opencl_7z_fmt_plug.c
opencl_dashlane_fmt_plug.c
opencl_diskcryptor_fmt_plug.c
opencl_electrum_modern_fmt_plug.c
opencl_gpg_fmt_plug.c
opencl_rar_fmt_plug.c
opencl_tezos_fmt_plug.c
Edit: only Tezos and DiskCryptor are added compared to magum's list of "few formats will likely be impossible to get fully GPU-side". I think these two are exotic enough that we're OK leaving them as-is for now.
DiskCryptor [AES/TwoFish/Serpent XTS] is trivial if anyone wants it.
Will run 4 OpenMP threads
Device 4: GeForce GTX 1080
Benchmarking: diskcryptor-opencl, DiskCryptor [PBKDF2-SHA512 OpenCL]... (4xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 144671 c/s real, 63442 c/s virtual, GPU util: 43%
Very poor GPU utilization so probably worthwhile.
Tezos [ed25519_publickey, blake2b], not likely feasible.
Will run 8 OpenMP threads
Device 4: GeForce GTX 1080
Benchmarking: tezos-opencl, Tezos Key [PBKDF2-SHA512 OpenCL]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 2048
Raw: 90574 c/s real, 31148 c/s virtual, GPU util: 32%
I don't even get a figure with 4 threads (8 seconds).
IIRC, DiskCryptor is ransomware, so will probably be out of fashion or incompatible with versions in the wild or will use uncrackable passwords very soon if not already. We may consider dropping it after having kept it in 1.9.0-jumbo-1. I doubt it's worth putting time into unless we see more demand from current victims.
DiskCryptor is a 100% legit software (https://diskcryptor.net/wiki/Main_Page). It seems that it was even popular in the past (?).
My notes on this topic: In late 2018, a ransomware campaign which used DiskCryptor version 1.1.846.118 with following settings was discovered.
Note: The malicious actors used passwords that were numeric and varied in length from 8 digits to 11 digits.
Note 2: The malicious actors brought down a "security" company in UK
via a ransomware attack (powered by DiskCryptor). The initial entry point was an unsecured Windows RDP server.
See also #3216
These formats (and possibly a few more) should be reviewed for moving post-processing to GPU (and drop OMP) even though their GPU utilization is not too bad even running single-thread host:
For reference, these few formats will likely be impossible to get fully GPU-side (I will not even try):