magnumripper commented 6 years ago

See also #3216

These formats (and possibly a few more) should be reviewed for moving post-processing to GPU (and drop OMP) even though their GPU utilization is not too bad even running single-thread host:

[x] Encfs [AES-CFB]
[x] PGPdisk [AES/TwoFish/CAST ECB]
[x] PGPSDA [CAST-ECB]
[x] Ethereum [Keccak]. This one doesn't even use (or need) OpenMP for the little post-processing because the GPU part is so very slow. Should be fixed anyway for good measure.
[x] RAR5 has some ridiculous CPU post-processing, super trivial to get GPU-side.
[ ] DiskCryptor [AES/TwoFish/Serpent XTS], should be trivial.

For reference, these few formats will likely be impossible to get fully GPU-side (I will not even try):

RAR (actually RAR3 non-hp) [AES-CBC, unrar]. There's no way anyone can implement unrar on GPU. However, GPU util is fine even running single-thread host due to excellent early rejection.
7z [LzmaDecode]. Fairly good early-rejection and GPU utilization even on single-thread host.
Electrum-modern [BIGNUM, secp256k1 and zlib inflate]. GPU util is ~0% even with 32 threads OMP.
GPG [IDEA, CAST5, Camellia and BIGNUM]. GPU util is poor even with 32 threads OMP.
Dashlane [AES-CBC, zlib inflate]. GPU util is 73% running one CPU thread and Titan X.
Tezos [ed25519_publickey, blake2b]. GPU util is 32% with 8xOMP.

magnumripper commented 6 years ago

EncFS and PGPdisk fixed (currently bot-checking). Not sure why the former didn't get it into #3216

Before (single host thread):

Device 6: GeForce GTX TITAN X
Benchmarking: EncFS-opencl [PBKDF2-SHA1 OpenCL AES]... DONE
Speed for cost 1 (iteration count) of 181474 and 181317
Raw:    6501 c/s real, 6501 c/s virtual, GPU util: 62%

Device 6: GeForce GTX TITAN X
Benchmarking: pgpdisk-opencl [SHA1 OpenCL]... DONE
Raw:    87771 c/s real, 87771 c/s virtual, GPU util: 95%

After:

Device 6: GeForce GTX TITAN X
Benchmarking: EncFS-opencl [PBKDF2-SHA1 AES OpenCL]... DONE
Speed for cost 1 (iteration count) of 181474 and 181317
Raw:    6597 c/s real, 6593 c/s virtual, GPU util: 99%

Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES) of 5
Raw:    53685 c/s real, 53685 c/s virtual, GPU util: 99%

This is a significant performance regression for pgpdisk even though we're only benchmarking AES here. Not sure why, I'll look into it.

magnumripper commented 6 years ago

I split the huge kernel into one per cipher for better and leaner builds. Also, in this case the nvidia liked the non-bitsliced AES better. Speed per cipher type, before all this (single-thread host):

Device 6: GeForce GTX TITAN X
Benchmarking: pgpdisk-opencl [SHA1 OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 3
Raw:    199804 c/s real, 199804 c/s virtual, GPU util: 100%

Benchmarking: pgpdisk-opencl [SHA1 OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 4
Raw:    70892 c/s real, 70892 c/s virtual, GPU util: 77%

Benchmarking: pgpdisk-opencl [SHA1 OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 5
Raw:    87771 c/s real, 86994 c/s virtual, GPU util: 95%

Now (currently in PR for bot check)

Device 6: GeForce GTX TITAN X
Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 3
Raw:    211862 c/s real, 206521 c/s virtual, GPU util: 98%

Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 4
Raw:    119300 c/s real, 119300 c/s virtual, GPU util: 99%

Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 5
Raw:    90187 c/s real, 90187 c/s virtual, GPU util: 99%

magnumripper commented 6 years ago

Lesson learned. Note how poor we were doing with a single kernel for all three cipher types:

Device 6: GeForce GTX TITAN X
Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 3
Raw:    131657 c/s real, 130492 c/s virtual, GPU util: 99%

Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 4
Raw:    53169 c/s real, 53169 c/s virtual, GPU util: 99%

Benchmarking: pgpdisk-opencl [SHA1 AES/TwoFish/CAST OpenCL]... DONE
Speed for cost 1 (iteration count) of 16000, cost 2 (algorithm [3=CAST, 4=TwoFish, 5/6/7=AES]) of 5
Raw:    53685 c/s real, 53685 c/s virtual, GPU util: 99%

After the separation (no other changes!), CAST and Twofish nearly doubled. AES speed was still poor (same!) until I switched to non-bitsliced, then it too nearly doubled.

magnumripper commented 6 years ago

Note to self: I tried reverting the host-code to just use a single kernel 'pgpdisk' and add the below "master kernel" which in turn calls the separate kernels (assuming each cipher would stay a separate kernel-function as opposed to everything inlined in a single real kernel):

__kernel void pgpdisk(__global const pgpdisk_password *inbuffer,
                      __global pgpdisk_hash *outbuffer,
                      __constant pgpdisk_salt *salt)
{
        switch (salt->algorithm) {
        case 3:
                pgpdisk_cast(inbuffer, outbuffer, salt);
                break;
        case 4:
                pgpdisk_twofish(inbuffer, outbuffer, salt);
                break;
        default:
                pgpdisk_aes(inbuffer, outbuffer, salt);
        }
}

On nvidia, this was just as bad as having all functions inlined in a single kernel. Apparently the solution in 2f0fb89ee is superior. We may have a few more kernels that would benefit from this "trick".

claudioandre-br commented 6 years ago

On nvidia, this was just as bad as having all functions inlined in a single kernel.

In fact, this makes sense.

Have you tried to use compiler macros? Think of -D AES, -D CAST, and -D TWO. In this case, the compiler can remove the unused code (shrink the binary). Otherwise, it can't.

magnumripper commented 6 years ago

PGPSDA and Ethereum fixed now as well. The latter (with Keccak moved to GPU) actually got a teensy bit slower (for now) but I prefer that over having OpenCL formats that clobber the CPUs with trivial work. PGPSDA (with CAST moved to GPU) got a very slight boost.

I'll review all formats before I close this issue. I believe there's more of them.

magnumripper commented 6 years ago

I'll review all formats before I close this issue. I believe there's more of them.

Added RAR5 to the list. When that's fixed, this task is complete.

From here on, I wont accept any new OpenCL format PR if it has any CPU-side post-processing - unless it's really needed like the handfull of existing ones mentioned in OP.

magnumripper commented 6 years ago

We may have a few more kernels that would benefit from this "trick".

ODF kernel might benefit, for one. It has BF and AES in a single kernel. PEM might be another one, with 3DES and AES. And Keepass has AES and TwoFish. DMG seems to do just fine but it does have AES and 3DES in same kernel. KeePass has TwoFish and ChaCha.

solardiz commented 5 years ago

Since we generally use OpenMP in the host-side code of OpenCL formats running significant pre-/post-processing on host, I think this gives us a current exhaustive list of those formats:

$ fgrep -l _OPENMP opencl*.c | fgrep -v DES_bs
opencl_7z_fmt_plug.c
opencl_dashlane_fmt_plug.c
opencl_diskcryptor_fmt_plug.c
opencl_electrum_modern_fmt_plug.c
opencl_gpg_fmt_plug.c
opencl_rar_fmt_plug.c
opencl_tezos_fmt_plug.c

Edit: only Tezos and DiskCryptor are added compared to magum's list of "few formats will likely be impossible to get fully GPU-side". I think these two are exotic enough that we're OK leaving them as-is for now.

magnumripper commented 5 years ago

DiskCryptor [AES/TwoFish/Serpent XTS] is trivial if anyone wants it.

Will run 4 OpenMP threads
Device 4: GeForce GTX 1080
Benchmarking: diskcryptor-opencl, DiskCryptor [PBKDF2-SHA512 OpenCL]... (4xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw:    144671 c/s real, 63442 c/s virtual, GPU util: 43%

Very poor GPU utilization so probably worthwhile.

Tezos [ed25519_publickey, blake2b], not likely feasible.

Will run 8 OpenMP threads
Device 4: GeForce GTX 1080
Benchmarking: tezos-opencl, Tezos Key [PBKDF2-SHA512 OpenCL]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 2048
Raw:    90574 c/s real, 31148 c/s virtual, GPU util: 32%

I don't even get a figure with 4 threads (8 seconds).

solardiz commented 5 years ago

IIRC, DiskCryptor is ransomware, so will probably be out of fashion or incompatible with versions in the wild or will use uncrackable passwords very soon if not already. We may consider dropping it after having kept it in 1.9.0-jumbo-1. I doubt it's worth putting time into unless we see more demand from current victims.

solardiz commented 5 years ago

3762 suggests something called DiskCryptor isn't ransomware. Are these two different things of the same name, or does the ransomware actually reuse a pre-existing app that has legitimate uses? I'd appreciate it if someone in the know clarifies this, @ivan-freed.

kholia commented 5 years ago

DiskCryptor is a 100% legit software (https://diskcryptor.net/wiki/Main_Page). It seems that it was even popular in the past (?).

My notes on this topic: In late 2018, a ransomware campaign which used DiskCryptor version 1.1.846.118 with following settings was discovered.

Cipher - AES
Encryption mode - XTS
Pkcs5.2 pr - HMAC-SHA-512

Note: The malicious actors used passwords that were numeric and varied in length from 8 digits to 11 digits.

Note 2: The malicious actors brought down a "security" company in UK via a ransomware attack (powered by DiskCryptor). The initial entry point was an unsecured Windows RDP server.

openwall / john

OpenCL formats running post-process on CPU #3242

3762 suggests something called DiskCryptor isn't ransomware. Are these two different things of the same name, or does the ransomware actually reuse a pre-existing app that has legitimate uses? I'd appreciate it if someone in the know clarifies this, @ivan-freed.