openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

LM/DEScrypt OpenCL performance #1908

Open · magnumripper opened this issue 8 years ago

magnumripper commented 8 years ago

We should try to match the performance seen with Meriken's Tripcode Engine, and now oclHashcat, for our LM and DEScrypt implementations: something like 20 Gc/s for LM and 800 Mc/s for DEScrypt on a Titan X.

The S-boxes for LOP3.LUT are already implemented. The bitselect ones might need a review too though.
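For anyone comparing the two, the difference boils down to how the 3-input multiplexer at the heart of the S-box gates is expressed. A minimal sketch of both variants (the `vsel` name and `HAVE_LOP3` guard are illustrative, not from the JtR tree):

```c
/*
 * Illustration only; the vsel name and HAVE_LOP3 guard are hypothetical,
 * not taken from the JtR sources.  Bitslice DES S-boxes are built from
 * 3-input gates, and the key primitive is the bitwise multiplexer
 * mux(a, b, c) = a ? b : c, evaluated across all bits of a vector.
 */
#ifdef HAVE_LOP3 /* host would define this when targeting NVIDIA Maxwell+ */
/* LOP3.LUT evaluates any 3-input boolean function in one instruction;
 * 0xCA is the truth table of a ? b : c. */
#define vsel(d, a, b, c) \
	asm("lop3.b32 %0, %1, %2, %3, 0xCA;" \
	    : "=r"(d) : "r"(a), "r"(b), "r"(c))
#else
/* Portable OpenCL: bitselect(x, y, m) takes bits of y where m is 1 and
 * bits of x where m is 0, hence a ? b : c == bitselect(c, b, a). */
#define vsel(d, a, b, c) (d) = bitselect((c), (b), (a))
#endif
```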

Some hints from "DeepLearningJohnDoe":

http://www.openwall.com/lists/john-users/2015/10/10/3
https://github.com/DeepLearningJohnDoe/merikens-tripcode-engine
https://github.com/DeepLearningJohnDoe/SLUT
https://github.com/DeepLearningJohnDoe/SBOXDiscovery

Hint from Atom: there's just one thing in NV that's important. One needs to call s1-s8 inside a loop (i=0;i<2;i++) with #pragma unroll 1.

magnumripper commented 8 years ago

Oh BTW, oclHashcat doesn't use a single goto in any of its kernels, including descrypt. Other than checking that (with grep), I haven't looked at the descrypt kernel differences at all. But it's fair game to do so now.

magnumripper commented 8 years ago

FWIW, our current lm-opencl is almost twice as fast on Titan X (and a little faster on Intel CPU) if opencl_lm_b_plug.c is changed so it doesn't pass -D WORK_GROUP_SIZE to the kernel build.
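For context, the usual effect of such a define is to pin the kernel to a fixed local work size, which constrains the compiler. A sketch of the general pattern (not the actual JtR kernel, and the `lm_bs` name is made up):

```c
/* Sketch of the general pattern, not the actual JtR kernel code.
 * With -D WORK_GROUP_SIZE=N the kernel is pinned to that local size,
 * which lets the compiler specialize for it but can also constrain
 * register allocation and occupancy; without the define, the runtime
 * and compiler are free to pick whatever local size works best. */
#ifdef WORK_GROUP_SIZE
__attribute__((reqd_work_group_size(WORK_GROUP_SIZE, 1, 1)))
#endif
__kernel void lm_bs(__global uint *keys, __global uint *result)
{
	/* ... bitslice LM rounds ... */
}
```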

magnumripper commented 8 years ago

Atom has now managed to get another 8-14% boost for AMD in https://github.com/hashcat/oclHashcat/commit/245301c9b444616aa083ef8fda83decbc690f4c1#commitcomment-15457781 - now 12 Gc/s for LM and 405 Mc/s for descrypt.

meriken commented 8 years ago

Meriken's Tripcode Engine recently got significant boosts with new kernels written in GCN assembly with CLRadeonExtender.

AMD Radeon HD 7990: 1022 MH/s (descrypt; 1250 mV, +20%, 1180 MHz)
NVIDIA GeForce GTX 980 Ti: 996 MH/s (descrypt; 110%, +250 MHz)
AMD Radeon R9 290X: 647 MH/s (descrypt; +100 mV, +50%, 1074 MHz)

I highly recommend CLRadeonExtender as it is cross-platform and very reliable.

magnumripper commented 8 years ago

Thanks @meriken for the heads-up. I presume the NVIDIA speedup came from other improvements?

Now if we could find @Sayantan2048...

meriken commented 8 years ago

The NVIDIA speedup is from @DeepLearningJohnDoe's work. I incorporated her Maxwell implementation with 4096 OpenCL kernels. It's pretty fast, but the executable file is huge at 924MB. I am trying to come up with a better way to manage these CUDA kernels.

It's just too bad that @Sayantan2048 is not actively involved in development now. I just took a quick look at the OpenCL kernels of JtR, and I already see a lot of room for improvement. For example, the expansion function for salts can be "embedded" into the kernel for a speedup and a reduced VGPR count if you treat the expansion function as a bunch of constant values and rebuild kernels periodically with different salts. I am pretty busy right now, but I may have time to play with the JtR code base. If I come up with something useful, I will let you know.
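A minimal host-side sketch of this approach, assuming a hypothetical `build_salt_kernel()` wrapper and a program cache keyed by the 12-bit descrypt salt (none of these names are from the JtR tree):

```c
/* Hypothetical sketch of per-salt kernel specialization: the salt-
 * dependent expansion becomes a compile-time constant, so the kernel
 * body needs no salt lookups and fewer registers (VGPRs). */
#include <stdio.h>
#include <CL/cl.h>

#define NUM_SALTS 4096  /* descrypt has a 12-bit salt */

static cl_program salt_program[NUM_SALTS]; /* built on demand */

static cl_program build_salt_kernel(cl_context ctx, cl_device_id dev,
                                    const char *src, unsigned int salt)
{
	if (!salt_program[salt]) {
		char opts[64];
		cl_program prog;

		/* Bake the salt into the kernel as a constant; the
		 * compiler folds all salt-dependent selects away. */
		snprintf(opts, sizeof(opts), "-D SALT=%uU", salt);
		prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
		clBuildProgram(prog, 1, &dev, opts, NULL, NULL);
		salt_program[salt] = prog;
	}
	return salt_program[salt];
}
```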

solardiz commented 4 years ago

(The move of this repo to Openwall org brought my attention to this issue, as it somehow got automatically unassigned from Sayantan, who isn't a member of the org currently, and thus the issue got "recently modified".)

> the expansion function for salts can be "embedded" into the kernel for a speedup and a reduced VGPR count if you treat the expansion function as a bunch of constant values and rebuild kernels periodically with different salts.

We already build up to 4096 per-salt kernels (on demand, as salts are seen). Ideally, we'd also support a mode where we optionally don't do that, or where the per-salt kernel builds are postponed further, for faster startup.

We'd appreciate it if you, @meriken, took over maintenance of the OpenCL bitslice DES code in JtR, or even rewrote it in a cleaner and faster fashion while also making it more convenient to use (the mandatory per-salt kernels are a usability drawback for quick runs after a new install or on a new GPU).

Edit: see also #1919 #2666

solardiz commented 3 months ago

> Hint from Atom: there's just one thing in NV that's important. One needs to call s1-s8 inside a loop (i=0;i<2;i++) with #pragma unroll 1.

Looking at our code now, we have two kinds of kernels: with DES fully unrolled, and with 2 rounds unrolled (which is quite natural, as it allows fixed indices to be used). We do not have a rolled version. We should probably implement that.
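For illustration, here's a structural sketch of such a rolled loop along the lines of Atom's hint. The `KS()` key-schedule accessor, `B[]` bitslice state, and `idx[][]` tables are hypothetical stand-ins, not names from our kernels:

```c
/* Sketch only, not working JtR code.  The eight S-box calls appear
 * once, inside a two-iteration loop that "#pragma unroll 1" keeps
 * rolled, instead of being instantiated 16 times.  This trades the
 * compile-time-constant indices of the unrolled kernels for small
 * lookup tables, in exchange for a much smaller kernel with less
 * register and instruction-cache pressure. */
for (int r = 0; r < 16; r += 2) {
	#pragma unroll 1
	for (int i = 0; i < 2; i++) {
		/* Even i works on one half of the block, odd i on the
		 * other; idx[][] supplies the operand positions. */
		s1(KS(r + i, 0), B, idx[i][0]);
		s2(KS(r + i, 1), B, idx[i][1]);
		s3(KS(r + i, 2), B, idx[i][2]);
		s4(KS(r + i, 3), B, idx[i][3]);
		s5(KS(r + i, 4), B, idx[i][4]);
		s6(KS(r + i, 5), B, idx[i][5]);
		s7(KS(r + i, 6), B, idx[i][6]);
		s8(KS(r + i, 7), B, idx[i][7]);
	}
}
```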