Open magnumripper opened 9 years ago
Example: the actual MD4 function would be SIMD_MD4(w, a, b, c, d). Everything else would be preprocessor stuff; at compile time most of it would disappear and there would be NO branches. I'm pretty sure the boost would be noticeable.
One thing that complicates things is that macros can't contain macros. So we can't put pseudo-intrinsics in macros...
Debugging is also very hard with macro usage like this. I am seeing this in parts of dyna. You want to be DAMN sure things are correct before macro-izing them, and you want a fallback that allows debugging.
BTW this is very interesting:
Using our CPU intrinsics format, with AVX2 (all of this is on well):
$ ../run/john -test -form=wpapsk
Benchmarking: wpapsk, WPA/WPA2 PSK [PBKDF2-SHA1 256/256 AVX2 8x]... (8xOMP) DONE
Raw: 13116 c/s real, 1649 c/s virtual
Same hardware, but using our OpenCL code (and an AVX2-aware CPU driver):
$ ../run/john -test -form=wpapsk-opencl -dev=0
Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: wpapsk-opencl, WPA/WPA2 PSK [PBKDF2-SHA1 OpenCL]... DONE
Raw: 13540 c/s real, 1699 c/s virtual
Above, the device asked for scalar code so we served it that; it was then auto-vectorized and actually ended up faster than our intrinsics.
Here's forcing 8x vector source code:
$ ../run/john -test -form=wpapsk-opencl -dev=0 -force-vector=8
Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: wpapsk-opencl, WPA/WPA2 PSK [PBKDF2-SHA1 OpenCL 8x]... DONE
Raw: 13320 c/s real, 1668 c/s virtual
Slightly slower than auto-vectorized in this case, but still faster than our CPU format.
We don't get such good results for OpenCL with all formats yet, but some day we may. Scalar OpenCL is very easy to write, much easier than writing plain CPU code using intrinsics.
> One thing that complicates things is that macros can't contain macros. So we can't put pseudo-intrinsics in macros...
This is not correct. They can, but you'll need to be careful.
How do you define a macro in a macro? This shortfall was why I did dynamic-big-hash.c the way I did, using an external script to do my macro expansions.
You cannot define a macro inside a macro, but you can call a macro from a macro, nested in as many levels as you want.
http://stackoverflow.com/questions/7972785/can-a-c-macro-definition-refer-to-other-macros
A little experiment is now in the cpp-intrinsics topic branch. Specifically 51f3fe6 for now.
Currently only MD4 & MD5 are done, and not completely. What is done: behind the curtain there are now two different functions, one for a single/first block and another for "reload". Also, the "flat to interleaved" step is moved to a separate function, but that too is hidden by PP macros (mostly optimized away since the SSEi_flags are constants).
Boost is 5-10% depending on format. Still, I'm not quite sure we want to walk this path.
I'm pretty sure we can drop most branches in the SIMD functions using some clever rewrites and macros. The OpenCL functions do that. Well, it's not really a fair comparison since everything there is always inlined, but anyway.