Unify SHA-1's H() a.k.a F3() a.k.a SHA-2's Maj() implementations

magnumripper commented 9 years ago

http://www.openwall.com/lists/john-dev/2015/09/02/5

[x] Change all OpenCL definitions using bitselect to 2-op version.
[x] Change all OpenCL non-bitselect fallbacks, and CUDA versions, to 4-op version.
[x] Change all Ch() for CUDA and (non-bitselect) OpenCL to 3-op.
[x] Same for SIMD intrinsics.
[x] Have a look at the scalar plain C stuff while at it.

jfoug commented 9 years ago

That was a good catch. It is why I cringe at people writing all the inline stuff just to gain a percent or two, thus HIDING the things that can easily make better gains (such as an improved algorithm or other simplification tricks). I know we have done many items recently that have unified code (the pbkdf2_*.h stuff is a great example).

magnumripper commented 9 years ago

Note to self: bitselect(x, y, z) in XOP is _mm_cmov_si128(y, x, z) (mind the order). z is inverted.

magnumripper commented 9 years ago

All done.

Re-assigning to @zzlei, please test/benchmark on NEON and Altivec if/when you can. I will test for regressions in OpenCL and Intel CPU.

magnumripper commented 9 years ago

Added e8703bbe and 957a5387 too, after realizing MD4/5 F() is also the same as Ch().

magnumripper commented 9 years ago

Oh, and MD4 G() is the same as SHA-2 Maj() (7071b4a9), and Solar found a new way of doing MD5 I() using one fewer op (382a96177).

lei-april commented 9 years ago

I just tried it on Power. The only access I have to Power is through the GCC farm, and it fluctuates so badly (too many users, perhaps).

Here are just 3 consecutive runs:

[zlei@gcc2-power8 src]$ ../run/john --test --format=pbkdf2-hmac-sha1
Will run 152 OpenMP threads
Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 128/128 AltiVec 4x]... (152xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw:    29257 c/s real, 2307 c/s virtual

[zlei@gcc2-power8 src]$ ../run/john --test --format=pbkdf2-hmac-sha1
Will run 152 OpenMP threads
Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 128/128 AltiVec 4x]... (152xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw:    133032 c/s real, 1706 c/s virtual

[zlei@gcc2-power8 src]$ ../run/john --test --format=pbkdf2-hmac-sha1
Will run 152 OpenMP threads
Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 128/128 AltiVec 4x]... (152xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw:    93388 c/s real, 1609 c/s virtual

I don't think I can get any useful benchmark results on this machine.

magnumripper commented 9 years ago

At least we know it's working :smile:

What if you run a lot fewer threads, like 4 or 8?

lei-april commented 9 years ago

> What if you run a lot fewer threads, like 4 or 8?

Yes, that works! I'll post the result on john-dev.

openwall / john #1727