Closed magnumripper closed 9 years ago
That was a good catch. This is why I cringe at people writing all the inline stuff just to gain a percent or two, thus HIDING the things that can easily yield bigger gains (such as an improved algorithm or other simplification tricks). I know we have done many things recently that have unified code (the pbkdf2_*.h stuff is a great example).
Note to self: bitselect(x, y, z) in XOP is _mm_cmov_si128(y, x, z) (mind the order). z is inverted.
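The operand order in the note above can be checked with a scalar model. This is only a sketch: `bitselect_ref` and `cmov_ref` are hypothetical helper names that model the per-bit behavior of OpenCL's bitselect() and XOP's _mm_cmov_si128() on a 32-bit word; they are not the real built-in or intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Model of OpenCL bitselect(a, b, c): per bit, b where c is 1, else a. */
uint32_t bitselect_ref(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Model of XOP _mm_cmov_si128(a, b, c): per bit, a where c is 1, else b. */
uint32_t cmov_ref(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & c) | (b & ~c);
}

/* Asserts that bitselect(x, y, z) equals cmov(y, x, z) and returns the result. */
uint32_t check_bitselect_vs_cmov(void)
{
    /* These constants cover all 8 combinations of (x, y, z) bits. */
    const uint32_t x = 0xF0F0F0F0u, y = 0xCCCCCCCCu, z = 0xAAAAAAAAu;
    uint32_t r = bitselect_ref(x, y, z);

    assert(r == cmov_ref(y, x, z));
    return r;
}
```

Because both operations are purely bitwise, agreeing on those three truth-table constants is enough to show they agree for all inputs.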
All done.
Re-assigning to @zzlei, please test/benchmark on NEON and Altivec if/when you can. I will test for regressions in OpenCL and Intel CPU.
Added e8703bbe and 957a5387 too after realizing MD4/5 F() is also the same as Ch(). Oh, and MD4 G() is the same as SHA-2 Maj(). 7071b4a9, and Solar found a new way of doing MD5 I() using one fewer op: 382a96177.
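The equivalences mentioned above can be verified from the textbook definitions. The sketch below uses the forms given in RFC 1320/1321 and FIPS 180-4, not the macros from the commits, and the function names are made up for illustration. It does not attempt to reproduce the MD5 I() trick, since that commit's contents are not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* MD4/MD5 F(): "if x then y else z", as written in the RFCs. */
uint32_t md_f(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (~x & z);
}

/* SHA-2 Ch(), as written in FIPS 180-4. */
uint32_t sha2_ch(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (~x & z);
}

/* MD4 G(): per-bit majority, written with OR. */
uint32_t md4_g(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (x & z) | (y & z);
}

/* SHA-2 Maj(): per-bit majority, written with XOR. */
uint32_t sha2_maj(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (x & z) ^ (y & z);
}

/* Returns 1 if F() == Ch() and G() == Maj() for every per-bit input combination. */
int round_functions_agree(void)
{
    /* Truth-table constants: together they cover all 8 (x, y, z) bit patterns. */
    const uint32_t x = 0xF0F0F0F0u, y = 0xCCCCCCCCu, z = 0xAAAAAAAAu;

    return md_f(x, y, z) == sha2_ch(x, y, z) &&
           md4_g(x, y, z) == sha2_maj(x, y, z);
}
```

Sharing one definition for each pair means an optimization such as a bitselect or cmov mapping only has to be written once. For example, Ch(x, y, z) has the same per-bit truth table as bitselect(z, y, x).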
I just tried it on Power. The only access I have to Power is through the GCC farm, and it fluctuates so badly (too many users perhaps).
Here are just 3 consecutive runs:
[zlei@gcc2-power8 src]$ ../run/john --test --format=pbkdf2-hmac-sha1
Will run 152 OpenMP threads
Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 128/128 AltiVec 4x]... (152xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 29257 c/s real, 2307 c/s virtual
[zlei@gcc2-power8 src]$ ../run/john --test --format=pbkdf2-hmac-sha1
Will run 152 OpenMP threads
Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 128/128 AltiVec 4x]... (152xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 133032 c/s real, 1706 c/s virtual
[zlei@gcc2-power8 src]$ ../run/john --test --format=pbkdf2-hmac-sha1
Will run 152 OpenMP threads
Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 128/128 AltiVec 4x]... (152xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 93388 c/s real, 1609 c/s virtual
I don't think I can get any useful benchmark results on this machine.
At least we know it's working :smile:
What if you run a lot fewer threads, like 4 or 8?
Yes, that works! I'll post the result on john-dev.
http://www.openwall.com/lists/john-dev/2015/09/02/5
Ch() for CUDA and (non-bitselect) OpenCL to 3-op.
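For reference, one well-known way to compute Ch() in three operations is sketched below. The commit itself is not shown here, so treating this as the form it uses is an assumption; `ch_4op` and `ch_3op` are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward Ch(): AND, NOT, AND, XOR (four operations). */
uint32_t ch_4op(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (~x & z);
}

/* Rearranged Ch(): XOR, AND, XOR (three operations). */
uint32_t ch_3op(uint32_t x, uint32_t y, uint32_t z)
{
    return z ^ (x & (y ^ z));
}

/* Returns 1 if both forms agree on all 8 per-bit input combinations. */
int ch_forms_agree(void)
{
    const uint32_t x = 0xF0F0F0F0u, y = 0xCCCCCCCCu, z = 0xAAAAAAAAu;

    return ch_4op(x, y, z) == ch_3op(x, y, z);
}
```

Where x is 0 the AND yields 0 and the result is z; where x is 1 the result is z ^ y ^ z, which is y. On hardware with a native and-not instruction the first form may also take three instructions, so the gain depends on the target.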