Closed magnumripper closed 8 years ago
I have now added interleaving to SHA512, mostly mimicking SHA1, and it works when SIMD_PARA_SHA512 = 1. I noticed some formats already use SIMD_PARA_SHA512, e.g. sapH and Office, so I gave them a try. But they fail when SIMD_PARA_SHA512 is set to anything other than 1.
I'm not sure whether I did something wrong when adding interleaving to SHA512, or whether those formats are not using SIMD_PARA_SHA512 the proper way. Do you have any thoughts?
I would guess the formats lack some little detail. If you watch for every mention of SHA1_SSE_PARA and ensure the SHA512 version has a corresponding SIMD_PARA_SHA512, you should be almost set.
But there are also index calculations. They are harder to find because they do not really use the para macro. You need to check every mention of SIMD_COEF_64 and verify that it does not calculate an index without honoring interleaving. Sunmd5 has both variants. Here's the one that does NOT honor interleaving:
#define GETPOS(i, index) ( (((index)&(SIMD_COEF_32-1))<<2) + (((i)&(0xffffffff-3))*SIMD_COEF_32) + ((i)&3) )
Almost the same macro but honoring interleaving:
#define PARAGETPOS(i, index) ( (((index)&(SIMD_COEF_32-1))<<2) + (((i)&(0xffffffff-3))*SIMD_COEF_32) + ((i)&3) + (((unsigned int)index/SIMD_COEF_32*SIMD_COEF_32)<<6) )
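To see what the extra term does, here is a minimal sketch of the two macros as plain C functions. It assumes SIMD_COEF_32 is 4 (SSE2-style vectors), and the function names getpos and paragetpos are only for illustration. Without the last term, candidate 4 lands on the same bytes as candidate 0; with it, candidate 4 starts one full block later (64 bytes times 4 lanes).

```c
#include <stdint.h>

/* Assumption for illustration: 4 lanes of 32-bit words per vector. */
#define SIMD_COEF_32 4

/* Byte offset of byte i of candidate `index`, ignoring interleaving:
 * only the lane within one vector (index & 3) is taken into account. */
static unsigned getpos(unsigned i, unsigned index)
{
    return ((index & (SIMD_COEF_32 - 1)) << 2)
         + ((i & (0xffffffffU - 3)) * SIMD_COEF_32)
         + (i & 3);
}

/* Same offset plus the block term: every group of SIMD_COEF_32
 * candidates gets its own 64-byte-per-lane block. */
static unsigned paragetpos(unsigned i, unsigned index)
{
    return getpos(i, index)
         + ((index / SIMD_COEF_32 * SIMD_COEF_32) << 6);
}
```

With these definitions, getpos(0, 4) and getpos(0, 0) both give 0, while paragetpos(0, 4) gives 256.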
So, in Office we have a good GETPOS macro (similar to the latter above) but it's only used for byte access. What about the index calculations for 32-bit or 64-bit access?
Line 456
// Iteration counter in first 4 bytes
for (j = 0; j < SHA512_LOOP_CNT; j++)
keys32[j * 2 + j/SIMD_COEF_64*32*SIMD_COEF_64 + 1] = i_be;
The telltale part is j/SIMD_COEF_64*xxx so this one seems to be complete. Same for lines 464-465. Actually I can't spot a single place in Office where it's missing, but I'm really not sure. You might need to add debug prints. If self-test says cmp_all(5) failed, dump the data for index 4 in some places and verify everything looks like it should.
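As a debugging aid along those lines, a sketch like the following could pull one candidate's bytes out of an interleaved buffer so they can be printed and compared against the expected input. The helper names extract_candidate and dump_candidate are hypothetical, SIMD_COEF_32 is assumed to be 4, and the byte layout is the one from the sunmd5 PARAGETPOS macro quoted earlier. A format with a different layout (another byte order within a word, or 64-bit lanes) would need its own position macro here.

```c
#include <stdio.h>

/* Assumption for illustration: 4 lanes of 32-bit words per vector. */
#define SIMD_COEF_32 4

/* Position macro honoring interleaving (layout as in sunmd5). */
#define PARAGETPOS(i, index) \
    ( (((index) & (SIMD_COEF_32 - 1)) << 2) \
    + (((i) & (0xffffffffU - 3)) * SIMD_COEF_32) \
    + ((i) & 3) \
    + (((unsigned int)(index) / SIMD_COEF_32 * SIMD_COEF_32) << 6) )

/* Hypothetical helper: copy `len` bytes of one candidate out of the
 * interleaved buffer into a flat, human-readable buffer. */
static void extract_candidate(const unsigned char *buf, unsigned index,
                              unsigned char *out, unsigned len)
{
    unsigned i;

    for (i = 0; i < len; i++)
        out[i] = buf[PARAGETPOS(i, index)];
}

/* Hypothetical helper: hex-dump up to 64 bytes of one candidate to stderr. */
static void dump_candidate(const char *tag, const unsigned char *buf,
                           unsigned index, unsigned len)
{
    unsigned char flat[64];
    unsigned i;

    if (len > sizeof(flat))
        len = sizeof(flat);
    extract_candidate(buf, index, flat, len);
    fprintf(stderr, "%s[%u]:", tag, index);
    for (i = 0; i < len; i++)
        fprintf(stderr, " %02x", flat[i]);
    fputc('\n', stderr);
}
```

Calling dump_candidate("keys", buf, 4, 16) right after set_key and again just before the hash call would show whether candidate 4 ever reaches the second block.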
I added interleaving to SHA256 and managed to make it work with a few formats. Here are some statistics obtained from experimenting with the interleaving factor (pwsafe, sybasease & aix-ssha256, tested on well):
Benchmarking: pwsafe, Password Safe [SHA256 256/256 AVX2 8x]... (8xOMP) DONE Speed for cost 1 (iteration count) of 2048 Raw: 44224 c/s real, 5528 c/s virtual
Benchmarking: sybasease, Sybase ASE [SHA256 256/256 AVX2 8x]... (8xOMP) DONE Many salts: 8421K c/s real, 1052K c/s virtual Only one salt: 7503K c/s real, 935644 c/s virtual
Benchmarking: aix-ssha256, AIX LPA {ssha256} [PBKDF2-SHA256 256/256 AVX2 8x]... (8xOMP) DONE Speed for cost 1 (iteration count) of 64 Raw: 567296 c/s real, 70823 c/s virtual
Benchmarking: pwsafe, Password Safe [SHA256 256/256 AVX2 8x]... (8xOMP) DONE Speed for cost 1 (iteration count) of 2048 Raw: 34978 c/s real, 4404 c/s virtual
Benchmarking: sybasease, Sybase ASE [SHA256 256/256 AVX2 8x]... (8xOMP) DONE Many salts: 6815K c/s real, 853034 c/s virtual Only one salt: 6291K c/s real, 791378 c/s virtual
Benchmarking: aix-ssha256, AIX LPA {ssha256} [PBKDF2-SHA256 256/256 AVX2 8x]... (8xOMP) DONE Speed for cost 1 (iteration count) of 64 Raw: 461306 c/s real, 58240 c/s virtual
Benchmarking: pwsafe, Password Safe [SHA256 256/256 AVX2 8x]... (8xOMP) DONE Speed for cost 1 (iteration count) of 2048 Raw: 35738 c/s real, 4483 c/s virtual
Benchmarking: sybasease, Sybase ASE [SHA256 256/256 AVX2 8x]... (8xOMP) DONE Many salts: 6810K c/s real, 862958 c/s virtual Only one salt: 6553K c/s real, 857801 c/s virtual
Benchmarking: aix-ssha256, AIX LPA {ssha256} [PBKDF2-SHA256 256/256 AVX2 8x]... (8xOMP) DONE Speed for cost 1 (iteration count) of 64 Raw: 473088 c/s real, 59062 c/s virtual
Benchmarking: pwsafe, Password Safe [SHA256 256/256 AVX2 8x]... (8xOMP) DONE Speed for cost 1 (iteration count) of 2048 Raw: 40554 c/s real, 5075 c/s virtual
Benchmarking: sybasease, Sybase ASE [SHA256 256/256 AVX2 8x]... (8xOMP) DONE Many salts: 7453K c/s real, 986015 c/s virtual Only one salt: 7340K c/s real, 983918 c/s virtual
Benchmarking: aix-ssha256, AIX LPA {ssha256} [PBKDF2-SHA256 256/256 AVX2 8x]... (8xOMP) DONE Speed for cost 1 (iteration count) of 64 Raw: 523152 c/s real, 65637 c/s virtual
It seems interleaving doesn't give much help here.
BTW, when adding interleaving to SHA256, I realized that I might have done something wrong with SHA512, which renders it only functional with SIMD_PARA_SHA512 = 1
It's a pity it didn't bring any gain. What exact CPU was this, a desktop Haswell? Some older or future CPU (including the MIC, you should definitely try it!) may show better results. Also, SHA512 still might show a gain.
We should implement it fully and commit it anyway (but using para 1 for now) even if we can't find any current CPU that benefits from it. It might eventually get used for non-Intel hardware; the pseudo-intrinsics aren't necessarily bound to Intel intrinsics.
Maybe here's a plan:
The previous experimentation was done on well, so it's Haswell.
Some formats don't work with the new SHA256 & SHA512 at the moment, so I don't think it's a good idea to commit to bleeding-jumbo. I'm currently working on another temporary branch, interleaving, in my repo. Is there some branch in the public repo for committing unstable code? unstable-jumbo looks like one.
No, unstable-jumbo is an old branch based on core 1.7.9
But there should be no problem at all committing it as long as SIMD_PARA_SHAXX is defined to 1, right?
> But there should be no problem at all committing it as long as SIMD_PARA_SHAXX is defined to 1, right?
Yes, that's right. I got it.
After 1629e65 no formats segfault for me even with ASan, but several still fail (using para 2). Now only a bunch of details left to fix :laughing:
@jfoug @zzlei I fail to see what is wrong with some SHA512 formats when trying PARA 2 or 3. A very good example is Drupal7. Look at git diff b96ed88fc5009^ drupal7_fmt_plug.c. Very simple fixes; I really can't see anything missing. It should work. So I've been looking into SHA512 in sse-intrinsics.c but can't see anything wrong there either.
Also, all formats using pbkdf2_hmac_sha512.h fail. But there's nothing wrong with it!?
I'll give up on them for now and concentrate on trivial fixes for SHA224/256 for a while.
The only SHA256 format that fails is raw-sha256. All the rest are SHA512.
Cloudkeychain is PBKDF2-SHA512 but it doesn't use the shared function; it has a copy of its own. And that one, for some reason, works (or rather, it passes self-test. The tests don't catch all bugs).
After e229de81, all SHA256 formats pass the Test Suite. Most or all SHA512 formats fail (they might pass self-test but not the Test Suite).
I'm pretty sure that raw-sha512 (among others) is 100% right now (after 885c3cba5) but it still fails. I have been staring at sse-intrinsics.c a lot but can't see any problem there either.
On a side note I'm seeing good results for interleaving SHA256 on Haswell core i7 AVX2 (4790, gcc 4.8.2, 8xOMP w/ HT)
Raw-SHA256
1 34325K
2 32768K
3 35979K
4 38535K (+12%)
Raw-SHA384 (buggy code though - result may change)
1 22085K
2 21954K
3 17891K
4 17694K
With an older core i7 mobile, AVX, gcc 4.9.2, 8xOMP w/ HT, I see no gain from interleaving (only a loss).
Gotcha.
@zzlei you did right writing it like this
SHA512_PARA_DO(i) memcpy(w[i], &data[i * 16], 16 * sizeof(vtype));
I was confused by the fact that SHA512 has a different way of "expanding" the buffer from 16 to 80, so I erroneously changed it (somewhat misled by your comment "something's not right here", which turns out to be incorrect: it was 100% right).
After ce1f6e4, all formats pass self-test. Now on to Test Suite.
$ OMP_NUM_THREADS=3 ./jtrts.pl sha2 -q
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.13, Dec 21, 2014. By, Jim Fougeron & others
- Testing: John the Ripper password cracker, version 1.8.0.4-jumbo-1-bleeding_omp [darwin14.3.0 64-bit AVX-autoconf]
--------------------------------------------------------------------------------
Warning: SAP-B format should never be UTF-8.
Use --target-encoding=iso-8859-1 or whatever is applicable.
All tests passed without error. Performed 35 tests. Time used was 213 seconds
> SHA512_PARA_DO(i) memcpy(w[i], &data[i * 16], 16 * sizeof(vtype));

Well, when I first tested the interleaved SHA512, every format that takes the above code path failed self-test. I guessed that statement was somehow incorrect, but memcpy(w, &data, 16 * sizeof(vtype) *SIMD_PARA_SHA512) didn't seem right to me, either. So I just left that comment. Sorry for misleading you :)
It does make sense since we have w[80] and something like data[16]. The reasons they failed self-test were other things, now fixed.
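A small sketch of why the per-block copy is the right one, under the assumption that the schedule is laid out as w[PARA][80] and the input as PARA consecutive groups of 16 words. PARA stands in for SIMD_PARA_SHA512 and vtype is a plain 64-bit integer here rather than a real SIMD vector; the function names are made up for the example. Because each block's schedule has a stride of 80 words while the input has a stride of 16, one flat memcpy puts the second block's input into w[0][16..31] instead of w[1][0..15].

```c
#include <stdint.h>
#include <string.h>

#define PARA 2            /* stand-in for SIMD_PARA_SHA512 */
typedef uint64_t vtype;   /* stand-in for a SIMD vector type */

/* Per-block copy: block i's 16 input words go to the start of its own
 * 80-word schedule, which is what the SHA512_PARA_DO line does. */
static void load_per_block(vtype w[PARA][80], const vtype *data)
{
    unsigned i;

    for (i = 0; i < PARA; i++)
        memcpy(w[i], &data[i * 16], 16 * sizeof(vtype));
}

/* Single flat copy: fills w[0][0..31] and never touches w[1]. */
static void load_flat(vtype w[PARA][80], const vtype *data)
{
    memcpy(w, data, 16 * sizeof(vtype) * PARA);
}
```

The flat version would only be equivalent if the schedule were w[PARA][16], i.e. if the stride of w matched the stride of data.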
On a different note, we might want to try interleaving differently. Here's part of SHA1 (MD4 and MD5 are made similarly):
#define SHA1_ROUND2(a,b,c,d,e,F,t) \
SHA1_PARA_DO(i) tmp3[i] = tmpR[i*16+(t&0xF)]; \
SHA1_EXPAND2(t+16) \
F(b,c,d) \
SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp[i] ); \
SHA1_PARA_DO(i) tmp[i] = vroti_epi32(a[i], 5); \
SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp[i] ); \
SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], cst ); \
SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp3[i] ); \
SHA1_PARA_DO(i) b[i] = vroti_epi32(b[i], 30);
And here's how it goes in SHA256 and SHA512
#define SHA256_STEP0(a,b,c,d,e,f,g,h,x,K) \
{ \
SHA256_PARA_DO(i) \
{ \
w = _w[i].w; \
tmp1[i] = vadd_epi32(h[i], S1(e[i])); \
tmp1[i] = vadd_epi32(tmp1[i], Ch(e[i],f[i],g[i])); \
tmp1[i] = vadd_epi32(tmp1[i], vset1_epi32(K)); \
tmp1[i] = vadd_epi32(tmp1[i], w[x]); \
tmp2[i] = vadd_epi32(S0(a[i]),Maj(a[i],b[i],c[i])); \
d[i] = vadd_epi32(tmp1[i], d[i]); \
h[i] = vadd_epi32(tmp1[i], tmp2[i]); \
} \
}
One difference is that the former is almost guaranteed to be fully unrolled while the latter might not be.
Another thing I realized is SHA512 has a larger footprint. It uses w[80] throughout the function while SHA256 uses w[16] and handles the expansion in STEP_R. We should try changing that and see what happens. I think we should discuss both these things with Solar, he usually can tell by heart what it would mean to caches and pipelines :)
> One difference is that the former is almost guaranteed to be fully unrolled while the latter might not be.

I did notice the difference when I wrote it. But honestly I didn't get the point of wrapping each statement with a for clause. How does that help unrolling?
Maybe it gets unrolled anyway... but unrolled or not, the order of instructions ends up different (or rather, we leave more to the optimizer - which may be good or bad).
Current SHA1:
foo(w[0]);
foo(w[1]);
foo(w[2]);
foo(w[3]);
bar(w[0]);
bar(w[1]);
bar(w[2]);
bar(w[3]);
Current SHA2:
foo(w[0]);
bar(w[0]);
foo(w[1]);
bar(w[1]);
foo(w[2]);
bar(w[2]);
foo(w[3]);
bar(w[3]);
Assuming 'foo' is actually a load and 'bar' is an operation, the difference might be significant. OTOH assuming a good optimizer, this will be shuffled around to the best anyway.
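To make the two orderings concrete, here is a scalar sketch with a made-up three-statement step (a rotate and two adds, not the real SHA-1 or SHA-2 round). PARA, PARA_DO and the function names are placeholders for this example only. Both functions compute the same result; only the order in which the statements are issued across the interleaved blocks differs, and it is then up to the compiler to unroll and schedule them.

```c
#include <stdint.h>

#define PARA 4                                   /* interleaving factor */
#define PARA_DO(i) for ((i) = 0; (i) < PARA; (i)++)

static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* SHA1/MD4/MD5 style: every statement gets its own loop, so each
 * operation is issued for all blocks before the next one starts. */
static void step_per_line(uint32_t e[PARA], const uint32_t a[PARA],
                          const uint32_t w[PARA])
{
    unsigned i;
    uint32_t tmp[PARA];

    PARA_DO(i) tmp[i] = rotl32(a[i], 5);
    PARA_DO(i) e[i] += tmp[i];
    PARA_DO(i) e[i] += w[i];
}

/* SHA-2 style: one loop around the whole step, so all operations for
 * block 0 come first, then all operations for block 1, and so on. */
static void step_one_loop(uint32_t e[PARA], const uint32_t a[PARA],
                          const uint32_t w[PARA])
{
    unsigned i;

    PARA_DO(i) {
        uint32_t tmp = rotl32(a[i], 5);
        e[i] += tmp;
        e[i] += w[i];
    }
}
```

Note that the per-line form needs a tmp[PARA] array where the single-loop form gets away with one scalar temporary, which is one way the two styles can differ in register and stack use.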
> After ce1f6e4, all formats pass self-test. Now on to Test Suite.
Awesome. Guess I can start benchmarking for MIC now?
Please do. I think you should now use pbkdf2-hmac-sha256 and pbkdf2-hmac-sha512 for benching.
For each "para" between 1 and 5 (or more), try using 240 threads as well as 60 forks x 4 threads. And perhaps even using no threads. I have a feeling you should also try decreasing OMP_SCALE to 1 for pbkdf2-hmac-sha256 (it is 1 already for pbkdf2-hmac-sha512).
Please state exact compiler version when documenting, and post results to john-dev.
BTW the positive results I saw for Haswell seem to be gone now.
i7-4790 (HT) AVX2 gcc 4.8.2
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
SHA256 | 950 | 784 | 752 | 768 | 784 |
SHA512 | 143 | 136 | 106 | 107 | 113 |
8x SHA256 | 3692 | 2742 | 2671 | 2898 | 2931 |
8x SHA512 | 533 | 506 | 390 | 426 | 400 |
E5-2670 (HT) AVX gcc 4.8.2
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
SHA256 | 320 | 248 | 252 | 264 | 269 |
SHA512 | 47.0 | 33.6 | 34.6 | 33.0 | 36.0 |
14x SHA256 | 2661 | 2018 | 2067 | 2036 | 2133 |
14x SHA512 | 392 | 316 | 305 | 268 | 285 |
AMD 8435 (six-core) SSE2 gcc 4.8.2
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
SHA256 | 208 | 196 | 192 | 192 | 194 |
SHA512 | 33.6 | 32.4 | 25.4 | 26.8 | 25.6 |
6x SHA256 | 1235 | 1168 | 1152 | 1163 | 1161 |
6x SHA512 | 198 | 194 | 152 | 161 | 153 |
240 threads on MIC:
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
pbkdf2-hmac-sha256 (c/s real) | 9142 | 6592 | 7078 | 7253 | 7305 |
pbkdf2-hmac-sha256 (c/s virtual) | 38.0 | 27.9 | 29.6 | 30.4 | 30.5 |
pbkdf2-hmac-sha512 (c/s real) | 426 | 474 | 482 | 506 | 509 |
pbkdf2-hmac-sha512 (c/s virtual) | 1.8 | 1.9 | 2.0 | 2.1 | 2.1 |

BTW, OMP_SCALE = 1 and OMP_SCALE = 4 have nearly the same performance on MIC.
Somehow I couldn't get useful info from a forked run of pbkdf2-hmac-sha256/512. I used the same settings as when I benchmarked raw-md4(5), but only got output like:
Will run 4 OpenMP threads per process (240 total across 60 processes)
Node numbers 1-60 of 60 (fork)
Session stopped (max run-time reached)
I tried increasing --max-run, but to no avail. Is something wrong here?
I guess the total time for one crypt_all() is longer than the timer abort grace time. I think it's equivalent to pressing 'q' and then pressing it again after 30 seconds unless the session has ended. Could it be that one call takes longer than that? Maybe you used many salts? Try loading only one.
Posted to john-dev: http://www.openwall.com/lists/john-dev/2015/05/29/13
> I guess the total time for one crypt_all() is longer than the timer abort grace time. I think it's equivalent to pressing 'q' and then pressing it again after 30 seconds unless the session has ended. Could it be that one call takes longer than that? Maybe you used many salts? Try loading only one.

I'm cracking only one hash here. And even when I increase --max-run to 5 mins, the results are still all zeroes.
[zhanglei@mic0 zhanglei]$ run/john --format=pbkdf2-hmac-sha256 --mask=?l?l?l?l?l?l?l?l --fork=60 --max-run=300 hash.sha256
Loaded 1 password hash (PBKDF2-HMAC-SHA256 [PBKDF2-SHA256 512/512 MIC 16x])
Will run 4 OpenMP threads per process (240 total across 60 processes)
Node numbers 1-60 of 60 (fork)
41 0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s
32 0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s
30 0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s
52 0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s
48 0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s
47 0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s
49 0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s
50 0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s
60 0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s
Press 'q' or Ctrl-C to abort, almost any other key for status
1 0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s
Waiting for 59 children to terminate
Session stopped (max run-time reached)
But self-test doesn't take long for pbkdf2-hmac-sha256.
That's very strange. You should debug it and sort out what happens. Maybe your test hash has a much higher iteration count? Or maybe some bug makes it spin forever.
@zzlei be sure to read http://www.openwall.com/lists/john-dev/2015/05/30/6 and follow-ups
core i7 (AVX) laptop, 8xOMP/HT, OMP_SCALE=1, gcc 4.9.2
testpara.sh run in bleeding-jumbo
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
MD4 | 29963K | 41507K | 43383K | 42435K | 41394K |
MD5 | 99008 | 134336 | 143809 | 142848 | 132960 |
SHA1 | 40608 | 40384 | 24192 | 19643 | 17920 |
SHA256 | 1489 | 1317 | 1280 | 1291 | 1358 |
SHA512 | 233 | 182 | 182 | 171 | 169 |
testpara.sh run in topic branch "intrinsics-loops"
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
MD4 | 30368K | 40815K | 46460K | 45496K | 44937K |
MD5 | 98656 | 138368 | 150273 | 150016 | 125600 |
SHA1 | 40832 | 38848 | 35904 | 28134 | 24480 |
SHA256 | 1474 | 1353 | 1345 | 1366 | 1398 |
SHA512 | 233 | 182 | 184 | 179 | 184 |
SHA1 variants. @jfoug's SHA-1 expansion is ingenious, can't beat that. But lots of redundant temporary space can be dropped!
tweak\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
tmpR -> w[16] | 40672 | 39360 | 27264 | 20096 | 18851 |
tmp[i] -> tmp | 41120 | 38590 | 36192 | 28544 | 24712 |
testpara.sh in bleeding-jumbo after c81f637a7, now with assorted things from that topic branch merged.
core i7 (AVX) laptop, 8xOMP/HT, OMP_SCALE=1, gcc 4.9.2
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
MD4 | 28092K | 39937K | 45921K | 45297K | 43778K |
MD5 | 96352 | 137216 | 150432 | 149248 | 125440 |
SHA1 | 40480 | 38083 | 35712 | 28416 | 23445 |
SHA256 | 1362 | 1353 | 1371 | 1366 | 1384 |
SHA512 | 228 | 184 | 184 | 181 | 180 |
I believe the slight regressions seen in SHA-2 are random variations.
testpara.sh in bleeding-jumbo after d85f8fd, now with tmp[SIMD_PARA] back (only that change).
core i7 (AVX) laptop, 8xOMP/HT, OMP_SCALE=1, gcc 4.9.2
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
MD4 | 29197K | 41156K | 46727K | 46756K | 43805K |
MD5 | 89056 | 138432 | 150624 | 149888 | 117280 |
SHA1 | 40672 | 38208 | 34788 | 27776 | 20480 |
SHA256 | 1520 | 1344 | 1304 | 1315 | 1358 |
SHA512 | 228 | 181 | 182 | 169 | 173 |
Decreased w[80] pad of SHA512 to w[16] using same sliding-window technique as SHA1 & SHA256. Just a little gain... if any. Interestingly enough it had a bad effect on interleaving. Perhaps I should try the opposite with SHA1 and see what happens.
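For reference, the sliding-window idea can be sketched in scalar C as below. This is not the code from sse-intrinsics.c: the function names are made up, and plain uint64_t stands in for the vector type. It relies on the standard SHA-512 message schedule, where each new word depends only on the previous 16, so slot t & 15 (which still holds W[t-16]) can be updated in place to become W[t].

```c
#include <stdint.h>

static uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}

/* SHA-512 small sigma functions used by the message schedule. */
static uint64_t s0(uint64_t x) { return rotr64(x, 1) ^ rotr64(x, 8) ^ (x >> 7); }
static uint64_t s1(uint64_t x) { return rotr64(x, 19) ^ rotr64(x, 61) ^ (x >> 6); }

/* Full schedule: expand w[0..15] into w[16..79] up front. */
static void expand_full(uint64_t w[80])
{
    unsigned t;

    for (t = 16; t < 80; t++)
        w[t] = s1(w[t - 2]) + w[t - 7] + s0(w[t - 15]) + w[t - 16];
}

/* Sliding window: compute W[t] on demand in a 16-word ring buffer.
 * Must be called for t = 16, 17, ..., 79 in order. */
static uint64_t expand_step(uint64_t w[16], unsigned t)
{
    w[t & 15] += s1(w[(t - 2) & 15]) + w[(t - 7) & 15] + s0(w[(t - 15) & 15]);
    return w[t & 15];
}
```

Per interleaved block this shrinks the schedule from 80 to 16 words, at the cost of doing the expansion inside the round loop instead of before it.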
core i7 (AVX) laptop, 8xOMP/HT, OMP_SCALE=1, gcc 4.9.2
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
SHA512 | 235 | 156 | 168 | 164 | 164 |
Tried a few other things
"narrow for loops"
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
SHA256 | 1489 | 869 | 838 | 837 | 806 |
SHA512 | 235 | 137 | 122 | 119 | 125 |
"drop tmp2"
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
SHA256 | 1536 | 1317 | 1317 | 1340 | 1371 |
SHA512 | 230 | 186 | 184 | 182 | 184 |
I'm just throwing things at it and seeing what sticks. This is probably pointless; nothing is conclusive except reading asm output...
icc 14.0.0 (super) AVX 32x (NT is non-OMP) NOTE figures fluctuate a lot between runs despite no load on the system.
current bleeding-jumbo (518b8c4e)
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
MD4 | 28708K | 40725K | 43426K | 41231K | 39213K |
MD5 | 505472 | 672000 | 372480 | 347648 | 576640 |
SHA1 | 86400 | 86528 | 44483 | 45623 | 46257 |
SHA256 | 3513 | 2816 | 4942 | 4970 | 3200 |
SHA512 | FAILED (cmp_all(1)) |
I can't reproduce that SHA512 failure elsewhere.
Older code (pre cde0fb47):
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
MD4 | 28849K | 40491K | 43419K | 39478K | 37279K |
MD5 | 295296 | 341248 | 682752 | 317845 | 560000 |
SHA1 | 91648 | 88832 | 68815 | 37512 | 33280 |
SHA256 | 3764 | 2681 | 2671 | 2844 | 2990 |
SHA512 | FAILED (cmp_all(1)) |
Current bleeding, but with per-line loops:
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
MD4 | 28570K | 39664K | 43190K | 39783K | 38362K |
MD5 | 302592 | 340736 | 338688 | 611840 | 296554 |
SHA1 | 86656 | 92160 | 124800 | 35328 | 32000 |
SHA256 | 3513 | 1812 | 2021 | 1721 | 1443 |
SHA512 | FAILED (cmp_all(1)) |
Well, AVX2, current code (a277eb6)
gcc version 4.9.2 (GCC)
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
nt | 60519K | 83153K | 90451K | 92775K | 84013K |
nt-omp | 81330K | 89391K | 88276K | 77594K | 66355K |
md5crypt | 56440 | 84496 | 93384 | 91552 | 79480 |
md5crypt-omp | 325056 | 385536 | 394560 | 368128 | 320320 |
pbkdf2-hmac-sha1 | 25896 | 25328 | 23477 | 15588 | 14680 |
pbkdf2-hmac-sha1-omp | 109696 | 99072 | 89856 | 59136 | 57920 |
pbkdf2-hmac-sha256 | 934 | 776 | 736 | 745 | 769 |
pbkdf2-hmac-sha256-omp | 3712 | 2886 | 2823 | 2953 | 3047 |
pbkdf2-hmac-sha512 | 149 | 120 | 117 | 121 | 119 |
pbkdf2-hmac-sha512-omp | 576 | 448 | 436 | 457 | 436 |
With per-line interleaving loops:
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
nt | 60391K | 82787K | 88486K | 87884K | 82609K |
nt-omp | 81559K | 88801K | 86310K | 77725K | 69795K |
md5crypt | 56696 | 82240 | 90552 | 85728 | 79200 |
md5crypt-omp | 325184 | 380032 | 376704 | 344064 | 318400 |
pbkdf2-hmac-sha1 | 25888 | 25376 | 21216 | 19552 | 18400 |
pbkdf2-hmac-sha1-omp | 109632 | 98432 | 80832 | 74752 | 69386 |
pbkdf2-hmac-sha256 | 936 | 887 | 800 | 823 | 653 |
pbkdf2-hmac-sha256-omp | 3738 | 3355 | 3200 | 3200 | 2509 |
pbkdf2-hmac-sha512 | 150 | 139 | 126 | 117 | 99 |
pbkdf2-hmac-sha512-omp | 576 | 523 | 514 | 449 | 369 |
super, AVX, 16xOMP
gcc version 4.8.1 20130715 (Red Hat 4.8.1-4) (GCC)
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
nt | 29035K | 40249K | 45447K | 44843K | 35481K |
nt-omp | 62455K | 64356K | 62029K | 65273K | 64225K |
md5crypt | 24324 | 37008 | 41244 | 40736 | 27640 |
md5crypt-omp | 338368 | 512384 | 569856 | 577792 | 393280 |
pbkdf2-hmac-sha1 | 11252 | 11064 | 7848 | 6816 | 4320 |
pbkdf2-hmac-sha1-omp | 160640 | 159104 | 112348 | 98560 | 57920 |
pbkdf2-hmac-sha256 | 400 | 308 | 308 | 323 | 320 |
pbkdf2-hmac-sha256-omp | 5824 | 4266 | 4705 | 4430 | 4660 |
pbkdf2-hmac-sha512 | 63 | 43 | 44 | 44 | 43 |
pbkdf2-hmac-sha512-omp | 461 | 651 | 628 | 621 | 625 |
With per-line interleaving loops:
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
nt | 29326K | 39912K | 43238K | 42755K | 42564K |
nt-omp | 62357K | 62521K | 64585K | 65667K | 63733K |
md5crypt | 24340 | 36472 | 38520 | 38064 | 36960 |
md5crypt-omp | 339264 | 495616 | 556800 | 533504 | 522880 |
pbkdf2-hmac-sha1 | 11256 | 11216 | 5664 | 3136 | 2792 |
pbkdf2-hmac-sha1-omp | 160640 | 161024 | 94859 | 47872 | 40960 |
pbkdf2-hmac-sha256 | 400 | 160 | 128 | 120 | 140 |
pbkdf2-hmac-sha256-omp | 5760 | 2316 | 1864 | 1774 | 2018 |
pbkdf2-hmac-sha512 | 63 | 25 | 20 | 16 | 20 |
pbkdf2-hmac-sha512-omp | 909 | 369 | 276 | 253 | 230 |
All OpenMP figures are OMP_SCALE 1, except NT.
How did SHA512 fail self-test? Is it specific to icc?
Not sure if it's a compiler bug or something nasty in the code. I haven't seen it anywhere else. This is icc 14.0.0 on "super", maybe you have some other version to try?
icc 15.0.2 works fine on my Linux VM. I'd guess that's an icc issue, but I'm not sure.
Added PBKDF2-HMAC formats for MD4 and MD5 just to make testparas.pl better. And changed number of iterations to 1000 for all of them:
gcc version 5.1.0 (Homebrew gcc5 5.1.0) John the Ripper password cracker, version 1.8.0.6-jumbo-1-bleeding_omp [darwin14.3.0 64-bit AVX-autoconf]
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
md4 | 18836 | 29861 | 33100 | 33424 | 30780 |
md4-omp | 79520 | 120128 | 121920 | 120192 | 112000 |
md5 | 13532 | 21584 | 24420 | 22976 | 21465 |
md5-omp | 60736 | 86400 | 87920 | 84352 | 79360 |
sha1 | 10736 | 10952 | 8928 | 4032 | 3740 |
sha1-omp | 41312 | 39744 | 34176 | 19968 | 19840 |
sha256 | 4664 | 2384 | 3516 | 3952 | 4120 |
sha256-omp | 16736 | 10560 | 13782 | 15207 | 14891 |
sha512 | 1881 | 839 | 1290 | 1512 | 1524 |
sha512-omp | 6848 | 3808 | 4800 | 5639 | 5386 |
> Not sure if it's a compiler bug or something nasty in the code. I haven't seen it anywhere else. This is icc 14.0.0 on "super", maybe you have some other version to try?

This seems to be a similar issue to https://github.com/magnumripper/JohnTheRipper/issues/1452#issuecomment-113814032 on MIC. I turned off auto-vectorization, and the problem is gone.
Here's the right result on MIC (I'll delete the previous one):
icc version 14.0.0 (gcc version 4.4.7 compatibility) John the Ripper password cracker, version 1.8.0.6-jumbo-1-bleeding_omp [linux-gnu 64-bit mic-autoconf]
hash\para | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
md4 | 5687 | 6526 | 6510 | 6209 | 6196 |
md4-omp | 669148 | 737882 | 711529 | 662588 | 466019 |
md5 | 4182 | 4942 | 5037 | 5005 | 5048 |
md5-omp | 520871 | 536854 | 513267 | 462291 | 447378 |
sha1 | 2598 | 2321 | 1411 | 1415 | 1346 |
sha1-omp | 282352 | 253514 | 180705 | 173886 | 163018 |
sha256 | 1077 | 855 | 830 | 887 | 880 |
sha256-omp | 119300 | 97882 | 96000 | 98642 | 97627 |
sha512 | 123 | 137 | 154 | 165 | 172 |
sha512-omp | 15567 | 17614 | 19525 | 20389 | 21333 |
BTW, we've been stalling at this issue for a while. Which direction should we push in next?
Please report those latest figures on john-dev, including a notice we are now benching "PBKDF2-HMAC" even for MD4 and MD5.
Also, let's bring that question to john-dev for discussion. What would you prefer doing now? Actually it might be a good time to review the project definition and make a timeline for the remaining tasks (also listing what is already done). Please do so, and we'll discuss it on john-dev.
@zzlei our mic.h still defaults to para 1 for all hashes. Perhaps we should set it to 2 for MD4/MD5? Then I think we can close this issue for now.
@magnumripper I agree. I'll change it.
This is a GSoC task.
Some is already done: SIMD_PARA_SHA256 and SIMD_PARA_SHA512 are defined in some header files, and some formats do use them in index calculations (although they are currently hard-coded to 1). I think some formats do not yet. The SHA-2 functions in sse-intrinsics.c need a little code added.