openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Add interleaving to primary SHA-2 intrinsics functions. #1217

Closed magnumripper closed 8 years ago

magnumripper commented 9 years ago

This is a GSoC task.

Some of this is already done: SIMD_PARA_SHA256 and SIMD_PARA_SHA512 are defined in some header files, and some formats do use them in index calculations (although they are currently hard-coded to 1). I think some formats do not yet. The SHA-2 functions in sse-intrinsics.c need a little code added.

lei-april commented 9 years ago

I have now added interleaving to SHA512, mostly mimicking SHA1, and it works when SIMD_PARA_SHA512 = 1. I noticed some formats are already using SIMD_PARA_SHA512, e.g. sapH and Office, so I gave them a try. But they failed when SIMD_PARA_SHA512 was set to anything other than 1.

I'm not sure whether I've done something wrong when adding interleaving to SHA512, or those formats are not using SIMD_PARA_SHA512 the proper way. Do you have any thoughts?

magnumripper commented 9 years ago

I would guess the formats lack some little detail. If you watch for every mention of SHA1_SSE_PARA and ensure the SHA512 version has a corresponding SIMD_PARA_SHA512, you should be almost set.

But there are also index calculations. They are harder to find because they do not really use the para macro. You need to check every mention of SIMD_COEF_64 and verify that it does not calculate an index without honoring interleaving. Sunmd5 has both variants. Here's the one that does NOT honor interleaving:

#define GETPOS(i, index)            ( (((index)&(SIMD_COEF_32-1))<<2) + (((i)&(0xffffffff-3))*SIMD_COEF_32) + ((i)&3) )

Almost the same macro but honoring interleaving:

#define PARAGETPOS(i, index)        ( (((index)&(SIMD_COEF_32-1))<<2) + (((i)&(0xffffffff-3))*SIMD_COEF_32) + ((i)&3) + (((unsigned int)index/SIMD_COEF_32*SIMD_COEF_32)<<6) )
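To see the difference numerically, here is a minimal standalone sketch, assuming SIMD_COEF_32 is 4 (as with SSE2). Only the two macros come from Sunmd5 as quoted above; the wrappers `getpos()` and `paragetpos()` and the helper `show_positions()` are hypothetical names used for illustration.

```c
#include <stdio.h>

/* Assumption for this sketch: 4 lanes of 32 bits per vector (SSE2-like). */
#define SIMD_COEF_32 4

/* The two macros as quoted from Sunmd5. */
#define GETPOS(i, index)     ( (((index)&(SIMD_COEF_32-1))<<2) + (((i)&(0xffffffff-3))*SIMD_COEF_32) + ((i)&3) )
#define PARAGETPOS(i, index) ( (((index)&(SIMD_COEF_32-1))<<2) + (((i)&(0xffffffff-3))*SIMD_COEF_32) + ((i)&3) + (((unsigned int)index/SIMD_COEF_32*SIMD_COEF_32)<<6) )

/* Hypothetical wrappers so the byte offsets are easy to inspect. */
unsigned int getpos(unsigned int i, unsigned int idx)
{
    return GETPOS(i, idx);
}

unsigned int paragetpos(unsigned int i, unsigned int idx)
{
    return PARAGETPOS(i, idx);
}

/* Print the offset of byte 0 of each key across two vector groups. */
void show_positions(void)
{
    unsigned int idx;

    for (idx = 0; idx < 2 * SIMD_COEF_32; idx++)
        printf("key %u: GETPOS=%u PARAGETPOS=%u\n",
               idx, getpos(0, idx), paragetpos(0, idx));
}
```

For keys 0..3 both macros agree. From key 4 on, GETPOS wraps back onto the offsets of keys 0..3, while PARAGETPOS adds 64 bytes for every key in the preceding vector groups (256 bytes per group of four here), so each group lands in its own buffer.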

So, in Office we have a good GETPOS macro (similar to the latter above) but it's only used for byte access. What about the index calculations for 32-bit or 64-bit access?

Line 456

        // Iteration counter in first 4 bytes
        for (j = 0; j < SHA512_LOOP_CNT; j++)
            keys32[j * 2 + j/SIMD_COEF_64*32*SIMD_COEF_64 + 1] = i_be;

The telltale part is j/SIMD_COEF_64*xxx, so this one seems to be complete. Same for lines 464-465. Actually I can't spot a single place in Office where it's missing. But I'm really not sure. You might need to add debug prints. If self-test says cmp_all(5) failed, dump data for index 4 in some places and verify everything looks like it should.
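A debug dump along those lines could look like the sketch below. It assumes a byte-addressed interleaved key buffer laid out per the PARAGETPOS macro from Sunmd5 quoted earlier, with SIMD_COEF_32 = 4 and an interleaving factor of 2. `PARA`, `saved_key`, `set_key_bytes()`, `dump_key_hex()` and `demo_dump()` are hypothetical names, not functions from the tree.

```c
#include <stdio.h>
#include <string.h>

#define SIMD_COEF_32 4   /* assumption: 4 x 32-bit lanes */
#define PARA 2           /* assumption: interleaving factor */

#define PARAGETPOS(i, index) ( (((index)&(SIMD_COEF_32-1))<<2) + (((i)&(0xffffffff-3))*SIMD_COEF_32) + ((i)&3) + (((unsigned int)index/SIMD_COEF_32*SIMD_COEF_32)<<6) )

/* One 64-byte block per key, SIMD_COEF_32 * PARA keys in total. */
static unsigned char saved_key[64 * SIMD_COEF_32 * PARA];

/* Scatter the bytes of one candidate into the interleaved buffer. */
void set_key_bytes(unsigned int idx, const char *key)
{
    unsigned int i;

    for (i = 0; key[i]; i++)
        saved_key[PARAGETPOS(i, idx)] = (unsigned char)key[i];
}

/* Gather the first len bytes of candidate idx as hex.
 * out must have room for 2 * len + 1 characters. */
void dump_key_hex(unsigned int idx, unsigned int len, char *out)
{
    unsigned int i;

    for (i = 0; i < len; i++)
        sprintf(out + 2 * i, "%02x", saved_key[PARAGETPOS(i, idx)]);
}

/* Store "abc" as key 4 (first key of the second group) and read it back. */
const char *demo_dump(void)
{
    static char hex[2 * 3 + 1];

    memset(saved_key, 0, sizeof(saved_key));
    set_key_bytes(4, "abc");
    dump_key_hex(4, 3, hex);
    return hex;
}
```

If the dumped bytes for the failing index do not match what was stored, the index calculation at that spot is the likely culprit.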

lei-april commented 9 years ago

I added interleaving to SHA256 and managed to make it work with a few formats. Here are some statistics obtained from experimenting with the interleaving factor (pwsafe, sybasease & aix-ssha256, tested on well):

SIMD_PARA_SHA256 = 1

Benchmarking: pwsafe, Password Safe [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 2048
Raw: 44224 c/s real, 5528 c/s virtual

Benchmarking: sybasease, Sybase ASE [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Many salts: 8421K c/s real, 1052K c/s virtual
Only one salt: 7503K c/s real, 935644 c/s virtual

Benchmarking: aix-ssha256, AIX LPA {ssha256} [PBKDF2-SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 64
Raw: 567296 c/s real, 70823 c/s virtual

SIMD_PARA_SHA256 = 2

Benchmarking: pwsafe, Password Safe [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 2048
Raw: 34978 c/s real, 4404 c/s virtual

Benchmarking: sybasease, Sybase ASE [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Many salts: 6815K c/s real, 853034 c/s virtual
Only one salt: 6291K c/s real, 791378 c/s virtual

Benchmarking: aix-ssha256, AIX LPA {ssha256} [PBKDF2-SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 64
Raw: 461306 c/s real, 58240 c/s virtual

SIMD_PARA_SHA256 = 4

Benchmarking: pwsafe, Password Safe [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 2048
Raw: 35738 c/s real, 4483 c/s virtual

Benchmarking: sybasease, Sybase ASE [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Many salts: 6810K c/s real, 862958 c/s virtual
Only one salt: 6553K c/s real, 857801 c/s virtual

Benchmarking: aix-ssha256, AIX LPA {ssha256} [PBKDF2-SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 64
Raw: 473088 c/s real, 59062 c/s virtual

SIMD_PARA_SHA256 = 8

Benchmarking: pwsafe, Password Safe [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 2048
Raw: 40554 c/s real, 5075 c/s virtual

Benchmarking: sybasease, Sybase ASE [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Many salts: 7453K c/s real, 986015 c/s virtual
Only one salt: 7340K c/s real, 983918 c/s virtual

Benchmarking: aix-ssha256, AIX LPA {ssha256} [PBKDF2-SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 64
Raw: 523152 c/s real, 65637 c/s virtual

It seems interleaving doesn't give much help here.

BTW, when adding interleaving to SHA256, I realized that I might have done something wrong with SHA512, which renders it only functional with SIMD_PARA_SHA512 = 1

magnumripper commented 9 years ago

It's a pity it didn't bring any gain. What exact CPU was this, a desktop Haswell? Some older or future CPU (including the MIC, you should definitely try it!) may show better results. Also, SHA512 still might show a gain.

We should implement it fully and commit it anyway (but using para 1 for now) even if we can't find any current CPU that benefits from it. It might eventually get used for non-Intel targets; the pseudo-intrinsics aren't necessarily bound to Intel intrinsics.

Maybe here's a plan:

  1. Test this on the MIC too and post the results. BTW you should also post them (as well as the above, or just link to it) to john-dev for Solar to comment.
  2. Commit what you've got now (if it's ready for it) but obviously keep the para defined to 1. Me and others can help out fixing the rest of the SHA256 formats sooner or later.
  3. Finish SHA512 (at least one or two formats) and test this too on AVX2 and on the MIC, and post the results. And commit that too.

lei-april commented 9 years ago

The previous experimentation was done on well, so it's Haswell.

Some formats don't work with the new SHA256 & SHA512 at the moment, so I don't think it's a good idea to commit to bleeding-jumbo. I'm currently working on another temporary branch, interleaving, in my repo. Is there some branch in the public repo for committing unstable code? unstable-jumbo looks like one.

magnumripper commented 9 years ago

No, unstable-jumbo is an old branch based on core 1.7.9

But there should be no problem at all committing it as long as SIMD_PARA_SHAXX is defined to 1, right?

lei-april commented 9 years ago

> But there should be no problem at all committing it as long as SIMD_PARA_SHAXX is defined to 1, right?

Yes, that's right. I got it.

magnumripper commented 9 years ago

After 1629e65 no formats segfault for me even with ASan, but several still fail (using para 2). Now only a bunch of details left to fix :laughing:

magnumripper commented 9 years ago

@jfoug @zzlei I fail to see what is wrong with some SHA512 formats when trying with PARA 2 or 3. A very good example is Drupal7. Look at git diff b96ed88fc5009^ drupal7_fmt_plug.c. Very simple fixes, I really can't see anything missing. It should work. So I've been looking into SHA512 in sse-intrinsics.c but can't see anything wrong there either.

Also, all formats using pbkdf2_hmac_sha512.h fail. But there's nothing wrong with it!?

I'll give up on them for now and concentrate on trivial fixes for SHA224/256 for a while.

magnumripper commented 9 years ago

The only SHA256 format that fails is raw-sha256. All the rest are SHA512.

Cloudkeychain is PBKDF2-SHA512 but it doesn't use the shared function; it has a copy of its own. And that one, for some reason, works (or rather, it passes self-test; the tests don't catch all bugs).

magnumripper commented 9 years ago

After e229de81, all SHA256 formats pass the Test Suite. Most or all SHA512 formats fail (they might pass self-test but not the Test Suite).

I'm pretty sure that raw-sha512 and others are 100% right now (after 885c3cba5) but they still fail. I have been staring at sse-intrinsics.c a lot but can't see any problem there either.

magnumripper commented 9 years ago

On a side note I'm seeing good results for interleaving SHA256 on Haswell core i7 AVX2 (4790, gcc 4.8.2, 8xOMP w/ HT)

Raw-SHA256

1 34325K
2 32768K
3 35979K
4 38535K (+12%)

Raw-SHA384 (buggy code though - result may change)

1 22085K
2 21954K
3 17891K
4 17694K

With an older Core i7 mobile (AVX, gcc 4.9.2, 8xOMP w/ HT), I see no gain from interleaving, only a loss.

magnumripper commented 9 years ago

Gotcha.

@zzlei you did right writing it like this

    SHA512_PARA_DO(i) memcpy(w[i], &data[i * 16], 16 * sizeof(vtype));

I was confused by the fact that SHA512 has a different way of "expanding" the buffer from 16 to 80, so I erroneously changed it (somewhat misled by your comment "something's not right here", which turns out to be incorrect; it was 100% right).

magnumripper commented 9 years ago

After ce1f6e4, all formats pass self-test. Now on to Test Suite.

magnumripper commented 9 years ago

$ OMP_NUM_THREADS=3 ./jtrts.pl sha2 -q
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.13, Dec 21, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.4-jumbo-1-bleeding_omp [darwin14.3.0 64-bit AVX-autoconf]
--------------------------------------------------------------------------------
Warning: SAP-B format should never be UTF-8.
Use --target-encoding=iso-8859-1 or whatever is applicable.
All tests passed without error.  Performed 35 tests.  Time used was 213 seconds

lei-april commented 9 years ago

> SHA512_PARA_DO(i) memcpy(w[i], &data[i * 16], 16 * sizeof(vtype));

Well, when I first tested the interleaved SHA512, every format which takes the above code path failed self-test. I guess that statement is somehow incorrect, but memcpy(w, &data, 16 * sizeof(vtype) *SIMD_PARA_SHA512) doesn't seem right to me, either. So I just left that comment. Sorry for misleading you :)

magnumripper commented 9 years ago

It does make sense since we have w[80] and something like data[16]. The reasons they failed self-test were other things, now fixed.
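To illustrate why the per-buffer copy is the right one, here is a small sketch that uses uint64_t as a stand-in for vtype and an interleaving factor of 2. All names (`PARA`, `PARA_DO`, `load_blocks()`, `load_blocks_flat()`, `first_word_of_row1()`) are made up for the illustration; the real code in sse-intrinsics.c operates on SIMD vectors.

```c
#include <stdint.h>
#include <string.h>

#define PARA 2                 /* stand-in for SIMD_PARA_SHA512 */
typedef uint64_t vtype;        /* stand-in for a SIMD vector type */
#define PARA_DO(x) for ((x) = 0; (x) < PARA; (x)++)

/* Copy 16 input words per interleaved buffer to the start of each
 * 80-word schedule. data holds PARA blocks of 16 words back to back. */
void load_blocks(vtype w[PARA][80], const vtype *data)
{
    int i;

    PARA_DO(i) memcpy(w[i], &data[i * 16], 16 * sizeof(vtype));
}

/* A single flat copy puts everything into w[0], because the rows of w
 * are 80 words apart while the blocks in data are 16 words apart. */
void load_blocks_flat(vtype w[PARA][80], const vtype *data)
{
    memcpy(w, data, 16 * sizeof(vtype) * PARA);
}

/* Returns w[1][0] after loading: 17 means buffer 1 received its own
 * block, 0 means it was left untouched. */
vtype first_word_of_row1(int use_flat_copy)
{
    vtype data[PARA * 16];
    vtype w[PARA][80];
    int k;

    for (k = 0; k < PARA * 16; k++)
        data[k] = (vtype)(k + 1);
    memset(w, 0, sizeof(w));
    if (use_flat_copy)
        load_blocks_flat(w, data);
    else
        load_blocks(w, data);
    return w[1][0];
}
```

With the per-buffer copy, the second buffer starts with word 17 of the input. With the flat copy that word ends up in w[0][16] and the second buffer stays empty.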

magnumripper commented 9 years ago

On a different note, we might want to try interleaving differently. Here's part of SHA1 (MD4 and MD5 are made similarly):

#define SHA1_ROUND2(a,b,c,d,e,F,t) \
    SHA1_PARA_DO(i) tmp3[i] = tmpR[i*16+(t&0xF)]; \
    SHA1_EXPAND2(t+16) \
    F(b,c,d) \
    SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp[i] ); \
    SHA1_PARA_DO(i) tmp[i] = vroti_epi32(a[i], 5); \
    SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp[i] ); \
    SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], cst ); \
    SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp3[i] ); \
    SHA1_PARA_DO(i) b[i] = vroti_epi32(b[i], 30);

And here's how it goes in SHA256 and SHA512

#define SHA256_STEP0(a,b,c,d,e,f,g,h,x,K)                    \
{                                                            \
    SHA256_PARA_DO(i)                                        \
    {                                                        \
        w = _w[i].w;                                         \
        tmp1[i] = vadd_epi32(h[i],    S1(e[i]));             \
        tmp1[i] = vadd_epi32(tmp1[i], Ch(e[i],f[i],g[i]));   \
        tmp1[i] = vadd_epi32(tmp1[i], vset1_epi32(K));       \
        tmp1[i] = vadd_epi32(tmp1[i], w[x]);                 \
        tmp2[i] = vadd_epi32(S0(a[i]),Maj(a[i],b[i],c[i]));  \
        d[i]    = vadd_epi32(tmp1[i], d[i]);                 \
        h[i]    = vadd_epi32(tmp1[i], tmp2[i]);              \
    }                                                        \
}

One difference is that the former is almost guaranteed to be fully unrolled while the latter might not be.

magnumripper commented 9 years ago

Another thing I realized is SHA512 has a larger footprint. It uses w[80] throughout the function while SHA256 uses w[16] and handles the expansion in STEP_R. We should try changing that and see what happens. I think we should discuss both these things with Solar, he usually can tell by heart what it would mean to caches and pipelines :)
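For reference, the two schedule layouts can be compared in plain scalar C. This sketch uses the standard SHA-512 message schedule functions; `expand_full()` keeps all 80 words, while `expand_step()` keeps a 16-word sliding window. The function names are hypothetical, and the sketch only shows that both layouts produce the same words. It says nothing about which one is faster with SIMD and interleaving.

```c
#include <stdint.h>
#include <string.h>

static uint64_t rotr64(uint64_t x, int n) { return (x >> n) | (x << (64 - n)); }
static uint64_t sigma0(uint64_t x) { return rotr64(x, 1) ^ rotr64(x, 8) ^ (x >> 7); }
static uint64_t sigma1(uint64_t x) { return rotr64(x, 19) ^ rotr64(x, 61) ^ (x >> 6); }

/* Variant A: expand the whole schedule up front (80 words live at once). */
void expand_full(uint64_t w[80], const uint64_t block[16])
{
    int t;

    memcpy(w, block, 16 * sizeof(uint64_t));
    for (t = 16; t < 80; t++)
        w[t] = sigma1(w[t - 2]) + w[t - 7] + sigma0(w[t - 15]) + w[t - 16];
}

/* Variant B: 16-word sliding window. Slot t & 15 holds W[t-16] on entry
 * and W[t] on return. Must be called for t = 16, 17, ... in order. */
uint64_t expand_step(uint64_t w[16], int t)
{
    w[t & 15] += sigma1(w[(t - 2) & 15]) + w[(t - 7) & 15] + sigma0(w[(t - 15) & 15]);
    return w[t & 15];
}

/* Returns 1 if both variants yield the same schedule words 16..79. */
int schedules_match(const uint64_t block[16])
{
    uint64_t full[80], win[16];
    int t;

    expand_full(full, block);
    memcpy(win, block, sizeof(win));
    for (t = 16; t < 80; t++)
        if (expand_step(win, t) != full[t])
            return 0;
    return 1;
}

/* Run the comparison on an arbitrary test block. */
int demo_schedules_match(void)
{
    uint64_t block[16];
    int k;

    for (k = 0; k < 16; k++)
        block[k] = 0x0123456789abcdefULL * (uint64_t)(k + 1);
    return schedules_match(block);
}
```

The sliding window cuts the live schedule from 80 to 16 words per buffer, which matters more as the interleaving factor multiplies the number of buffers.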

lei-april commented 9 years ago

> One difference is that the former is almost guaranteed to be fully unrolled while the latter might not be.

I did notice the difference when I wrote it. But honestly I didn't get the point of wrapping each statement with a for clause. How does that help unrolling?

magnumripper commented 9 years ago

Maybe it gets unrolled anyway... but unrolled or not, the order of instructions ends up different (or rather, we leave more to the optimizer - which may be good or bad).

Current SHA1:

    foo(w[0]);
    foo(w[1]);
    foo(w[2]);
    foo(w[3]);
    bar(w[0]);
    bar(w[1]);
    bar(w[2]);
    bar(w[3]);

Current SHA2:

    foo(w[0]);
    bar(w[0]);
    foo(w[1]);
    bar(w[1]);
    foo(w[2]);
    bar(w[2]);
    foo(w[3]);
    bar(w[3]);

Assuming 'foo' is actually a load and 'bar' is an operation, the difference might be significant. OTOH assuming a good optimizer, this will be shuffled around to the best anyway.
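The two loop shapes can be written out in scalar C to make the comparison concrete. This is only a sketch: uint32_t stands in for a SIMD vector, `PARA` for the interleaving factor, and the step is a simplified SHA-1-like update rather than the real round. `step_per_line()`, `step_per_block()` and `orders_agree()` are hypothetical names.

```c
#include <stdint.h>

#define PARA 4   /* stand-in for an interleaving factor such as SIMD_PARA_SHA256 */
#define PARA_DO(x) for ((x) = 0; (x) < PARA; (x)++)

static uint32_t rotl32(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* SHA-1 style: one short loop per statement, so the same operation is
 * issued for all interleaved buffers before moving to the next one. */
void step_per_line(uint32_t e[PARA], const uint32_t a[PARA],
                   const uint32_t w[PARA], uint32_t k)
{
    uint32_t tmp[PARA];
    int i;

    PARA_DO(i) tmp[i] = rotl32(a[i], 5);
    PARA_DO(i) e[i] += tmp[i];
    PARA_DO(i) e[i] += k;
    PARA_DO(i) e[i] += w[i];
}

/* SHA-2 style: one loop around the whole step, so each buffer is
 * finished before the next one starts (unless the compiler reorders). */
void step_per_block(uint32_t e[PARA], const uint32_t a[PARA],
                    const uint32_t w[PARA], uint32_t k)
{
    int i;

    PARA_DO(i) {
        uint32_t tmp = rotl32(a[i], 5);

        e[i] += tmp;
        e[i] += k;
        e[i] += w[i];
    }
}

/* Returns 1 if both orderings compute the same values. */
int orders_agree(void)
{
    const uint32_t a[PARA] = { 1u, 0x80000000u, 0xdeadbeefu, 42u };
    const uint32_t w[PARA] = { 7u, 8u, 9u, 10u };
    uint32_t e1[PARA] = { 100u, 200u, 300u, 400u };
    uint32_t e2[PARA] = { 100u, 200u, 300u, 400u };
    int i;

    step_per_line(e1, a, w, 0x5a827999u);
    step_per_block(e2, a, w, 0x5a827999u);
    PARA_DO(i) if (e1[i] != e2[i]) return 0;
    return 1;
}
```

Both shapes give identical results. What differs is the instruction order handed to the compiler and the temporaries needed: the per-line form needs tmp[PARA], the per-block form a single tmp.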

lei-april commented 9 years ago

> After ce1f6e4, all formats pass self-test. Now on to Test Suite.

Awesome. Guess I can start benchmarking for MIC now?

magnumripper commented 9 years ago

Please do. I think you should now use pbkdf2-hmac-sha256 and pbkdf2-hmac-sha512 for benching.

For each "para" between 1 and 5 (or more), try using 240 threads as well as 60 forks x 4 threads. And perhaps even using no threads. I have a feeling you should also try decreasing OMP_SCALE to 1 for pbkdf2-hmac-sha256 (it is 1 already for pbkdf2-hmac-sha512).

Please state exact compiler version when documenting, and post results to john-dev.

magnumripper commented 9 years ago

BTW the positive results I saw for Haswell seem to be gone now.

i7-4790 (HT) AVX2 gcc 4.8.2

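| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| SHA256 | 950 | 784 | 752 | 768 | 784 |
| SHA512 | 143 | 136 | 106 | 107 | 113 |
| 8x SHA256 | 3692 | 2742 | 2671 | 2898 | 2931 |
| 8x SHA512 | 533 | 506 | 390 | 426 | 400 |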

E5-2670 (HT) AVX gcc 4.8.2

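| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| SHA256 | 320 | 248 | 252 | 264 | 269 |
| SHA512 | 47.0 | 33.6 | 34.6 | 33.0 | 36.0 |
| 14x SHA256 | 2661 | 2018 | 2067 | 2036 | 2133 |
| 14x SHA512 | 392 | 316 | 305 | 268 | 285 |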

AMD 8435 (six-core) SSE2 gcc 4.8.2

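| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| SHA256 | 208 | 196 | 192 | 192 | 194 |
| SHA512 | 33.6 | 32.4 | 25.4 | 26.8 | 25.6 |
| 6x SHA256 | 1235 | 1168 | 1152 | 1163 | 1161 |
| 6x SHA512 | 198 | 194 | 152 | 161 | 153 |

lei-april commented 9 years ago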

240 threads on MIC:

pbkdf2-hmac-sha256
x1 Raw: 9142 c/s real, 38.0 c/s virtual
x2 Raw: 6592 c/s real, 27.9 c/s virtual
x3 Raw: 7078 c/s real, 29.6 c/s virtual
x4 Raw: 7253 c/s real, 30.4 c/s virtual
x5 Raw: 7305 c/s real, 30.5 c/s virtual

pbkdf2-hmac-sha512
x1 Raw: 426 c/s real, 1.8 c/s virtual
x2 Raw: 474 c/s real, 1.9 c/s virtual
x3 Raw: 482 c/s real, 2.0 c/s virtual
x4 Raw: 506 c/s real, 2.1 c/s virtual
x5 Raw: 509 c/s real, 2.1 c/s virtual

BTW, OMP_SCALE = 1 and OMP_SCALE = 4 give nearly the same performance on MIC.

lei-april commented 9 years ago

Somehow I couldn't get useful info from a forked run of pbkdf2-hmac-sha256/512. I used the same settings as when I benchmarked raw-md4 and raw-md5, but only got output like:

Will run 4 OpenMP threads per process (240 total across 60 processes)
Node numbers 1-60 of 60 (fork)
Session stopped (max run-time reached)

I tried increasing --max-run, but to no avail. Something wrong here?

magnumripper commented 9 years ago

I guess the total time for one crypt_all() is longer than the timer abort grace time. I think it's equivalent to pressing 'q' and then press it again after 30 seconds unless session has ended. Could it be that one call is longer than that? Maybe you used many salts? Try only loading one.

magnumripper commented 9 years ago

Posted to john-dev: http://www.openwall.com/lists/john-dev/2015/05/29/13

lei-april commented 9 years ago

> I guess the total time for one crypt_all() is longer than the timer abort grace time. I think it's equivalent to pressing 'q' and then press it again after 30 seconds unless session has ended. Could it be that one call is longer than that? Maybe you used many salts? Try only loading one.

I'm cracking only one hash here. And even when I increase --max-run to 5 mins, the results are still all zeroes.

[zhanglei@mic0 zhanglei]$ run/john --format=pbkdf2-hmac-sha256 --mask=?l?l?l?l?l?l?l?l --fork=60 --max-run=300 hash.sha256
Loaded 1 password hash (PBKDF2-HMAC-SHA256 [PBKDF2-SHA256 512/512 MIC 16x])
Will run 4 OpenMP threads per process (240 total across 60 processes)
Node numbers 1-60 of 60 (fork)
41 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
32 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
30 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
52 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
48 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
47 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
49 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
50 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
60 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
Press 'q' or Ctrl-C to abort, almost any other key for status
1 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
Waiting for 59 children to terminate
Session stopped (max run-time reached)

But self-test doesn't take long for pbkdf2-hmac-sha256.

magnumripper commented 9 years ago

That's very strange. You should debug it and sort out what happens. Maybe your test hash has a much higher iteration count? Or maybe some bug makes it spin forever.

magnumripper commented 9 years ago

@zzlei be sure to read http://www.openwall.com/lists/john-dev/2015/05/30/6 and follow-ups

magnumripper commented 9 years ago

core i7 (AVX) laptop, 8xOMP/HT, OMP_SCALE=1, gcc 4.9.2

testpara.sh run in bleeding-jumbo

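| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| MD4 | 29963K | 41507K | 43383K | 42435K | 41394K |
| MD5 | 99008 | 134336 | 143809 | 142848 | 132960 |
| SHA1 | 40608 | 40384 | 24192 | 19643 | 17920 |
| SHA256 | 1489 | 1317 | 1280 | 1291 | 1358 |
| SHA512 | 233 | 182 | 182 | 171 | 169 |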

testpara.sh run in topic branch "intrinsics-loops"

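| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| MD4 | 30368K | 40815K | 46460K | 45496K | 44937K |
| MD5 | 98656 | 138368 | 150273 | 150016 | 125600 |
| SHA1 | 40832 | 38848 | 35904 | 28134 | 24480 |
| SHA256 | 1474 | 1353 | 1345 | 1366 | 1398 |
| SHA512 | 233 | 182 | 184 | 179 | 184 |

magnumripper commented 9 years ago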

SHA1 variants. @jfoug's SHA-1 expansion is ingenious, can't beat that. But lots of redundant temporary space can be dropped!

| tweak\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| tmpR -> w[16] | 40672 | 39360 | 27264 | 20096 | 18851 |
| tmp[i] -> tmp | 41120 | 38590 | 36192 | 28544 | 24712 |

magnumripper commented 9 years ago

testpara.sh in bleeding-jumbo after c81f637a7, now with assorted things from that topic branch merged.

core i7 (AVX) laptop, 8xOMP/HT, OMP_SCALE=1, gcc 4.9.2

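| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| MD4 | 28092K | 39937K | 45921K | 45297K | 43778K |
| MD5 | 96352 | 137216 | 150432 | 149248 | 125440 |
| SHA1 | 40480 | 38083 | 35712 | 28416 | 23445 |
| SHA256 | 1362 | 1353 | 1371 | 1366 | 1384 |
| SHA512 | 228 | 184 | 184 | 181 | 180 |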

I believe the slight regressions seen in SHA-2 are random variations.

magnumripper commented 9 years ago

testpara.sh in bleeding-jumbo after d85f8fd, now with tmp[SIMD_PARA] back (only that change).

core i7 (AVX) laptop, 8xOMP/HT, OMP_SCALE=1, gcc 4.9.2

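| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| MD4 | 29197K | 41156K | 46727K | 46756K | 43805K |
| MD5 | 89056 | 138432 | 150624 | 149888 | 117280 |
| SHA1 | 40672 | 38208 | 34788 | 27776 | 20480 |
| SHA256 | 1520 | 1344 | 1304 | 1315 | 1358 |
| SHA512 | 228 | 181 | 182 | 169 | 173 |

magnumripper commented 9 years ago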

Decreased w[80] pad of SHA512 to w[16] using same sliding-window technique as SHA1 & SHA256. Just a little gain... if any. Interestingly enough it had a bad effect on interleaving. Perhaps I should try the opposite with SHA1 and see what happens.

core i7 (AVX) laptop, 8xOMP/HT, OMP_SCALE=1, gcc 4.9.2

| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| SHA512 | 235 | 156 | 168 | 164 | 164 |

magnumripper commented 9 years ago

Tried a few other things

"narrow for loops"

| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| SHA256 | 1489 | 869 | 838 | 837 | 806 |
| SHA512 | 235 | 137 | 122 | 119 | 125 |

"drop tmp2"

| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| SHA256 | 1536 | 1317 | 1317 | 1340 | 1371 |
| SHA512 | 230 | 186 | 184 | 182 | 184 |

I'm just throwing things at it to see what sticks. This is probably pointless; nothing is conclusive short of reading the asm output...

magnumripper commented 9 years ago

icc 14.0.0 (super) AVX 32x (NT is non-OMP)

NOTE: figures fluctuate a lot between runs despite no load on the system.

current bleeding-jumbo (518b8c4e)

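| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| MD4 | 28708K | 40725K | 43426K | 41231K | 39213K |
| MD5 | 505472 | 672000 | 372480 | 347648 | 576640 |
| SHA1 | 86400 | 86528 | 44483 | 45623 | 46257 |
| SHA256 | 3513 | 2816 | 4942 | 4970 | 3200 |
| SHA512 | FAILED (cmp_all(1)) | | | | |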

I can't reproduce that SHA512 failure elsewhere.

Older code (pre cde0fb47):

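| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| MD4 | 28849K | 40491K | 43419K | 39478K | 37279K |
| MD5 | 295296 | 341248 | 682752 | 317845 | 560000 |
| SHA1 | 91648 | 88832 | 68815 | 37512 | 33280 |
| SHA256 | 3764 | 2681 | 2671 | 2844 | 2990 |
| SHA512 | FAILED (cmp_all(1)) | | | | |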

Current bleeding, but with per-line loops:

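| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| MD4 | 28570K | 39664K | 43190K | 39783K | 38362K |
| MD5 | 302592 | 340736 | 338688 | 611840 | 296554 |
| SHA1 | 86656 | 92160 | 124800 | 35328 | 32000 |
| SHA256 | 3513 | 1812 | 2021 | 1721 | 1443 |
| SHA512 | FAILED (cmp_all(1)) | | | | |

magnumripper commented 9 years ago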

Well, AVX2, current code (a277eb6)

gcc version 4.9.2 (GCC)

| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| nt | 60519K | 83153K | 90451K | 92775K | 84013K |
| nt-omp | 81330K | 89391K | 88276K | 77594K | 66355K |
| md5crypt | 56440 | 84496 | 93384 | 91552 | 79480 |
| md5crypt-omp | 325056 | 385536 | 394560 | 368128 | 320320 |
| pbkdf2-hmac-sha1 | 25896 | 25328 | 23477 | 15588 | 14680 |
| pbkdf2-hmac-sha1-omp | 109696 | 99072 | 89856 | 59136 | 57920 |
| pbkdf2-hmac-sha256 | 934 | 776 | 736 | 745 | 769 |
| pbkdf2-hmac-sha256-omp | 3712 | 2886 | 2823 | 2953 | 3047 |
| pbkdf2-hmac-sha512 | 149 | 120 | 117 | 121 | 119 |
| pbkdf2-hmac-sha512-omp | 576 | 448 | 436 | 457 | 436 |

With per-line interleaving loops:

| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| nt | 60391K | 82787K | 88486K | 87884K | 82609K |
| nt-omp | 81559K | 88801K | 86310K | 77725K | 69795K |
| md5crypt | 56696 | 82240 | 90552 | 85728 | 79200 |
| md5crypt-omp | 325184 | 380032 | 376704 | 344064 | 318400 |
| pbkdf2-hmac-sha1 | 25888 | 25376 | 21216 | 19552 | 18400 |
| pbkdf2-hmac-sha1-omp | 109632 | 98432 | 80832 | 74752 | 69386 |
| pbkdf2-hmac-sha256 | 936 | 887 | 800 | 823 | 653 |
| pbkdf2-hmac-sha256-omp | 3738 | 3355 | 3200 | 3200 | 2509 |
| pbkdf2-hmac-sha512 | 150 | 139 | 126 | 117 | 99 |
| pbkdf2-hmac-sha512-omp | 576 | 523 | 514 | 449 | 369 |

super, AVX, 16xOMP

gcc version 4.8.1 20130715 (Red Hat 4.8.1-4) (GCC)

| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| nt | 29035K | 40249K | 45447K | 44843K | 35481K |
| nt-omp | 62455K | 64356K | 62029K | 65273K | 64225K |
| md5crypt | 24324 | 37008 | 41244 | 40736 | 27640 |
| md5crypt-omp | 338368 | 512384 | 569856 | 577792 | 393280 |
| pbkdf2-hmac-sha1 | 11252 | 11064 | 7848 | 6816 | 4320 |
| pbkdf2-hmac-sha1-omp | 160640 | 159104 | 112348 | 98560 | 57920 |
| pbkdf2-hmac-sha256 | 400 | 308 | 308 | 323 | 320 |
| pbkdf2-hmac-sha256-omp | 5824 | 4266 | 4705 | 4430 | 4660 |
| pbkdf2-hmac-sha512 | 63 | 43 | 44 | 44 | 43 |
| pbkdf2-hmac-sha512-omp | 461 | 651 | 628 | 621 | 625 |

With per-line interleaving loops:

| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| nt | 29326K | 39912K | 43238K | 42755K | 42564K |
| nt-omp | 62357K | 62521K | 64585K | 65667K | 63733K |
| md5crypt | 24340 | 36472 | 38520 | 38064 | 36960 |
| md5crypt-omp | 339264 | 495616 | 556800 | 533504 | 522880 |
| pbkdf2-hmac-sha1 | 11256 | 11216 | 5664 | 3136 | 2792 |
| pbkdf2-hmac-sha1-omp | 160640 | 161024 | 94859 | 47872 | 40960 |
| pbkdf2-hmac-sha256 | 400 | 160 | 128 | 120 | 140 |
| pbkdf2-hmac-sha256-omp | 5760 | 2316 | 1864 | 1774 | 2018 |
| pbkdf2-hmac-sha512 | 63 | 25 | 20 | 16 | 20 |
| pbkdf2-hmac-sha512-omp | 909 | 369 | 276 | 253 | 230 |

All OpenMP figures are OMP_SCALE 1, except NT.

lei-april commented 9 years ago

How did SHA512 fail self-test? Is it specific to icc?

magnumripper commented 9 years ago

Not sure if it's a compiler bug or something nasty in the code. I haven't seen it anywhere else. This is icc 14.0.0 on "super", maybe you have some other version to try?

lei-april commented 9 years ago

icc 15.0.2 works fine on my Linux VM. I'd guess that's an icc issue, but I'm not sure.

magnumripper commented 9 years ago

Added PBKDF2-HMAC formats for MD4 and MD5 just to make testparas.pl better. And changed number of iterations to 1000 for all of them:

gcc version 5.1.0 (Homebrew gcc5 5.1.0)

John the Ripper password cracker, version 1.8.0.6-jumbo-1-bleeding_omp [darwin14.3.0 64-bit AVX-autoconf]

| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| md4 | 18836 | 29861 | 33100 | 33424 | 30780 |
| md4-omp | 79520 | 120128 | 121920 | 120192 | 112000 |
| md5 | 13532 | 21584 | 24420 | 22976 | 21465 |
| md5-omp | 60736 | 86400 | 87920 | 84352 | 79360 |
| sha1 | 10736 | 10952 | 8928 | 4032 | 3740 |
| sha1-omp | 41312 | 39744 | 34176 | 19968 | 19840 |
| sha256 | 4664 | 2384 | 3516 | 3952 | 4120 |
| sha256-omp | 16736 | 10560 | 13782 | 15207 | 14891 |
| sha512 | 1881 | 839 | 1290 | 1512 | 1524 |
| sha512-omp | 6848 | 3808 | 4800 | 5639 | 5386 |

lei-april commented 9 years ago

> Not sure if it's a compiler bug or something nasty in the code. I haven't seen it anywhere else. This is icc 14.0.0 on "super", maybe you have some other version to try?

This seems similar to https://github.com/magnumripper/JohnTheRipper/issues/1452#issuecomment-113814032 on MIC. I turned off auto-vectorization, and the problem is gone.

lei-april commented 9 years ago

Here's the right result on MIC (I'll delete the previous one):

icc version 14.0.0 (gcc version 4.4.7 compatibility)

John the Ripper password cracker, version 1.8.0.6-jumbo-1-bleeding_omp [linux-gnu 64-bit mic-autoconf]

| hash\para | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| md4 | 5687 | 6526 | 6510 | 6209 | 6196 |
| md4-omp | 669148 | 737882 | 711529 | 662588 | 466019 |
| md5 | 4182 | 4942 | 5037 | 5005 | 5048 |
| md5-omp | 520871 | 536854 | 513267 | 462291 | 447378 |
| sha1 | 2598 | 2321 | 1411 | 1415 | 1346 |
| sha1-omp | 282352 | 253514 | 180705 | 173886 | 163018 |
| sha256 | 1077 | 855 | 830 | 887 | 880 |
| sha256-omp | 119300 | 97882 | 96000 | 98642 | 97627 |
| sha512 | 123 | 137 | 154 | 165 | 172 |
| sha512-omp | 15567 | 17614 | 19525 | 20389 | 21333 |

lei-april commented 9 years ago

BTW, we've been stalling at this issue for a while. Which direction should we push in next?

magnumripper commented 9 years ago

Please report those latest figures on john-dev, including a notice we are now benching "PBKDF2-HMAC" even for MD4 and MD5.

Also, let's bring that question to john-dev for discussion. What would you prefer doing now? Actually it might be a good time to review the project definition and make a timeline for the remaining tasks (also listing what is already done). Please do so, and we'll discuss it on john-dev.

magnumripper commented 9 years ago

@zzlei our mic.h still defaults to para 1 for all hashes. Perhaps we should set it to 2 for MD4/MD5? Then I think we can close this issue for now.

lei-april commented 9 years ago

@magnumripper I agree. I'll change it.