Add interleaving to primary SHA-2 intrinsics functions.

magnumripper commented 9 years ago

This is a GSoC task.

Some of this is already done: SIMD_PARA_SHA256 and SIMD_PARA_SHA512 are defined in some header files, and some formats do use them in index calculations (although they are currently hard-coded to 1). I think some formats do not yet. The SHA-2 functions in sse-intrinsics.c need to have a little code added.

[x] add needed code in sse-intrinsics.c (totally trivial, just mimic SHA-1).
[x] move all existing defines of the SIMD_PARA_SHA256 and SIMD_PARA_SHA512 macros to arch.h (x86-sse.h, x86-64.h and mic.h).
[x] verify all callers honor the macros (with both set to >1 any format not using them should crash or fail).
[x] rebase the old names (eg. MD5_SSE_PARA) to this new style (eg. SIMD_PARA_MD5).
[x] tweak code, verifying output asm for good results.
[x] benchmark good values for defining in arch.h (at least mic.h and x86-64.h).
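
A minimal sketch of what the arch.h defines from the checklist above might look like, assuming guarded defaults of 1 until benchmarks justify something higher. The value given to `SIMD_COEF_32` is a placeholder and `keys_per_sha256_call` is a hypothetical helper; neither is taken from the real headers.

```c
/* Hypothetical arch.h fragment: interleave factors default to 1. */
#ifndef SIMD_COEF_32
#define SIMD_COEF_32 4          /* assumed: 32-bit lanes per vector (SSE2) */
#endif
#ifndef SIMD_PARA_SHA256
#define SIMD_PARA_SHA256 1      /* raise only after benchmarking */
#endif
#ifndef SIMD_PARA_SHA512
#define SIMD_PARA_SHA512 1
#endif

/* Hypothetical helper: keys handled by one interleaved SHA-256 call. */
static int keys_per_sha256_call(void)
{
    return SIMD_COEF_32 * SIMD_PARA_SHA256;
}
```

A format that sizes its buffers from the product of both macros, rather than from `SIMD_COEF_32` alone, keeps working when the interleave factor is raised.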

lei-april commented 9 years ago

I have now added interleaving to SHA512, mostly mimicking SHA1, and it works when SIMD_PARA_SHA512 = 1. I noticed some formats are already using SIMD_PARA_SHA512, e.g. sapH and Office, so I gave them a try. But they failed to work when SIMD_PARA_SHA512 is set to anything other than 1.

I'm not sure whether I've done something wrong when adding interleaving to SHA512, or those formats are not using SIMD_PARA_SHA512 the proper way. Do you have any thoughts?

magnumripper commented 9 years ago

I would guess the formats lack some little detail. If you watch for every mention of SHA1_SSE_PARA and ensure the SHA512 version has a corresponding SIMD_PARA_SHA512, you should be almost set.

But there are also index calculations. They are harder to find because they do not really use the para macro. You need to check every mention of SIMD_COEF_64 and verify that it does not calculate an index without honoring interleaving. Sunmd5 has both variants - here's the one that does NOT honor interleaving:

#define GETPOS(i, index)            ( (((index)&(SIMD_COEF_32-1))<<2) + (((i)&(0xffffffff-3))*SIMD_COEF_32) + ((i)&3) )

Almost the same macro but honoring interleaving:

#define PARAGETPOS(i, index)        ( (((index)&(SIMD_COEF_32-1))<<2) + (((i)&(0xffffffff-3))*SIMD_COEF_32) + ((i)&3) + (((unsigned int)index/SIMD_COEF_32*SIMD_COEF_32)<<6) )
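
A small sketch of the difference, wrapping the two macros above as functions so they can be compared. `SIMD_COEF_32` is assumed to be 4 here (SSE2), and the function names are made up for the example.

```c
#define SIMD_COEF_32 4  /* assumed value for this example */

/* GETPOS as quoted above: ignores which SIMD group the key is in. */
static unsigned int getpos(unsigned int i, unsigned int index)
{
    return ((index & (SIMD_COEF_32 - 1)) << 2) +
           ((i & (0xffffffff - 3)) * SIMD_COEF_32) + (i & 3);
}

/* PARAGETPOS as quoted above: adds 64 bytes per lane for every
 * complete group of SIMD_COEF_32 keys that precedes this key. */
static unsigned int paragetpos(unsigned int i, unsigned int index)
{
    return getpos(i, index) +
           ((index / SIMD_COEF_32 * SIMD_COEF_32) << 6);
}
```

With these values, `getpos(0, 4)` returns 0, the same byte as key 0, while `paragetpos(0, 4)` returns 256, the start of the second group's buffer. That aliasing is what makes a format fail as soon as the interleave factor goes above 1.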

So, in Office we have a good GETPOS macro (similar to the latter above) but it's only used for byte access. What about the index calculations for 32-bit or 64-bit access?

Line 456

        // Iteration counter in first 4 bytes
        for (j = 0; j < SHA512_LOOP_CNT; j++)
            keys32[j * 2 + j/SIMD_COEF_64*32*SIMD_COEF_64 + 1] = i_be;

The telltale part is j/SIMD_COEF_64*xxx so this one seems to be complete. Same for lines 464-465. Actually I can't spot a single place in Office where it's missing. But I'm really not sure. You might need to add debug prints. If self test says cmp_all(5) failed, dump data for index 4 in some places and verify everything looks like it should.
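
One way to double-check such a 32-bit index is to derive it from the lane and the group separately. The sketch below assumes a little-endian host, a buffer laid out as [group][word][lane] in 64-bit elements, and `SIMD_COEF_64` = 2; the helper names are hypothetical.

```c
#include <stddef.h>

#define SIMD_COEF_64 2   /* assumed: 64-bit lanes per vector (SSE2) */
#define SHA512_WORDS 16  /* 64-bit words in one 128-byte SHA-512 block */

/* Index, in 64-bit elements, of word `word` of key `j`. */
static size_t pos64(size_t j, size_t word)
{
    size_t lane  = j % SIMD_COEF_64;
    size_t group = j / SIMD_COEF_64;

    return group * SHA512_WORDS * SIMD_COEF_64 + word * SIMD_COEF_64 + lane;
}

/* Index, in 32-bit elements, of the upper half of that word
 * (little-endian host assumed, hence the + 1). */
static size_t pos32_hi(size_t j, size_t word)
{
    return 2 * pos64(j, word) + 1;
}
```

For word 0 this reduces to `(j % SIMD_COEF_64) * 2 + j / SIMD_COEF_64 * 32 * SIMD_COEF_64 + 1`. Under the assumed layout the first term is the lane, which equals `j * 2` only within the first group, so that is one thing worth comparing against when dumping data for an index in the second group.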

lei-april commented 9 years ago

I added interleaving to SHA256 and managed to make it work with a few formats. Here're some statistics obtained from experimenting with the interleaving factor (pwsafe, sybasease & aix-ssha256, tested on well):

SIMD_PARA_SHA256 = 1

Benchmarking: pwsafe, Password Safe [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 2048
Raw: 44224 c/s real, 5528 c/s virtual

Benchmarking: sybasease, Sybase ASE [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Many salts: 8421K c/s real, 1052K c/s virtual
Only one salt: 7503K c/s real, 935644 c/s virtual

Benchmarking: aix-ssha256, AIX LPA {ssha256} [PBKDF2-SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 64
Raw: 567296 c/s real, 70823 c/s virtual

SIMD_PARA_SHA256 = 2

Benchmarking: pwsafe, Password Safe [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 2048
Raw: 34978 c/s real, 4404 c/s virtual

Benchmarking: sybasease, Sybase ASE [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Many salts: 6815K c/s real, 853034 c/s virtual
Only one salt: 6291K c/s real, 791378 c/s virtual

Benchmarking: aix-ssha256, AIX LPA {ssha256} [PBKDF2-SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 64
Raw: 461306 c/s real, 58240 c/s virtual

SIMD_PARA_SHA256 = 4

Benchmarking: pwsafe, Password Safe [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 2048
Raw: 35738 c/s real, 4483 c/s virtual

Benchmarking: sybasease, Sybase ASE [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Many salts: 6810K c/s real, 862958 c/s virtual
Only one salt: 6553K c/s real, 857801 c/s virtual

Benchmarking: aix-ssha256, AIX LPA {ssha256} [PBKDF2-SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 64
Raw: 473088 c/s real, 59062 c/s virtual

SIMD_PARA_SHA256 = 8

Benchmarking: pwsafe, Password Safe [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 2048
Raw: 40554 c/s real, 5075 c/s virtual

Benchmarking: sybasease, Sybase ASE [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Many salts: 7453K c/s real, 986015 c/s virtual
Only one salt: 7340K c/s real, 983918 c/s virtual

Benchmarking: aix-ssha256, AIX LPA {ssha256} [PBKDF2-SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 64
Raw: 523152 c/s real, 65637 c/s virtual

It seems interleaving doesn't give much help here.

BTW, when adding interleaving to SHA256, I realized that I might have done something wrong with SHA512, which renders it only functional with SIMD_PARA_SHA512 = 1

magnumripper commented 9 years ago

It's a pity it didn't bring any gain. What exact CPU was this, a desktop Haswell? Some older or future CPU (including the MIC, you should definitely try it!) may show better results. Also, SHA512 still might show a gain.

We should implement it fully and commit it anyway (but using para 1 for now) even if we can't find any current CPU which benefits from it. It might eventually get used for non-Intel hardware; the pseudo-intrinsics aren't necessarily bound to Intel intrinsics.

Maybe here's a plan:

1. Test this on the MIC too and post the results. BTW you should also post them (as well as the above, or just link to it) to john-dev for Solar to comment.
2. Commit what you've got now (if it's ready for it) but obviously keep the para defined to 1. Me and others can help out fixing the rest of the SHA256 formats sooner or later.
3. Finish SHA512 (at least one or two formats) and test this too on AVX2 and on the MIC, and post the results. And commit that too.

lei-april commented 9 years ago

The previous experimentation was done on well, so it's Haswell.

Some formats don't work with the new SHA256 & SHA512 at the moment, so I don't think it's a good idea to commit to bleeding-jumbo. I'm currently working on another temporary branch, interleaving, in my repo. Is there some branch in the public repo for committing unstable code? unstable-jumbo looks like one.

magnumripper commented 9 years ago

No, unstable-jumbo is an old branch based on core 1.7.9

But there should be no problem at all committing it as long as SIMD_PARA_SHAXX is defined to 1, right?

lei-april commented 9 years ago

> But there should be no problem at all committing it as long as SIMD_PARA_SHAXX is defined to 1, right?

Yes, that's right. I got it.

magnumripper commented 9 years ago

After 1629e65 no formats segfault for me even with ASan, but several still fail (using para 2). Now only a bunch of details left to fix :laughing:

magnumripper commented 9 years ago

@jfoug @zzlei I fail to see what is wrong with some SHA512 formats when trying with PARA 2 or 3. A very good example is Drupal7. Look at git diff b96ed88fc5009^ drupal7_fmt_plug.c. Very simple fixes, I really can't see anything missing. It should work. So I've been looking into SHA512 in sse-intrinsics.c but can't see anything wrong there either.

Also, all formats using pbkdf2_hmac_sha512.h fail. But there's nothing wrong with it!?

I'll give up on them for now and concentrate on trivial fixes for SHA224/256 for a while.

magnumripper commented 9 years ago

The only SHA256 format that fails is raw-sha256. All the rest are SHA512.

Cloudkeychain is PBKDF2-SHA512 but it doesn't use the shared function, it has a copy of its own. And that one, for some reason, works (or rather, it passes self-test. The tests don't catch all bugs).

magnumripper commented 9 years ago

After e229de81, all SHA256 formats pass the Test Suite. Most or all SHA512 formats fail (they might pass self-test but not the Test Suite).

I'm pretty sure that raw-sha512 (among others) are 100% right now (after 885c3cba5) but they still fail. I have been staring at sse-intrinsics.c a lot but can't see any problem there either.

magnumripper commented 9 years ago

On a side note I'm seeing good results for interleaving SHA256 on Haswell core i7 AVX2 (4790, gcc 4.8.2, 8xOMP w/ HT)

Raw-SHA256

1 34325K
2 32768K
3 35979K
4 38535K (+12%)

Raw-SHA384 (buggy code though - result may change)

With an older core i7 mobile (AVX, gcc 4.9.2, 8xOMP w/ HT), I see no gain from interleaving, only a loss.

magnumripper commented 9 years ago

Gotcha.

@zzlei you did right writing it like this

    SHA512_PARA_DO(i) memcpy(w[i], &data[i * 16], 16 * sizeof(vtype));

I was confused by the fact that SHA512 has a different way of "expanding" the buffer from 16 to 80, so I erroneously changed it (somewhat misled by your comment "something's not right here", which turns out to be incorrect - it was 100% right).

magnumripper commented 9 years ago

After ce1f6e4, all formats pass self-test. Now on to Test Suite.

magnumripper commented 9 years ago

$ OMP_NUM_THREADS=3 ./jtrts.pl sha2 -q
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.13, Dec 21, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.4-jumbo-1-bleeding_omp [darwin14.3.0 64-bit AVX-autoconf]
--------------------------------------------------------------------------------
Warning: SAP-B format should never be UTF-8.
Use --target-encoding=iso-8859-1 or whatever is applicable.
All tests passed without error.  Performed 35 tests.  Time used was 213 seconds

lei-april commented 9 years ago

> SHA512_PARA_DO(i) memcpy(w[i], &data[i * 16], 16 * sizeof(vtype));

Well, when I first tested the interleaved SHA512, every format which takes the above code path failed self-test. I guess that statement is somehow incorrect, but memcpy(w, &data, 16 * sizeof(vtype) *SIMD_PARA_SHA512) doesn't seem right to me, either. So I just left that comment. Sorry for misleading you :)

magnumripper commented 9 years ago

It does make sense since we have w[80] and something like data[16]. The reasons they failed self-test were other things, now fixed.
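
A scalar sketch of why the per-state copy is the right shape, assuming `w` is declared as one 80-entry schedule per interleaved state and `data` holds 16 words per state back to back. `vtype` is a plain `uint64_t` stand-in for the real vector type, and `load_schedules` is a made-up name.

```c
#include <stdint.h>
#include <string.h>

#define SIMD_PARA_SHA512 2      /* assumed interleave factor */
typedef uint64_t vtype;         /* scalar stand-in for a SIMD vector */

/* Copy 16 input words per interleaved state into that state's w[80]. */
static void load_schedules(vtype w[SIMD_PARA_SHA512][80], const vtype *data)
{
    int i;

    for (i = 0; i < SIMD_PARA_SHA512; i++)
        memcpy(w[i], &data[i * 16], 16 * sizeof(vtype));
}
```

A single `memcpy` of `16 * SIMD_PARA_SHA512` elements would instead put the second state's input into `w[0][16..31]`, because consecutive states are 80 elements apart in `w` but only 16 apart in `data`.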

magnumripper commented 9 years ago

On a different note, we might want to try interleaving differently. Here's part of SHA1 (MD4 and MD5 are made similarly):

#define SHA1_ROUND2(a,b,c,d,e,F,t) \
    SHA1_PARA_DO(i) tmp3[i] = tmpR[i*16+(t&0xF)]; \
    SHA1_EXPAND2(t+16) \
    F(b,c,d) \
    SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp[i] ); \
    SHA1_PARA_DO(i) tmp[i] = vroti_epi32(a[i], 5); \
    SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp[i] ); \
    SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], cst ); \
    SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp3[i] ); \
    SHA1_PARA_DO(i) b[i] = vroti_epi32(b[i], 30);

And here's how it goes in SHA256 and SHA512

#define SHA256_STEP0(a,b,c,d,e,f,g,h,x,K)                    \
{                                                            \
    SHA256_PARA_DO(i)                                        \
    {                                                        \
        w = _w[i].w;                                         \
        tmp1[i] = vadd_epi32(h[i],    S1(e[i]));             \
        tmp1[i] = vadd_epi32(tmp1[i], Ch(e[i],f[i],g[i]));   \
        tmp1[i] = vadd_epi32(tmp1[i], vset1_epi32(K));       \
        tmp1[i] = vadd_epi32(tmp1[i], w[x]);                 \
        tmp2[i] = vadd_epi32(S0(a[i]),Maj(a[i],b[i],c[i]));  \
        d[i]    = vadd_epi32(tmp1[i], d[i]);                 \
        h[i]    = vadd_epi32(tmp1[i], tmp2[i]);              \
    }                                                        \
}

One difference is that the former is almost guaranteed to be fully unrolled while the latter might not.

magnumripper commented 9 years ago

Another thing I realized is SHA512 has a larger footprint. It uses w[80] throughout the function while SHA256 uses w[16] and handles the expansion in STEP_R. We should try changing that and see what happens. I think we should discuss both these things with Solar, he usually can tell by heart what it would mean to caches and pipelines :)
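
For reference, a scalar sketch of the two ways of handling the SHA-512 message schedule: expanding all 80 words up front versus a 16-word sliding window that is updated as the rounds progress. The sigma functions follow FIPS 180-4; the function names are made up, and the real code works on vectors rather than `uint64_t`.

```c
#include <stdint.h>

#define ROTR64(x, n) (((x) >> (n)) | ((x) << (64 - (n))))
#define s0(x) (ROTR64(x, 1)  ^ ROTR64(x, 8)  ^ ((x) >> 7))
#define s1(x) (ROTR64(x, 19) ^ ROTR64(x, 61) ^ ((x) >> 6))

/* Full schedule: w[0..15] hold the input, w[16..79] are filled in. */
static void expand_full(uint64_t w[80])
{
    int t;

    for (t = 16; t < 80; t++)
        w[t] = s1(w[t - 2]) + w[t - 7] + s0(w[t - 15]) + w[t - 16];
}

/* Sliding window: slot t % 16 still holds W[t-16] and is overwritten
 * with W[t]. Must be called for t = 16, 17, ... in order. */
static uint64_t expand_step(uint64_t w[16], int t)
{
    w[t & 15] += s1(w[(t - 2) & 15]) + w[(t - 7) & 15] + s0(w[(t - 15) & 15]);
    return w[t & 15];
}
```

Both produce the same schedule words; the window keeps a fifth of the data live per interleaved state, at the cost of doing the expansion inside the round loop.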

lei-april commented 9 years ago

> One difference is that the former is almost guaranteed to be fully unrolled while the latter might not.

I did notice the difference when I wrote it. But honestly I didn't get the point of wrapping each statement with a for clause. How does that help unrolling?

magnumripper commented 9 years ago

Maybe it gets unrolled anyway... but unrolled or not, the order of instructions ends up different (or rather, we leave more to the optimizer - which may be good or bad).

Current SHA1:

    foo(w[0]);
    foo(w[1]);
    foo(w[2]);
    foo(w[3]);
    bar(w[0]);
    bar(w[1]);
    bar(w[2]);
    bar(w[3]);

Current SHA2:

    foo(w[0]);
    bar(w[0]);
    foo(w[1]);
    bar(w[1]);
    foo(w[2]);
    bar(w[2]);
    foo(w[3]);
    bar(w[3]);

Assuming 'foo' is actually a load and 'bar' is an operation, the difference might be significant. OTOH assuming a good optimizer, this will be shuffled around to the best anyway.
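
The two source orders can be reproduced with a tiny trace, sketched below. `PARA_DO` is assumed to be a plain `for` loop over the interleave factor (here 4), and `foo`/`bar` just record that they were called. This only shows the order as written in the source; what the optimizer emits can differ.

```c
#include <stddef.h>

#define PARA 4
#define PARA_DO(x) for ((x) = 0; (x) < PARA; (x)++)

static char trace[4 * PARA + 1];
static size_t pos;

static void foo(int i) { trace[pos++] = 'f'; trace[pos++] = (char)('0' + i); }
static void bar(int i) { trace[pos++] = 'b'; trace[pos++] = (char)('0' + i); }

/* SHA-1 style: every statement gets its own loop. */
static const char *order_per_statement(void)
{
    int i;

    pos = 0;
    PARA_DO(i) foo(i);
    PARA_DO(i) bar(i);
    trace[pos] = '\0';
    return trace;
}

/* SHA-2 style: one loop around the whole step. */
static const char *order_per_body(void)
{
    int i;

    pos = 0;
    PARA_DO(i) { foo(i); bar(i); }
    trace[pos] = '\0';
    return trace;
}
```

`order_per_statement()` yields `f0f1f2f3b0b1b2b3` and `order_per_body()` yields `f0b0f1b1f2b2f3b3`, matching the two listings above.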

lei-april commented 9 years ago

> After ce1f6e4, all formats pass self-test. Now on to Test Suite.

Awesome. Guess I can start benchmarking for MIC now?

magnumripper commented 9 years ago

Please do. I think you should now use pbkdf2-hmac-sha256 and pbkdf2-hmac-sha512 for benching.

For each "para" between 1 and 5 (or more), try using 240 threads as well as 60 forks x 4 threads. And perhaps even using no threads. I have a feeling you should also try decreasing OMP_SCALE to 1 for pbkdf2-hmac-sha256 (it is 1 already for pbkdf2-hmac-sha512).

Please state exact compiler version when documenting, and post results to john-dev.

magnumripper commented 9 years ago

BTW the positive results I saw for Haswell seem to be gone now.

i7-4790 (HT) AVX2 gcc 4.8.2

hash\para	1	2	3	4	5
SHA256	950	784	752	768	784
SHA512	143	136	106	107	113
8x SHA256	3692	2742	2671	2898	2931
8x SHA512	533	506	390	426	400

E5-2670 (HT) AVX gcc 4.8.2

hash\para	1	2	3	4	5
SHA256	320	248	252	264	269
SHA512	47.0	33.6	34.6	33.0	36.0
14x SHA256	2661	2018	2067	2036	2133
14x SHA512	392	316	305	268	285

AMD 8435 (six-core) SSE2 gcc 4.8.2

hash\para	1	2	3	4	5
SHA256	208	196	192	192	194
SHA512	33.6	32.4	25.4	26.8	25.6
6x SHA256	1235	1168	1152	1163	1161
6x SHA512	198	194	152	161	153

lei-april commented 9 years ago

240 threads on MIC:

pbkdf2-hmac-sha256
x1 Raw: 9142 c/s real, 38.0 c/s virtual
x2 Raw: 6592 c/s real, 27.9 c/s virtual
x3 Raw: 7078 c/s real, 29.6 c/s virtual
x4 Raw: 7253 c/s real, 30.4 c/s virtual
x5 Raw: 7305 c/s real, 30.5 c/s virtual

pbkdf2-hmac-sha512
x1 Raw: 426 c/s real, 1.8 c/s virtual
x2 Raw: 474 c/s real, 1.9 c/s virtual
x3 Raw: 482 c/s real, 2.0 c/s virtual
x4 Raw: 506 c/s real, 2.1 c/s virtual
x5 Raw: 509 c/s real, 2.1 c/s virtual

BTW, OMP_SCALE = 1 and OMP_SCALE = 4 have nearly the same performance on MIC.

lei-april commented 9 years ago

Somehow I couldn't get useful info from a forked run of pbkdf2-hmac-sha256/512. I used the same settings as when I benchmarked raw-md4(5), but only got output like:

Will run 4 OpenMP threads per process (240 total across 60 processes)
Node numbers 1-60 of 60 (fork)
Session stopped (max run-time reached)

I tried increasing --max-run, but to no avail. Something wrong here?

magnumripper commented 9 years ago

I guess the total time for one crypt_all() is longer than the timer abort grace time. I think it's equivalent to pressing 'q' and then press it again after 30 seconds unless session has ended. Could it be that one call is longer than that? Maybe you used many salts? Try only loading one.

magnumripper commented 9 years ago

Posted to john-dev: http://www.openwall.com/lists/john-dev/2015/05/29/13

lei-april commented 9 years ago

> I guess the total time for one crypt_all() is longer than the timer abort grace time. I think it's equivalent to pressing 'q' and then press it again after 30 seconds unless session has ended. Could it be that one call is longer than that? Maybe you used many salts? Try only loading one.

I'm cracking only one hash here. And even when I increase --max-run to 5 mins, the results are still all zeroes.

[zhanglei@mic0 zhanglei]$ run/john --format=pbkdf2-hmac-sha256 --mask=?l?l?l?l?l?l?l?l --fork=60 --max-run=300 hash.sha256
Loaded 1 password hash (PBKDF2-HMAC-SHA256 [PBKDF2-SHA256 512/512 MIC 16x])
Will run 4 OpenMP threads per process (240 total across 60 processes)
Node numbers 1-60 of 60 (fork)
41 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
32 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
30 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
52 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
48 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
47 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
49 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
50 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
60 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
Press 'q' or Ctrl-C to abort, almost any other key for status
1 0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
Waiting for 59 children to terminate
Session stopped (max run-time reached)

But self-test doesn't take long for pbkdf2-hmac-sha256.

magnumripper commented 9 years ago

That's very strange. You should debug it and sort out what happens. Maybe your test hash has a much higher iteration count? Or maybe some bug makes it spin forever.

magnumripper commented 9 years ago

@zzlei be sure to read http://www.openwall.com/lists/john-dev/2015/05/30/6 and follow-ups

magnumripper commented 9 years ago

core i7 (AVX) laptop, 8xOMP/HT, OMP_SCALE=1, gcc 4.9.2

testpara.sh run in bleeding-jumbo

hash\para	1	2	3	4	5
MD4	29963K	41507K	43383K	42435K	41394K
MD5	99008	134336	143809	142848	132960
SHA1	40608	40384	24192	19643	17920
SHA256	1489	1317	1280	1291	1358
SHA512	233	182	182	171	169

testpara.sh run in topic branch "intrinsics-loops"

hash\para	1	2	3	4	5
MD4	30368K	40815K	46460K	45496K	44937K
MD5	98656	138368	150273	150016	125600
SHA1	40832	38848	35904	28134	24480
SHA256	1474	1353	1345	1366	1398
SHA512	233	182	184	179	184

magnumripper commented 9 years ago

SHA1 variants. @jfoug's SHA-1 expansion is ingenious, can't beat that. But lots of redundant temporary space can be dropped!

tweak\para	1	2	3	4	5
tmpR -> w[16]	40672	39360	27264	20096	18851
tmp[i] -> tmp	41120	38590	36192	28544	24712

magnumripper commented 9 years ago

testpara.sh in bleeding-jumbo after c81f637a7, now with assorted things from that topic branch merged.

core i7 (AVX) laptop, 8xOMP/HT, OMP_SCALE=1, gcc 4.9.2

hash\para	1	2	3	4	5
MD4	28092K	39937K	45921K	45297K	43778K
MD5	96352	137216	150432	149248	125440
SHA1	40480	38083	35712	28416	23445
SHA256	1362	1353	1371	1366	1384
SHA512	228	184	184	181	180

I believe the slight regressions seen in SHA-2 are random variations.

magnumripper commented 9 years ago

testpara.sh in bleeding-jumbo after d85f8fd, now with tmp[SIMD_PARA] back (only that change).

core i7 (AVX) laptop, 8xOMP/HT, OMP_SCALE=1, gcc 4.9.2

hash\para	1	2	3	4	5
MD4	29197K	41156K	46727K	46756K	43805K
MD5	89056	138432	150624	149888	117280
SHA1	40672	38208	34788	27776	20480
SHA256	1520	1344	1304	1315	1358
SHA512	228	181	182	169	173

magnumripper commented 9 years ago

Decreased w[80] pad of SHA512 to w[16] using same sliding-window technique as SHA1 & SHA256. Just a little gain... if any. Interestingly enough it had a bad effect on interleaving. Perhaps I should try the opposite with SHA1 and see what happens.

core i7 (AVX) laptop, 8xOMP/HT, OMP_SCALE=1, gcc 4.9.2

hash\para	1	2	3	4	5
SHA512	235	156	168	164	164

magnumripper commented 9 years ago

Tried a few other things

"narrow for loops"

hash\para	1	2	3	4	5
SHA256	1489	869	838	837	806
SHA512	235	137	122	119	125

"drop tmp2"

hash\para	1	2	3	4	5
SHA256	1536	1317	1317	1340	1371
SHA512	230	186	184	182	184

I'm just throwing things at it and seeing what sticks. This is probably pointless, nothing is conclusive except reading asm output...

magnumripper commented 9 years ago

icc 14.0.0 (super) AVX 32x (NT is non-OMP)

NOTE: figures fluctuate a lot between runs despite no load on the system.

current bleeding-jumbo (518b8c4e)

hash\para	1	2	3	4	5
MD4	28708K	40725K	43426K	41231K	39213K
MD5	505472	672000	372480	347648	576640
SHA1	86400	86528	44483	45623	46257
SHA256	3513	2816	4942	4970	3200
SHA512	`FAILED (cmp_all(1))`

I can't reproduce that SHA512 failure elsewhere.

Older code (pre cde0fb47):

hash\para	1	2	3	4	5
MD4	28849K	40491K	43419K	39478K	37279K
MD5	295296	341248	682752	317845	560000
SHA1	91648	88832	68815	37512	33280
SHA256	3764	2681	2671	2844	2990
SHA512	`FAILED (cmp_all(1))`

Current bleeding, but with per-line loops:

hash\para	1	2	3	4	5
MD4	28570K	39664K	43190K	39783K	38362K
MD5	302592	340736	338688	611840	296554
SHA1	86656	92160	124800	35328	32000
SHA256	3513	1812	2021	1721	1443
SHA512	`FAILED (cmp_all(1))`

magnumripper commented 9 years ago

Well, AVX2, current code (a277eb6)

gcc version 4.9.2 (GCC)

hash\para	1	2	3	4	5
nt	60519K	83153K	90451K	92775K	84013K
nt-omp	81330K	89391K	88276K	77594K	66355K
md5crypt	56440	84496	93384	91552	79480
md5crypt-omp	325056	385536	394560	368128	320320
pbkdf2-hmac-sha1	25896	25328	23477	15588	14680
pbkdf2-hmac-sha1-omp	109696	99072	89856	59136	57920
pbkdf2-hmac-sha256	934	776	736	745	769
pbkdf2-hmac-sha256-omp	3712	2886	2823	2953	3047
pbkdf2-hmac-sha512	149	120	117	121	119
pbkdf2-hmac-sha512-omp	576	448	436	457	436

With per-line interleaving loops:

hash\para	1	2	3	4	5
nt	60391K	82787K	88486K	87884K	82609K
nt-omp	81559K	88801K	86310K	77725K	69795K
md5crypt	56696	82240	90552	85728	79200
md5crypt-omp	325184	380032	376704	344064	318400
pbkdf2-hmac-sha1	25888	25376	21216	19552	18400
pbkdf2-hmac-sha1-omp	109632	98432	80832	74752	69386
pbkdf2-hmac-sha256	936	887	800	823	653
pbkdf2-hmac-sha256-omp	3738	3355	3200	3200	2509
pbkdf2-hmac-sha512	150	139	126	117	99
pbkdf2-hmac-sha512-omp	576	523	514	449	369

super, AVX, 16xOMP

gcc version 4.8.1 20130715 (Red Hat 4.8.1-4) (GCC)

hash\para	1	2	3	4	5
nt	29035K	40249K	45447K	44843K	35481K
nt-omp	62455K	64356K	62029K	65273K	64225K
md5crypt	24324	37008	41244	40736	27640
md5crypt-omp	338368	512384	569856	577792	393280
pbkdf2-hmac-sha1	11252	11064	7848	6816	4320
pbkdf2-hmac-sha1-omp	160640	159104	112348	98560	57920
pbkdf2-hmac-sha256	400	308	308	323	320
pbkdf2-hmac-sha256-omp	5824	4266	4705	4430	4660
pbkdf2-hmac-sha512	63	43	44	44	43
pbkdf2-hmac-sha512-omp	461	651	628	621	625

With per-line interleaving loops:

hash\para	1	2	3	4	5
nt	29326K	39912K	43238K	42755K	42564K
nt-omp	62357K	62521K	64585K	65667K	63733K
md5crypt	24340	36472	38520	38064	36960
md5crypt-omp	339264	495616	556800	533504	522880
pbkdf2-hmac-sha1	11256	11216	5664	3136	2792
pbkdf2-hmac-sha1-omp	160640	161024	94859	47872	40960
pbkdf2-hmac-sha256	400	160	128	120	140
pbkdf2-hmac-sha256-omp	5760	2316	1864	1774	2018
pbkdf2-hmac-sha512	63	25	20	16	20
pbkdf2-hmac-sha512-omp	909	369	276	253	230

All OpenMP figures are OMP_SCALE 1, except NT.

lei-april commented 9 years ago

How did SHA512 fail self-test? Is it specific to icc?

magnumripper commented 9 years ago

Not sure if it's a compiler bug or something nasty in the code. I haven't seen it anywhere else. This is icc 14.0.0 on "super", maybe you have some other version to try?

lei-april commented 9 years ago

icc 15.0.2 works fine on my Linux VM. I'd guess that's an icc issue, but I'm not sure.

magnumripper commented 9 years ago

Added PBKDF2-HMAC formats for MD4 and MD5 just to make testparas.pl better. And changed number of iterations to 1000 for all of them:

gcc version 5.1.0 (Homebrew gcc5 5.1.0)
John the Ripper password cracker, version 1.8.0.6-jumbo-1-bleeding_omp [darwin14.3.0 64-bit AVX-autoconf]

hash\para	1	2	3	4	5
md4	18836	29861	33100	33424	30780
md4-omp	79520	120128	121920	120192	112000
md5	13532	21584	24420	22976	21465
md5-omp	60736	86400	87920	84352	79360
sha1	10736	10952	8928	4032	3740
sha1-omp	41312	39744	34176	19968	19840
sha256	4664	2384	3516	3952	4120
sha256-omp	16736	10560	13782	15207	14891
sha512	1881	839	1290	1512	1524
sha512-omp	6848	3808	4800	5639	5386

lei-april commented 9 years ago

> Not sure if it's a compiler bug or something nasty in the code. I haven't seen it anywhere else. This is icc 14.0.0 on "super", maybe you have some other version to try?

This seems to be a similar issue to https://github.com/magnumripper/JohnTheRipper/issues/1452#issuecomment-113814032 on MIC. I turned off auto-vectorization, and the problem is gone.

lei-april commented 9 years ago

Here's the right result on MIC (I'll delete the previous one):

icc version 14.0.0 (gcc version 4.4.7 compatibility)
John the Ripper password cracker, version 1.8.0.6-jumbo-1-bleeding_omp [linux-gnu 64-bit mic-autoconf]

hash\para	1	2	3	4	5
md4	5687	6526	6510	6209	6196
md4-omp	669148	737882	711529	662588	466019
md5	4182	4942	5037	5005	5048
md5-omp	520871	536854	513267	462291	447378
sha1	2598	2321	1411	1415	1346
sha1-omp	282352	253514	180705	173886	163018
sha256	1077	855	830	887	880
sha256-omp	119300	97882	96000	98642	97627
sha512	123	137	154	165	172
sha512-omp	15567	17614	19525	20389	21333

lei-april commented 9 years ago

BTW, we've been stalling at this issue for a while. Which direction should we push in next?

magnumripper commented 9 years ago

Please report those latest figures on john-dev, including a notice we are now benching "PBKDF2-HMAC" even for MD4 and MD5.

Also, let's bring that question to john-dev for discussion. What would you prefer doing now? Actually it might be a good time to review the project definition and make a timeline for the remaining tasks (also listing what is already done). Please do so, and we'll discuss it on john-dev.

magnumripper commented 9 years ago

@zzlei our mic.h still defaults to para 1 for all hashes. Perhaps we should set it to 2 for MD4/MD5? Then I think we can close this issue for now.

lei-april commented 9 years ago

@magnumripper I agree. I'll change it.
