Closed magnumripper closed 5 years ago
@solardiz we can do that and assign any number to yescrypt (to make it known) in function c3_subformat_algorithm(). Not sure what to report as the "second" tunable cost for yescrypt. (So far, we only report one additional tunable cost per algorithm in the generic crypt format.)
I now dropped the requirement of passing build-bots since they are too slow. Merge responsibly 😄.
@frank-dittrich I think we shouldn't over-complicate the generic crypt(3) format for these hashes just yet. I'd be fine with not reporting their parameters, and not even detecting matching salts (although that's fairly easy to add). I just wanted to know that it works with these hashes fine when treating them as unknowns - I'm happy that it does. And it should probably in fact list "0:unknown" for such cases.
If you'd like to make them known, then I propose number 7 for scrypt and 8 for yescrypt, and don't report their other tunable costs just yet (we need to do that in specialized formats).
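For reference, the crypt(3) prefixes involved are "$7$" for scrypt and "$y$" for yescrypt. A minimal sketch of how such a mapping might look follows; only the numbers 7 and 8 come from this proposal, while the function body and helper name are illustrative, not the actual c3_subformat_algorithm() code:

```c
#include <string.h>

/* Illustrative sketch only, not the actual c3_subformat_algorithm()
 * code: map a crypt(3) hash prefix to an algorithm number. The
 * values 7 (scrypt, "$7$") and 8 (yescrypt, "$y$") follow the
 * proposal above; unrecognized hashes stay "0:unknown". */
static int subformat_algorithm(const char *ciphertext)
{
	if (!strncmp(ciphertext, "$7$", 3))
		return 7; /* scrypt */
	if (!strncmp(ciphertext, "$y$", 3))
		return 8; /* yescrypt */
	return 0; /* unknown */
}
```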
I'm up to full speed now, although still stumped by #3735. I've come to understand quite a bit more of Sayantan's cryptic code, I'm sure I will nail it tomorrow. Then hopefully I'll manage to tick off most other issues assigned to me, hopefully including #3489, in short time.
Everyone should take a stab at some formats in #3091 - it's trivial and there's a HOW-TO for it!
benchmark-unify will probably need an adjustment, to match old format names to format names currently in use.
The last adjustment of mappings I did was commit 2e3cd50d772cfb52219c0796c43f4396f071f3df from April, 2015, when I mapped NT-old to NT.
@magnumripper Great! I think we'll release on April 8 with whatever we have by then, so let's try to bring our tree in as much of a shape for the release as we reasonably can.
@frank-dittrich I'd appreciate it if you help bring benchmark-unify up to date. Thanks!

@solardiz I think the previous jumbo release was 1.8.0.6-jumbo-1.
I checked out 1.8.0.6-jumbo-1 and tried to build it, to compare these format names with current format names, but I failed to build that version.
make[1]: *** No rule to make target 'dynamic_big_crypt.c', needed by 'dynamic_big_crypt.o'. Stop.
make[1]: Leaving directory '/home/fd/git/JtR/src'
make: *** [Makefile:181: default] Error 2
ideas?
@frank-dittrich No, the previous jumbo release was 1.8.0-jumbo-1. You can download a tarball of it off JtR homepage. I don't know what 1.8.0.6-jumbo-1 is - I guess some tag in this repo? Anyway, it was never released as such, it's just one of many intermediary version numbers we had.
1.8.0-jumbo-6 was a tag that meanwhile has been removed from magnum's git repo.
I checked out 1.8.0-jumbo-1, got some syntax errors as well. (Doesn't prevent me from mapping in general, but will most likely cause problems when benchmarking that old release with newer tool chains.)
What to do in cases like these: EFS has been dropped in favour of DPAPImk, but the two EFS self-tests (with lower iteration count) are now the last two self-tests of DPAPImk, so they don't matter in benchmarks.
@frank-dittrich I think it's sufficient that you implement mapping for trivial renames, ignoring any non-trivial cases (we won't be able to have relbench compare results for those, but that's OK'ish).
Unfortunately I got much less time than planned to work on Jumbo in the last few days; hopefully I can spend several hours tonight and many hours tomorrow wrapping things up.
You have some/a lot of administrative duties. Plus testing, after stabilization.
Do not overload yourself.
My tentative plan is to finalize 1.9.0 core today, and 1.9.0-jumbo-1 tomorrow. (Yes, magnum will need to merge the changes from core to jumbo in time.)
For the benchmarks length issue, I am considering making BENCHMARK_LENGTH actually mean just that. Right now, it's the length at which we split Short vs. Long password benchmarks, but that's a little-used feature - maybe we can enable it (in the rare cases where it's needed) e.g. by specifying the corresponding negative number. We can then have a john.conf setting enabling old-style benchmarks, where all test vectors' plaintexts are used as-is - this will be solely for use with relbench against the previous releases. This should be a fairly easy change in core. We need to evaluate whether it's also an easy change for jumbo. I think it might be - just revert some previous jumbo changes to bench.c, and we'll also be able to revert/exclude the benchmarking with mask by default if we have to do that for now (because of the bugs this uncovered), yet standardize on length 7 for benchmarks using code that would come from core. I welcome comments on this idea, especially from magnum. Thanks!
Looks like it should be easy to adapt jumbo to my potential redefinition of benchmark_length semantics. The only use of values other than 0 and -1 is inherited from core:
$ cat *.[ch] | grep '^#define.*BENCHMARK_LENGTH' | sed -r 's/[[:space:]]+/ /g' | sort -u
#define BENCHMARK_LENGTH -1
#define BENCHMARK_LENGTH -1 /* change to 0 once there's any speedup for "many salts" */
#define BENCHMARK_LENGTH -1 // only 1 salt is pretty much same speed.
#define BENCHMARK_LENGTH 0
#define BENCHMARK_LENGTH 8
#define NSLDAP_BENCHMARK_LENGTH -1
#define WINZIP_BENCHMARK_LENGTH -1
#define XSHA512_BENCHMARK_LENGTH 0
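If the redefinition goes through, the interpretation could look something like this sketch. It is purely illustrative: only the "negative value re-enables the old split" idea and the default of 7 come from the discussion above, while the function and parameter names are assumed, not actual bench.c code:

```c
/* Illustrative sketch of the proposed BENCHMARK_LENGTH semantics
 * (not actual bench.c code): a nonnegative value would mean
 * "benchmark at exactly this plaintext length", while a negative
 * value would re-enable the old Short vs. Long split at the
 * corresponding absolute length. */
static void interpret_benchmark_length(int bl, int *fixed_len, int *split_len)
{
	if (bl >= 0) {
		*fixed_len = bl; /* e.g. 7 for the proposed default */
		*split_len = 0;  /* no Short/Long split */
	} else {
		*fixed_len = 0;
		*split_len = -bl; /* split Short vs. Long at |bl| */
	}
}
```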
The benchmark_length idea sounds good to me.
Great. I am working on some other (really minor) core changes now (cleanups, extra make targets), and will likely approach its benchmarks rework next.
I'm focusing fully at #3764 / #3249 / #3697 until it's either fixed or we opt to just revert the changes for now (I'm hoping not to end up reverting).
Yes, that's what we need now, @magnumripper. Thank you!
Meanwhile, I've just pushed these changes to core:
1.8.0.17:
- Always use 3x interleaving for bcrypt on x86-64.
- Added linux-arm64le and linux-sparc64 make targets.
I created core's arm64le.h anew, not based on jumbo's but rather based on my own testing/tuning on two different AArch64 CPUs. I don't know whether/how jumbo's differs (haven't looked yet), but I suspect it might differ in common definitions between core and jumbo (if so, core's should prevail) and in having jumbo additions (those should be re-added on top of core's base file).
My further changes to core before the release will likely be limited to the benchmarks and to documentation edits.
I'm afraid the non-mask self-test, mask-benchmark stuff will have to be reverted. I could try fixing all formats (would take a few hours) but we'll need a fair amount of testing. Should I go for that or revert?
@magnumripper I'm fine with reverting these. Please have -test -mask print a warning that it doesn't actually run self-tests, then. And expect to merge my benchmark changes from core, so that we'll standardize on length-7 benchmarks by default in that other way for now. Thanks!
@magnumripper I've just pushed my benchmark changes to core. Please merge. Thanks!
Overview
[1] on powerpc Altivec, 32-bit BE
dynamic_assign_script_to_format() the dynamic.cmp_all/cmp_one() failed. This expression can not be handled by john!
Benchmarking: dynamic=md5($p) [Dynamic RDP]... FAILED (cmp_all(1))
[2] it is afl: we had been running without failures for a couple of months, maybe a year.
[*] Fuzzing test case #295 (299 total, 1 uniq crashes found)...
I'll concentrate on #3091 now until it's finished or until it has to yield to something of higher priority.
So I wanted to release no later than today, but we're still desperately making important "last-minute" changes, and then some (re-)testing will be needed. I think it'd be unwise to release now just for the sake of meeting the previously set schedule. So let's continue for a few days more, and release when we feel we're sort of ready.
@claudioandre-br Thank you for that "Overview". I'd appreciate it if you look into that afl crash, if it's still happening.
@frank-dittrich Thank you for the relbench-related fixes. After @magnumripper is done with #3091, I'd appreciate it if you (or/and any other volunteer[s]? @AlekseyCherepanov maybe?) also actually run relbench, perhaps separately for --disable-openmp and --enable-openmp-for-fast-formats builds - just these two - on an otherwise idle system and/or setting OMP_NUM_THREADS to slightly fewer than the number of CPUs you have. Please use these settings to make the benchmarks comparable to what we had in 1.8.0-jumbo-1 (yes, make two builds of that old version as well):
[Debug]
Benchmarks_1_8 = Y
There's going to be a password cracking contest in the next few days, with @AlekseyCherepanov leading team john-users. I think this will provide some testing to whatever state bleeding-jumbo will be in by and during that time. https://crackthecon.com "The contest will be running from April 10 - 20:30 to April 12 - 20:30 UTC"
I will not pick any more tasks related to release.
So let's continue for a few days more
Sure. Just a heads-up: I will be almost completely off grid from this Friday evening, and for two weeks, give or take. AFAIK I will be able to contribute a good amount of time up to then though.
Testing current Jumbo on 32-bit and 64-bit Raspberry Pis. Autoconf and building work fine. --disable-simd builds work fine too. All ASIMD formats (that I looked at) are slower than NEON, and all 64-bit non-SIMD formats are slower than the 32-bit ones. I was mostly concerned about config/build though, and some results may be low due to thermal throttling.
32-bit with NEON
Benchmarking: descrypt, traditional crypt(3) [DES 128/128 NEON]... (4xOMP) DONE
Warning: "Many salts" test limited: 109/256
Many salts: 1785K c/s real, 445350 c/s virtual
Only one salt: 1490K c/s real, 372736 c/s virtual
Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 NEON 4x2]... (4xOMP) DONE
Raw: 19840 c/s real, 4960 c/s virtual
Benchmarking: NT [MD4 128/128 NEON 4x2]... DONE
Raw: 5667K c/s real, 5667K c/s virtual
32-bit without NEON
Benchmarking: descrypt, traditional crypt(3) [DES 32/32]... (4xOMP) DONE
Warning: "Many salts" test limited: 171/256
Many salts: 700416 c/s real, 175104 c/s virtual
Only one salt: 636704 c/s real, 159571 c/s virtual
Benchmarking: md5crypt, crypt(3) $1$ [MD5 32/32 X2]... (4xOMP) DONE
Raw: 11815 c/s real, 2953 c/s virtual
Benchmarking: NT [MD4 32/32]... DONE
Raw: 1532K c/s real, 1532K c/s virtual
64-bit with ASIMD
Benchmarking: descrypt, traditional crypt(3) [DES 128/128 ASIMD]... (4xOMP) DONE
Warning: "Many salts" test limited: 62/256
Many salts: 1005K c/s real, 251437 c/s virtual
Only one salt: 859754 c/s real, 214938 c/s virtual
Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 ASIMD 4x2]... (4xOMP) DONE
Raw: 11659 c/s real, 2914 c/s virtual
Benchmarking: NT [MD4 128/128 ASIMD 4x2]... DONE
Raw: 3457K c/s real, 3457K c/s virtual
64-bit without ASIMD
Benchmarking: descrypt, traditional crypt(3) [DES 64/64]... (4xOMP) DONE
Warning: "Many salts" test limited: 109/256
Many salts: 892928 c/s real, 222121 c/s virtual
Only one salt: 778240 c/s real, 193592 c/s virtual
Benchmarking: md5crypt, crypt(3) $1$ [MD5 32/64 X2]... (4xOMP) DONE
Raw: 6699 c/s real, 1670 c/s virtual
Benchmarking: NT [MD4 32/64]... DONE
Raw: 998656 c/s real, 998656 c/s virtual
@magnumripper What compiler/version did you use? Is it "the same" for 32-bit and 64-bit (just different options) or different versions/builds? I suspect the slowdown with 64-bit is a compiler shortcoming. My other guess is the CPU might use a lower clock rate when running 64-bit code.
Edit: I'd also compare single-threaded benchmarks, to make throttling less likely in those, even though for actual use on this very system of course performance with all cores in use matters.
Both are gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) (well, the 32-bit one actually says gcc version 6.3.0 20170516 (Raspbian 6.3.0-18+rpi1+deb9u1)).
The NT benchmarks are single-threaded and the CPU was probably fairly cool in all four cases (some of the others were run as --format=*crypt, but that one was alone). Yet the 64-bit versions are significantly slower.
@frank-dittrich Now that @magnumripper is done with #3091, I'd appreciate it if you (or/and any other volunteer[s]?) run relbench, perhaps separately for --disable-openmp and --enable-openmp-for-fast-formats builds - just these two - on an otherwise idle system and/or setting OMP_NUM_THREADS to slightly fewer than the number of CPUs you have. Please use these settings to make the benchmarks comparable to what we had in 1.8.0-jumbo-1 (yes, make two builds of that old version as well):
[Debug]
Benchmarks_1_8 = Y
@solardiz should I tweak autoconf to add -fno-strict-aliasing to arm64, but not to arm32?
...or perhaps rather to ASIMD but not to NEON?
@magnumripper I think so, yes - unless you got warnings in your 32-bit builds? It appears that there's presumed aliasing potential between SIMD vectors and certain vector component types, so we're safe with non-aliasing-based optimizations enabled when we use one of those combinations. On x86, this appears to be between SIMD and both 32- and 64-bit vector component types. On AArch64, it appears the compiler assumes SIMD can't alias 64-bit components, but I think on 32-bit ARM it can alias 32-bit components, presumably because of how ARM's spec is written, with NEON being vectors of 32-bit elements.
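As a generic illustration of the aliasing issue being discussed (this is not JtR's actual code): under -fstrict-aliasing the compiler may assume that objects of unrelated types never overlap, so reinterpreting memory through an incompatible pointer type is undefined behavior, while memcpy() expresses the same reinterpretation safely and typically compiles to the same loads.

```c
#include <stdint.h>
#include <string.h>

/* Illustration (not JtR code): reading part of a uint64_t through a
 * uint32_t pointer is undefined under strict aliasing, and the
 * compiler is allowed to optimize as if it never happens. */
static uint32_t low32_unsafe(const uint64_t *v)
{
	return *(const uint32_t *)v; /* UB under strict aliasing */
}

/* The well-defined alternative: copy the bytes out. Note this picks
 * the first 4 bytes, which is the low half only on little-endian. */
static uint32_t low32_safe(const uint64_t *v)
{
	uint32_t lo;
	memcpy(&lo, v, sizeof(lo));
	return lo;
}
```

-fno-strict-aliasing sidesteps the issue globally by telling the compiler not to make the non-aliasing assumption at all, at some cost in optimization opportunities.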
Oh, I see now, it got rid of the dreaded warnings. But yes, I get the warnings for ARM32 also. So I'll add it for ASIMD and NEON but not for non-SIMD builds then.
Confirmed again, and done. I believe you should add it to core Makefile's linux-arm32le-neon target as well.
OK, "Added -fno-strict-aliasing to OPT_INLINE for linux-arm32le-neon as well", but I haven't tested this. Can you help test core on your RPi as well, using all 3 ARM targets? Thanks!
@magnumripper Your jumbo autoconf change looks like it adds -fno-strict-aliasing to CFLAGS rather than only to OPT_INLINE. This may have a performance impact on more than has to be affected to work around the issue we have in DES_bs_b.c.
We could also try putting:
#ifdef __GNUC__
#pragma GCC optimize ("no-strict-aliasing")
#endif
just in the affected sections of DES_bs_b.c (which I think are NEON/ASIMD and AltiVec).
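If that pragma is used, GCC's push/pop mechanism can confine the override to just the affected section of the file. A hedged sketch follows; the stub function and its name are purely illustrative, not DES_bs_b.c code:

```c
/* Sketch of confining the optimization override to one section of a
 * file using GCC's push/pop options around the pragma quoted above.
 * The stub is illustrative only; clang accepts but ignores the GCC
 * optimize pragma, hence the guard. */
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize ("no-strict-aliasing")
#endif

/* ... the NEON/ASIMD/AltiVec sections would go here ... */
static unsigned int section_stub(unsigned int x)
{
	return x ^ 0x55;
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#endif
```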
@magnumripper Your jumbo autoconf change looks like it adds -fno-strict-aliasing to CFLAGS rather than only to OPT_INLINE. This may have a performance impact on more than has to be affected to work around the issue we have in DES_bs_b.c.
No, IIRC we use a function that normally adds to CFLAGS and if result was positive we revert it and add to OPT_INLINE instead.
$ grep alias Makefile
OPT_INLINE = -fno-strict-aliasing
OK, "Added -fno-strict-aliasing to OPT_INLINE for linux-arm32le-neon as well", but I haven't tested this. Can you help test core on your RPi as well, using all 3 ARM targets? Thanks!
All targets build with or without OpenMP, without any warnings and self-tests OK. NEON or ASIMD is used where expected.
https://gist.github.com/magnumripper/28d9a47b459a668ec5e083530e284fca
fallback chain is AVX2 -> XOP -> AVX -> SSE4.1 -> SSE2 (is SSE4.1 needed at all?)
Maybe SSSE3 is a more important step (it's available on Core 2 and provides PSHUFB, which can be used for some rotate counts). I don't recall how much use of SSE4.1 we make over SSSE3. But it's fine to have that. Some recent Intel Atom CPUs lack AVX, but have SSE4.1, and indeed there are older non-Atom CPUs with only SSE* too.
At least SSE 4.1 is needed to build Stribog-256 and Stribog-512 at all. Not sure if those formats are of any importance though.
SSE 4.2 is definitely the least important of all, the only thing you get in Jumbo is (a lot) faster CRC-32C.
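For context, CRC-32C differs from zlib's CRC-32 only in its polynomial (Castagnoli, reflected form 0x82F63B78), and it is this variant that the SSE 4.2 crc32 instruction computes in hardware. A minimal portable bit-at-a-time version, for illustration only (not Jumbo's actual, much faster implementation):

```c
#include <stdint.h>
#include <stddef.h>

/* Portable bit-at-a-time CRC-32C (Castagnoli polynomial, reflected
 * form 0x82F63B78) -- the checksum the SSE 4.2 crc32 instruction
 * computes in hardware. Illustrative only, not Jumbo's actual code. */
static uint32_t crc32c(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint32_t crc = 0xFFFFFFFFU;

	while (len--) {
		int i;
		crc ^= *p++;
		for (i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78U & -(crc & 1));
	}
	return ~crc;
}
```

The standard check value for CRC-32C is 0xE3069283 for the input "123456789".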
@magnumripper Thank you for testing core on your RPi. These benchmark results show that there's a 2x'ish slowdown for most things when going from 32- to 64-bit, including for scalar-only code. I suspect limitations of the compiler or/and the hardware, not shortcomings of our source code.
I would say hardware. Definitely not our code and highly unlikely gcc. Perhaps a RPi is designed with 32-bit in mind, taking some short-cuts that harm 64-bit mode. Or whatever.
Anyway, it's good to see that both core and Jumbo behave as expected in all other aspects!
These are more for fun than reliable data.
Benchmarking: dynamic=md5(sha1($s).md5($p)) [128/128 ASIMD 4x2]... DONE
Many salts: 3186K c/s real, 3186K c/s virtual
Only one salt: 1570K c/s real, 1570K c/s virtual
This expression will use the RDP dynamic compiler format.
Benchmarking: dynamic=md5(sha1($s.$p).md5($p)) [Dynamic RDP]... DONE
Many salts: 298480 c/s real, 298480 c/s virtual
Only one salt: 290640 c/s real, 290640 c/s virtual
Benchmarking: dynamic=md5($p) [128/128 ASIMD 4x2]... DONE
Raw: 6916K c/s real, 6916K c/s virtual
Benchmarking: dynamic=md5(sha1($s).md5($p)) [128/128 NEON 4x2]... DONE
Many salts: 2905K c/s real, 2905K c/s virtual
Only one salt: 1386K c/s real, 1386K c/s virtual
This expression will use the RDP dynamic compiler format.
Benchmarking: dynamic=md5(sha1($s.$p).md5($p)) [Dynamic RDP]... DONE
Many salts: 154604 c/s real, 155120 c/s virtual
Only one salt: 153488 c/s real, 154000 c/s virtual
Benchmarking: dynamic=md5($p) [128/128 NEON 4x2]... DONE
Raw: 5954K c/s real, 5954K c/s virtual
Testing: sha1crypt, NetBSD's sha1crypt [PBKDF1-SHA1 128/128 ASIMD 4x]... (4xOMP)
sha1crypt OMP autotune using test db with iteration count of 64000
OMP scale 1: 32 crypts (1x32) in 0.302812 seconds, 105 c/s +
Autotuned OMP scale 1, preset is 4
Testing: sha256crypt, crypt(3) $5$ (rounds=5000) [SHA256 128/128 ASIMD 4x]... (4xOMP)
Loaded 7 hashes with 5 different salts to test db from test vectors
sha256crypt OMP autotune using test db with iteration count of 5000
OMP scale 1: 512 crypts (1x512) in 0.620643 seconds, 824 c/s +
Autotuned OMP scale 1, preset is 2
Testing: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 ASIMD 2x]... (4xOMP)
Loaded 6 hashes with 4 different salts to test db from test vectors
sha512crypt OMP autotune using test db with iteration count of 5000
OMP scale 1: 256 crypts (1x256) in 0.510378 seconds, 501 c/s +
Testing: sha1crypt, NetBSD's sha1crypt [PBKDF1-SHA1 128/128 ASIMD 4x]... (4xOMP)
Loaded 4 hashes with 2 different salts to test db from test vectors
sha1crypt OMP autotune using test db with iteration count of 64000
OMP scale 1: 32 crypts (1x32) in 0.435939 seconds, 73 c/s +
Autotuned OMP scale 1, preset is 4
Testing: sha256crypt, crypt(3) $5$ (rounds=5000) [SHA256 128/128 NEON 4x]... (4xOMP)
Loaded 7 hashes with 5 different salts to test db from test vectors
sha256crypt OMP autotune using test db with iteration count of 5000
OMP scale 1: 512 crypts (1x512) in 1.245014 seconds, 411 c/s +
Autotuned OMP scale 1, preset is 2
Testing: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 NEON 2x]... (4xOMP)
Loaded 6 hashes with 4 different salts to test db from test vectors
sha512crypt OMP autotune using test db with iteration count of 5000
OMP scale 1: 256 crypts (1x256) in 0.822177 seconds, 311 c/s +
All: I would still like someone to help with relbench runs, aka issue #2914.
I've done a lot of MPI/node/fork tests, had to commit a minor cosmetic thing (d973b6d7f) but otherwise everything seems fine and dandy. And groovy!
Great work, @magnumripper.
All: I am still waiting for someone to volunteer for the relbench runs, aka issue #2914. This is urgent.
I intend to finalize 1.9.0 core today, then get some sleep, then finalize 1.9.0-jumbo-1 tomorrow and release both at once. I hope @magnumripper will merge the final batch of core changes (perhaps documentation only) promptly.
As long as I have them no later than by midnight (24h from now) I can do it. After that I'm gone for a while.
Let's list / discuss here what we need to do before a release. Or rather, a list of NEEDs, one of NICEs and one of DON'Ts.
See also #1879