openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Rules engine performance #3468

Open · magnumripper opened this issue 5 years ago

magnumripper commented 5 years ago

The rules engine is a bottleneck. Can we make it faster?

See also #3467

solardiz commented 5 years ago

Thanks, magnum. My thoughts on some of these:

The rules engine pre-dates the introduction of the command-line --min-length option usable with all cracking modes. Yes, it makes sense to me to export this value to the rules engine via some pre-defined variable.

Implicit rule rejection when a command doesn't change the word would definitely have side effects. Many rules currently depend on some of their commands being no-ops on a given input, as long as either not all of them are or the rule is in fact meant to be a no-op (yet try the input word). Adding a rule flag to explicitly enable this new behavior is a good idea. Implementing it in the existing rule interpreter would probably have a slight performance impact, unless we duplicate the code (e.g., use a force-inline function and inline it twice with different values for the flag parameter, so that the compiler would customize the code - but I'm not sure this is worth the complexity).
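
To illustrate the force-inline idea, here is a minimal sketch (hypothetical code, not JtR's actual interpreter): one worker body instantiated twice with a compile-time constant flag, so the compiler drops the dead branch in each copy.

#include <string.h>

#if defined(__GNUC__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

/* Hypothetical worker: applies 'rule' to 'word' in place. With
   'reject_noop' a compile-time constant, the final check costs nothing
   in the specialization that doesn't want it. */
static FORCE_INLINE char *apply_rule_body(const char *rule, char *word,
                                          int reject_noop)
{
    char before[128]; /* sketch: assumes 'word' fits */

    strcpy(before, word);
    (void)rule; /* sketch: the real code would interpret the commands here */

    if (reject_noop && !strcmp(before, word))
        return NULL; /* the rule left the word unchanged: reject it */
    return word;
}

/* The two specializations the compiler customizes. */
static char *apply_rule(const char *rule, char *word)
{
    return apply_rule_body(rule, word, 0);
}

static char *apply_rule_reject_noop(const char *rule, char *word)
{
    return apply_rule_body(rule, word, 1);
}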

I don't see why adding >8 to your example NT rule would speed it up. Isn't the Q after the first T8 even better (tests the length and the character being a letter)?

I also thought of enhancing the rules optimizer, but I am not sure. For now, instead of unifying the single-character commands into string commands, I optimized the single-character ones' implementations to special-case repeats of 2 or 3 such commands (they handle such repeats without falling back out to the main command loop; the performance difference is especially significant for prepends). In my testing, the resulting performance was on par with, or sometimes even slightly better than, the string commands'. So we could even unify the other way around: short string commands into groups of 2 or 3 single-character ones. I expect the string commands would reliably win for length 4+.
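
A rough sketch of why batching helps prepends in particular (hypothetical code; the real implementation is in rules.c): a naive interpreter shifts the whole word once per ^X command, whereas a run of two or three prepends can be handled with a single shift.

#include <string.h>

/* Prepend a run of 'n' characters (n = 2 or 3 in the special case)
   with one memmove instead of one per command. */
static void prepend_run(char *word, const char *chars, size_t n)
{
    size_t len = strlen(word);

    memmove(word + n, word, len + 1); /* one shift, including the NUL */
    memcpy(word, chars, n);           /* drop the whole run in front */
}

For example, ^b ^a applied to "cat" yields "abcat"; batched, that is prepend_run(word, "ab", 2), shifting the word once instead of twice (note the run's characters end up in reverse command order).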

solardiz commented 5 years ago

Another optimizer idea I had is to optimize into "bytecode" commands not intended for manual use - e.g., we could have some single-byte control code that represents roughly the equivalent of Az" and is followed by another byte giving the string length. We could also have a number of single-byte opcodes for appending small string lengths (e.g., byte values 1 to 15 append strings of those lengths, byte values 16 to 30 prepend strings of length 1 to 15, and byte value 31 is an append followed by another byte giving the string length; we could also use values in the range of 128 to 255, since we don't intend to ever have normal rule commands there). This would be an internal JtR layer, and subject to change (in case we find a better way to reclaim some of those byte values). A question is how to report those in john.log (translate them back to user-accessible commands?)
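
A sketch of what decoding such opcodes could look like (the byte layout above is only a proposal, and these helper names are made up):

#include <string.h>

static void bc_append(char *word, const unsigned char *s, size_t n)
{
    size_t len = strlen(word);

    memcpy(word + len, s, n);
    word[len + n] = 0;
}

static void bc_prepend(char *word, const unsigned char *s, size_t n)
{
    memmove(word + n, word, strlen(word) + 1);
    memcpy(word, s, n);
}

/* Decode one opcode at 'p', apply it to 'word', and return the next
   opcode position: 1..15 append that many bytes, 16..30 prepend 1..15
   bytes, 31 is an append with an explicit length byte. */
static const unsigned char *bc_step(const unsigned char *p, char *word)
{
    unsigned int c = *p++;

    if (c >= 1 && c <= 15) {
        bc_append(word, p, c);
        p += c;
    } else if (c >= 16 && c <= 30) {
        bc_prepend(word, p, c - 15);
        p += c - 15;
    } else if (c == 31) {
        size_t n = *p++;

        bc_append(word, p, n);
        p += n;
    }
    return p;
}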

solardiz commented 5 years ago

Yet another idea is to compile to a VM that uses computed gotos, like we do in compiler.c when compiling with GCC. This removes the big switch translation layer from command characters/bytes to command handler addresses. But then we have to provide fallback code for non-GCC compilers.
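
In miniature, and with made-up opcodes, the pattern would look like this (GCC and Clang take the computed-goto path; everything else falls back to a switch):

#include <ctype.h>
#include <string.h>

enum { OP_LOWER, OP_REVERSE, OP_END };

static void lower(char *word)
{
    for (char *p = word; *p; p++)
        *p = tolower((unsigned char)*p);
}

static void reverse(char *word)
{
    for (size_t i = 0, j = strlen(word); j-- > i; i++) {
        char t = word[i]; word[i] = word[j]; word[j] = t;
    }
}

static void vm_run(const unsigned char *pc, char *word)
{
#if defined(__GNUC__)
    /* Each opcode byte indexes straight into a table of label addresses,
       so dispatch is a single indirect jump with no switch in between. */
    static const void *dispatch[] = { &&op_lower, &&op_reverse, &&op_end };
#define NEXT() goto *dispatch[*pc++]
    NEXT();
op_lower:
    lower(word);
    NEXT();
op_reverse:
    reverse(word);
    NEXT();
op_end:
    return;
#undef NEXT
#else
    /* Portable fallback: same opcodes, dispatched through a switch. */
    for (;;)
        switch (*pc++) {
        case OP_LOWER:
            lower(word);
            break;
        case OP_REVERSE:
            reverse(word);
            break;
        default:
            return; /* OP_END */
        }
#endif
}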

magnumripper commented 5 years ago

> I don't see why adding >8 to your example NT rule would speed it up. Isn't the Q after the first T8 even better (tests the length and the character being a letter)?

I thought so too, but it was definitely faster with >8. We might want to investigate why, for starters...

magnumripper commented 5 years ago

I guess we should analyze --rules:all after the preprocessor and see what kind of enhanced dupe rejection would give the most payback. For example, sorting several consecutive commands or flags in e.g. ASCII order where order doesn't matter, and ignoring a trailing Q (where M is not used after some intermediate commands) for dupe rejection, e.g. >3 <5 r l Q --> <5>3lrQ. Ignoring Q for rejection would get tricky though: if one of the actual rules really had a Q and another didn't, we want to keep the one that did.
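
One of those canonicalization steps in miniature (a sketch only; real rules would first need tokenizing, since commands take arguments of varying width): sort a run of order-independent commands so that equivalent spellings compare equal for dupe rejection.

#include <stdlib.h>
#include <string.h>

static int cmp_cmd(const void *a, const void *b)
{
    return memcmp(a, b, 2);
}

/* Sort 'n' two-character commands (e.g. ">3", "<5") in place, so that
   ">3<5" and "<5>3" produce the same dupe-rejection key. One-character
   commands such as 'l' and 'r' would need a similar width-1 pass. */
static void canon_sort2(char *cmds, size_t n)
{
    qsort(cmds, n, 2, cmp_cmd);
}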

Perhaps an optimizer step that can look at a complete rule and figure out if/how it makes a net change to word length, and add rejections accordingly. For example, [ ^@ --> <+ [ ^@ and [ A0"123" --> ->3 <- [ A0"123".
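
A sketch of how such an optimizer pass could start out (hypothetical; it models only a few commands with a known length effect and gives up on anything else):

/* Compute a rule's net effect on word length, if it can be determined.
   Returns 1 and stores the delta in '*delta' on success, or 0 if the
   rule contains a command this simple model doesn't cover. */
static int net_length_delta(const char *rule, int *delta)
{
    int d = 0;
    const char *p;

    for (p = rule; *p; p++)
        switch (*p) {
        case '[': case ']':     /* delete first/last character */
            d--;
            break;
        case '$': case '^':     /* append/prepend one character */
            d++;
            if (!*++p)          /* malformed: missing the argument */
                return 0;
            break;
        case ' ': case ':':     /* separators and explicit no-op */
            break;
        default:
            return 0;           /* unknown command: give up */
        }
    *delta = d;
    return 1;
}

With a known positive delta the optimizer could prepend a matching maximum-length rejection, and with a negative delta a minimum-length one, as in the examples above.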

magnumripper commented 5 years ago

> I don't see why adding >8 to your example NT rule would speed it up. Isn't the Q after the first T8 even better (tests the length and the character being a letter)?

Verified this again (this time on macOS):

[List.Rules:NT]
:
-c T0Q
-c T1QT[z0]
-c T2QT[z0]T[z1]
-c T3QT[z0]T[z1]T[z2]
-c T4QT[z0]T[z1]T[z2]T[z3]
-c T5QT[z0]T[z1]T[z2]T[z3]T[z4]
-c T6QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]
-c T7QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]
-c T8QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]
-c T9QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]
-c TAQT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]T[z9]
-c TBQT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]T[z9]T[zA]
-c TCQT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]T[z9]T[zA]T[zB]
-c TDQT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]T[z9]T[zA]T[zB]T[zC]

[List.Rules:NTrr]
:
-c T0Q
-c ->2 T1QT[z0]
-c ->3 T2QT[z0]T[z1]
-c ->4 T3QT[z0]T[z1]T[z2]
-c ->5 T4QT[z0]T[z1]T[z2]T[z3]
-c ->6 T5QT[z0]T[z1]T[z2]T[z3]T[z4]
-c ->7 T6QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]
-c ->8 T7QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]
-c ->9 T8QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]
-c ->A T9QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]
-c ->B TAQT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]T[z9]
-c ->C TBQT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]T[z9]T[zA]
-c ->D TCQT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]T[z9]T[zA]T[zB]
-c ->E TDQT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]T[z9]T[zA]T[zB]T[zC]

[List.Rules:NTrrwr]
:
-c T0Q
-c ->2 >1 T1QT[z0]
-c ->3 >2 T2QT[z0]T[z1]
-c ->4 >3 T3QT[z0]T[z1]T[z2]
-c ->5 >4 T4QT[z0]T[z1]T[z2]T[z3]
-c ->6 >5 T5QT[z0]T[z1]T[z2]T[z3]T[z4]
-c ->7 >6 T6QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]
-c ->8 >7 T7QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]
-c ->9 >8 T8QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]
-c ->A >9 T9QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]
-c ->B >A TAQT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]T[z9]
-c ->C >B TBQT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]T[z9]T[zA]
-c ->D >C TCQT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]T[z9]T[zA]T[zB]
-c ->E >D TDQT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7]T[z8]T[z9]T[zA]T[zB]T[zC]

$ rm -f ../run/john.pot && ../run/john scraped.sam -form:nt -w:r30k.lst -v:1 -max-len=8 -ru:nt
Using default input encoding: UTF-8
Loaded 5294 password hashes with no different salts (NT [MD4 256/256 AVX2 8x3])
Press 'q' or Ctrl-C to abort, almost any other key for status
331g 0:00:00:14 DONE (2018-11-25 15:27) 22.42g/s 144299p/s 144299c/s 755484KC/s RENTERIA..KIKUMARU
Session completed

$ rm -f ../run/john.pot && ../run/john scraped.sam -form:nt -w:r30k.lst -v:1 -max-len=8 -ru:ntrr
Using default input encoding: UTF-8
Loaded 5294 password hashes with no different salts (NT [MD4 256/256 AVX2 8x3])
Press 'q' or Ctrl-C to abort, almost any other key for status
331g 0:00:00:14 DONE (2018-11-25 15:28) 22.45g/s 144495p/s 144495c/s 755484KC/s RENTERIA..KIKUMARU
Session completed

$ rm -f ../run/john.pot && ../run/john scraped.sam -form:nt -w:r30k.lst -v:1 -max-len=8 -ru:ntrrwr
Using default input encoding: UTF-8
Loaded 5294 password hashes with no different salts (NT [MD4 256/256 AVX2 8x3])
Press 'q' or Ctrl-C to abort, almost any other key for status
331g 0:00:00:09 DONE (2018-11-25 15:28) 36.21g/s 233026p/s 233026c/s 1175MC/s RENTERIA..KIKUMARU
Session completed

$ perl <r30k.lst -ne 'print unless length()<9'|wc -l
    9948
$ perl <r30k.lst -ne 'print unless length()>=9'|wc -l
   20052

Here, adding a rule reject flag made no difference (I think it should have) and adding a word reject rule did (it shouldn't have). In my earlier tests under Linux, using slightly different (and more) test data, I think I saw some gain from the rule reject flag as well. Also, in my earlier tests only the wall-clock time got much better while the p/s figure appeared worse (which is expected).

magnumripper commented 5 years ago

> I don't see why adding >8 to your example NT rule would speed it up. Isn't the Q after the first T8 even better (tests the length and the character being a letter)?

I looked at this more closely and it's pretty obvious really: at any length, >8 is a simple integer compare (the word length was already calculated anyway), while T8Q will re-scan the full word (all of it, even when it's way longer than 8). The latter is obviously slower, and to add insult to injury this difference in performance is multiplied by 256 at length 8 in the NT rule (doubling again for each added length).
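
In miniature (hypothetical helpers, just to show the asymmetry):

#include <string.h>

/* ">8": the length is already tracked, so this is a single compare. */
static int reject_len_le8(size_t len)
{
    return len <= 8; /* nonzero = reject */
}

/* "Q": has to compare the whole (possibly much longer) word against
   the copy saved before the rule's commands ran. */
static int reject_noop(const char *word, const char *saved, size_t len)
{
    return !memcmp(word, saved, len + 1); /* nonzero = reject */
}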

However, this did not explain the fact that ->N did not gain anything (it really should!), and upon a closer look it turned out the --max-len=N option did not trigger them - the format's original max. length was still considered. In my defense, I'm pretty sure it did work long ago, but the functionality has evolved over time and at some point, when we got better separation between the format's length, the requested length and FMT_TRUNC, some glue got lost. I have now fixed this, and the result is that ->N makes a huge difference (a 26x boost in my test case); while >N still makes for another 5-6% boost, it became less significant. The fix is in 821b46f3, and a later commit adds both length rejection types to the NT ruleset.

magnumripper commented 5 years ago

#3471 adds a min_length constant as soon as we agree on what character to use (the current version uses #).

magnumripper commented 5 years ago

Another idea: if we process the rule reject flags once before the preprocessor (any that don't involve it) and once right after it, we may reduce memory use and gain some performance - especially for cases where we restart rules (like in -pipe mode). None of the current rejection flags' conditions can change during a run.

solardiz commented 5 years ago

"to add insult to injury this difference in performance is multiplied by 256 at length 8 in the NT rule (and doubling for each length)." - I think this isn't the case.

magnumripper commented 5 years ago

"to add insult to injury this difference in performance is multiplied by 256 at length 8 in the NT rule (and doubling for each length)." - I think this isn't the case.

Please elaborate. As far as I understand, -c T8QT[z0]T[z1]T[z2]T[z3]T[z4]T[z5]T[z6]T[z7] expands to 256 rules (each of the eight T[zN] classes contributes two alternatives, so 2^8 = 256), so the slow T8Q will be evaluated 256 times. At best, instruction and data caches help a little.

solardiz commented 5 years ago

Oh, you're right. I was thinking of something else.

magnumripper commented 5 years ago

Right now I think the lowest-hanging fruit is adding rejection flags/commands where possible (especially in imported rules like "best64" or the KoreLogic stuff). I'm experimenting with that, but I'm hitting some obstacles.

Let's say I want to add rejection commands to these contrived example rules:

r c Az"[a-z][a-z][a-z]"
r c ] ] ] ]

I see no way to do so with the current rules engine except after the commands that add/remove characters, which is slow, e.g.:

r Az"[a-z][a-z][a-z]" <+ >@ c
r ] ] ] ] <+ >@ c

Or is there some way to do it before the slow stuff? Let's disregard the fact that we could hand-optimize them to A0"[a-z][a-z][a-z]" <+ >@ r c and [ [ [ [ <+ >@ r c for a tad earlier rejection - that is tedious and/or not generally possible.

vVNM could be used if it weren't for the fact that we might go negative. It's best illustrated with the rule that removes characters:

va@4 >a va04 vb*a <b r c ] ] ] ]

We set a to (min_length - 1) - 4 and do >a. But with a min_length of less than 5 we go negative, and a is parsed by >a as unsigned, so it becomes something 250-ish (e.g. with a min_length of 3, (3 - 1) - 4 = -2 wraps to 254) - i.e. it will reject everything.

On a related note, the current implementation of @ (for min_length - 1) sets it to 0 if min_length is 0 - that is suboptimal already. Could we not have all commands recognize signed arithmetic? That is, >N should not reject anything if N is negative, <N should reject everything for a negative N, and so on.
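
That signed behavior is easy to pin down; a sketch (hypothetical helper names, returning nonzero to reject):

/* >N: reject unless length > N; a negative N rejects nothing. */
static int reject_unless_greater(int len, int n)
{
    return n >= 0 && len <= n;
}

/* <N: reject unless length < N; a negative N rejects everything. */
static int reject_unless_less(int len, int n)
{
    return n < 0 || len >= n;
}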

Even with that, we'd fail to early-reject in some cases though. For a while I thought we might want some new rejection commands similar to <N and >N, except they'd check an arbitrary variable instead of the current length. However, adding something like the following should be more effective and would be much easier to use than all of the above:

aN  reject unless word would pass min. & max. lengths after adding N characters
bN  reject unless word would pass min. & max. lengths after removing N characters

With that we could simply do:

a3 r c Az"[a-z][a-z][a-z]"
b4 r c ] ] ] ]

...and why not:

:[ab]1 \p[$^] [a-z]

Not sure about using a and b but they seem to be free. I believe some free mnemonics are: abghjmnw !@#&"

Oh, and even with the aN and bN I believe we might benefit from having all commands recognize signed numbers.
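
A sketch of the proposed aN/bN semantics (these commands are only a proposal at this point; names per the recap above):

/* aN: delta = +N; bN: delta = -N. Reject (nonzero) unless the word's
   length after the rule's net change would still fit the requested
   length window. Signed arithmetic keeps short words from wrapping
   around the way the unsigned vVNM workaround does. */
static int length_early_reject(int len, int delta, int min_len, int max_len)
{
    int out = len + delta;

    return out < min_len || out > max_len;
}

So a3 r c Az"[a-z][a-z][a-z]" would perform this check with delta = +3 before any of the expensive work, and b4 r c ] ] ] ] with delta = -4.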

@solardiz I need input... or I'll possibly make some crazy PR that you might end up rejecting...

solardiz commented 5 years ago

@magnumripper What you suggest (two new commands, signed arithmetic) makes sense to me.

When you say <+ >@ closer to the end of a rule is slow, you seem to imply the rejections would happen often, whereas I think for most hash types and input words they'd be very rare. Perhaps use of the min/max command-line options changes that - is this the case you're optimizing for? Also, having these commands closer to the end of a rule, if/when we have to, becomes redundant with the length check JtR itself performs after fully applying the rule.

I'm concerned there would be a performance impact from adding >@ when there's no minimum specified. Is the plan to squeeze these commands out in the initial parsing step when there's no minimum - the same step that currently squeezes out explicit no-op commands (whitespace and :)?

solardiz commented 5 years ago

Please coordinate with the Hashcat project regarding the specific command characters.

magnumripper commented 5 years ago

Yes, much of this is optimization for the --min/max-len options (e.g. after having brute-forced up to a certain length), or for attacking e.g. WPAPSK formats, which have an actual min_length of 8. Not to mention the new default single-mode feature that decreases max_length (within reason) rather than decreasing KPC for a GPU.

Having the optimizer strip things out whenever possible makes a lot of sense; I'll keep that in mind.

@jsteube I believe the aN and bN commands are not yet used. To recap:

aN  reject unless word would pass min. & max. lengths after adding N characters
bN  reject unless word would pass min. & max. lengths after removing N characters

Are you OK with us using them for this, or do you have a better suggestion? aN can be a mnemonic for "add N early-reject", but I can't find a similarly good letter for "delete N early-reject", except that b comes right after a...

Also, I can't recall whether Hashcat has constants and/or variables at all, but we're about to switch ours from something like uint8_t to int8_t per the above.

jsteube commented 5 years ago

For "b" I don't see any problem, but keep in mind that InsidePro was using "a" for Toggle cases. If you don't mind, feel free using it, it's not used in hashcat. Thanks for asking!