openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Opportunistic dupe suppression for wordlist+rules #4728

Closed - solardiz closed this issue 2 years ago

solardiz commented 3 years ago

We currently have full duplicate candidate suppression in the separate unique tool, and we also reuse code from there to suppress duplicate input wordlist entries and wordlist rules. However, when we're running through wordlist+rules (maybe after suppression of duplicates within the wordlist and within the rules separately), we only suppress consecutive duplicates (comparing against the immediately preceding candidate password).

I think we could (optionally) do better (for slow hashes or/and many salts) by maintaining and comparing against a configurable-size buffer of previous candidates.

In terms of implementation, I think this could be a ring buffer and a hash table. In unique, we use a non-ring buffer and a hash table. We reset them each time the buffer fills up. I think for this new feature we would instead need to be progressively removing old entries from the hash table as the ring buffer's contents are about to be overwritten with new entries. Further, with variable-size entries (like unique already uses for different string lengths), we need to support a case when adding just one new entry requires removal of several previous entries from the hash table first.
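Roughly the shape I have in mind, as a sketch only (not actual JtR code, and simplified to fixed-size slots - with the variable-length entries described above, making room for one new entry may require unlinking several old ones):

/* Hypothetical sketch: a ring of fixed-size slots plus a chained hash table
 * of slot indices.  Before the oldest slot is overwritten, its entry is
 * unlinked from the hash table. */
#include <stdint.h>
#include <string.h>

#define NSLOTS   (1 << 20)              /* ring capacity (power of 2) */
#define NBUCKETS (1 << 20)              /* hash table buckets (power of 2) */
#define MAXLEN   31
#define EMPTY    0xffffffffu

static char slot[NSLOTS][MAXLEN + 1];
static uint32_t next_in_bucket[NSLOTS]; /* chains slots within a bucket */
static uint32_t bucket[NBUCKETS];       /* head slot of each bucket, or EMPTY */
static uint32_t write_ptr;              /* oldest slot = next one to overwrite */
static uint32_t filled;                 /* number of slots holding valid entries */

static uint32_t hashfn(const char *p)
{
    uint32_t h = 5381;
    while (*p)
        h = h * 33 + (unsigned char)*p++;
    return h & (NBUCKETS - 1);
}

void suppressor_init(void)
{
    memset(bucket, 0xff, sizeof(bucket)); /* mark all buckets EMPTY */
}

/* Returns 1 if "key" was seen recently (suppress it), 0 if it was added as new */
int suppressor_check(const char *key)
{
    uint32_t h, i;

    if (strlen(key) > MAXLEN)
        return 0; /* too long for this simplified sketch: just pass it through */

    h = hashfn(key);
    for (i = bucket[h]; i != EMPTY; i = next_in_bucket[i])
        if (!strcmp(slot[i], key))
            return 1;

    if (filled == NSLOTS) {
        /* About to overwrite the oldest entry: unlink it from its bucket */
        uint32_t *p = &bucket[hashfn(slot[write_ptr])];
        while (*p != write_ptr)
            p = &next_in_bucket[*p];
        *p = next_in_bucket[write_ptr];
    } else {
        filled++;
    }

    strcpy(slot[write_ptr], key);
    next_in_bucket[write_ptr] = bucket[h];
    bucket[h] = write_ptr;
    write_ptr = (write_ptr + 1) & (NSLOTS - 1);
    return 0;
}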

We can possibly improve the effectiveness of such dupe suppression (percentage suppressed per buffer size) by maintaining (recent) hit counts on entries, so that frequently or/and recently hit entries would be retained for longer even if they're near the overwrite pointer (were entered into the ring buffer long ago). We can either keep them intact (jump over them in our overwrites) or re-add them right away (if needed to avoid gaps that would result from changes in nearby string lengths).

We can also reuse this logic later to implement something akin to hashcat brain. Per my reading of the hashcat brain documentation, it cannot maintain a maximum size and thus could eventually run out of storage resources and fail. The alternative I suggest here can instead maintain a maximum size, at the expense of imperfect dupe suppression. So it could make sense as an alternative brain implementation, including for use with hashcat if we implement the same protocol.

In fact, my motivation behind suggesting this feature now is needing to run some slow and lengthy wordlist+rules attacks where the resulting candidate password streams are too large to comfortably pass them through unique (or rling or duplicut) in advance and store the results, or to use hashcat brain. Say, terabytes. I am thinking that with a ring buffer size larger than input wordlist size (say, a few GB or tens of GB, so still fitting in RAM) I could probably suppress most duplicates without incurring as much storage and processing overhead, and also with the processing overhead comfortably spread over the attack duration. I'd first simulate this on a similar-looking but much smaller wordlist with the same ruleset, comparing the percentage of dupes suppressed by unique or the like vs. this opportunistic approach with a ring buffer size scaled to the smaller wordlist's size (e.g., 10x the wordlist size if that's what I'd be able to afford in the target use case). Once confirmed that only acceptably few dupes are left with the opportunistic approach in this scaled-down simulation, I'd proceed to run the real attack.

solardiz commented 3 years ago

An alternative implementation could store hashes/fingerprints instead of actual candidate passwords. As I read it, hashcat brain stores 64-bit hashes. We could, too, for a perhaps sufficiently low rate of false positive matches (and thus of potential false negatives in terms of password cracking, where we'd wrongly skip a non-dupe). This eliminates the need to bother with variable-length entries, and (more importantly) likely lets us store more entries per buffer size. We need to compute some fast hashes anyway, for fast lookup, although we'd need to compute larger hashes to have enough bits for both the index and the storage.

For short passwords (smaller than our chosen hash storage size), we could store them directly, and allocate and set a bit to indicate that. Then our hashes would be one bit smaller, but we'd have assurance of absolutely no false matches/skips until a certain password length (perhaps 8). This could also speed up processing of such passwords (only need to compute a shorter hash, for the index only).
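A sketch of just the entry encoding for this (made-up names; the bucket index would come from additional hash bits, and the actual hash function choice is open):

/* Hypothetical sketch: 64-bit entries; candidates shorter than 8 bytes are
 * stored verbatim with a tag bit set (so they can never false-match), longer
 * ones are reduced to a 63-bit fingerprint (rare false matches possible). */
#include <stdint.h>
#include <string.h>

#define TAG_LITERAL (1ULL << 63)

/* Placeholder 64-bit string hash (FNV-1a); anything fast and uniform would do */
static uint64_t hash64(const char *p, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    while (len--) {
        h ^= (unsigned char)*p++;
        h *= 0x100000001b3ULL;
    }
    return h;
}

static uint64_t encode(const char *key)
{
    size_t len = strlen(key), i;
    uint64_t e = 0;

    if (len < 8) {
        for (i = 0; i < len; i++)  /* pack the bytes; bits 56..63 stay clear */
            e |= (uint64_t)(unsigned char)key[i] << (8 * i);
        return e | TAG_LITERAL;    /* exact: no false matches for short candidates */
    }
    return hash64(key, len) & ~TAG_LITERAL; /* fingerprint: rare false matches */
}

Lookups then compare the stored 64-bit words directly, so wrongly skipped non-dupes can only occur for the longer candidates.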

solardiz commented 3 years ago

Even before I opened this issue, I'd been thinking of how to use a cuckoo filter for this, and specifically of whether/how to integrate it with passwdqc (pwqfilter). I didn't mention this at first because I thought there was a major reason not to go this way: with a cuckoo filter, there's no quick and easy way to replace the oldest entries first. However, thinking about this more, maybe there's not that much need to. Maybe it's fine to replace random entries, or to probabilistically replace entries for less frequently seen candidates, if this enables us to store way more entries in the same size in bytes. With a ring buffer and a hash table, there's maybe a 2x size overhead from storing not only the candidates or their hashes, but also next-in-bucket pointers, and from needing the hash table itself. A cuckoo filter avoids that, storing only the fingerprints.

To keep track of how frequently a candidate was produced, we could add a hit count field (a few bits, or just one bit maybe to distinguish 1 vs. 2+). Alternatively, we could simply be adding up to a few (or maybe up to 2) duplicate entries (a classic 2x4 cuckoo filter can hold up to 8 duplicates). With duplicates in there, our random replacements will less likely completely remove detection of a candidate that was already seen more than once.
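For illustration, a sketch of a 2x4 cuckoo filter insert that never fails outright (made-up names and sizes, not pwqfilter's actual code): when the usual displacement loop gives up, it simply overwrites a random slot, trading perfect suppression for a bounded size.

/* Hypothetical sketch: 16-bit fingerprints, 4 slots per bucket, 2 candidate
 * buckets per fingerprint.  fp == 0 is reserved to mark an empty slot. */
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS     (1 << 22)  /* power of 2 */
#define BUCKET_SLOTS 4
#define MAX_KICKS    500

static uint16_t table[NBUCKETS][BUCKET_SLOTS];

static uint32_t alt_index(uint32_t i, uint16_t fp)
{
    /* partial-key cuckoo trick: XOR with a hash of the fingerprint, so that
       alt_index(alt_index(i, fp), fp) == i */
    return (i ^ (fp * 0x5bd1e995u)) & (NBUCKETS - 1);
}

static int bucket_insert(uint32_t i, uint16_t fp)
{
    int s;
    for (s = 0; s < BUCKET_SLOTS; s++)
        if (!table[i][s]) {
            table[i][s] = fp;
            return 1;
        }
    return 0;
}

void cuckoo_add(uint32_t hash, uint16_t fp)
{
    uint32_t i = hash & (NBUCKETS - 1);
    int kicks;

    if (bucket_insert(i, fp) || bucket_insert(alt_index(i, fp), fp))
        return;

    for (kicks = 0; kicks < MAX_KICKS; kicks++) {
        int s = rand() % BUCKET_SLOTS;
        uint16_t victim = table[i][s];
        table[i][s] = fp;               /* displace a random occupant */
        fp = victim;
        i = alt_index(i, fp);
        if (bucket_insert(i, fp))
            return;
    }
    /* Locally full: overwrite a random slot and forget that old entry -
       imperfect suppression instead of failure */
    table[i][rand() % BUCKET_SLOTS] = fp;
}

int cuckoo_lookup(uint32_t hash, uint16_t fp)
{
    uint32_t i1 = hash & (NBUCKETS - 1), i2 = alt_index(i1, fp);
    int s;
    for (s = 0; s < BUCKET_SLOTS; s++)
        if (table[i1][s] == fp || table[i2][s] == fp)
            return 1;
    return 0;
}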

We can also optionally store the filter to disk (and read it back), or mmap it, or keep it in SysV shm. For example, it'd be possible to first run through the RockYou wordlist, then (even in a separate program invocation) proceed with incremental mode while skipping candidates that were in the wordlist. We'd need an optional read-only filter mode: the filter would be read-write e.g. during a wordlist(+rules) run, but read-only during a subsequent e.g. incremental mode run (a mode that does not produce duplicates on its own).
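The file-backed part could be as simple as this sketch (hypothetical helper; the actual on-disk layout would be pwqfilter's, not something new):

/* Hypothetical sketch: map a filter file read-write while it's being
 * populated by a wordlist(+rules) run, or read-only for a later run that
 * only consults it. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void *map_filter(const char *path, size_t *size, int writable)
{
    struct stat st;
    void *p;
    int fd = open(path, writable ? O_RDWR : O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    *size = st.st_size;
    p = mmap(NULL, *size, writable ? PROT_READ | PROT_WRITE : PROT_READ,
             MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}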

If we use (only) the same filter file format that pwqfilter uses (which implies not adding a count field), it would even be possible to reuse files generated with pwqfilter, or to generate new and more compact files (than what JtR would generate while cracking) with pwqfilter quickly, if, say, the original wordlist run didn't produce/save a filter. It would also be possible to do the reverse: test only candidates that do appear in a filter - e.g., run any cracking mode's overlap with what's in HIBP (file size 2.5 GB, speed of a few million lookups per second) - if that would ever make sense (perhaps it would at least for research).

solardiz commented 3 years ago

this could be a ring buffer and a hash table. [...] need to be progressively removing old entries from the hash table as the ring buffer's contents are about to be overwritten with new entries.

BTW, we have this for single mode (with tiny per-salt buffers) since 1998 or so, see single_add_key() and single_process_buffer(). It's kind of weird we never implemented the same for wordlist mode. As I recall, the reasons why I treated single mode specially in this respect were that it required such per-salt buffering (separate from what set_key() would use) for other reasons anyway (such as to support bitslice DES) and that GECOS data + rules would produce many duplicates avoidable even with a tiny buffer.

Further, with variable-size entries (like unique already uses for different string lengths), we need to support a case when adding just one new entry requires removal of several previous entries from the hash table first.

Single mode's implementation uses fixed maximum lengths, so it only removes one item from a hash table bucket prior to reusing that item's buffer space. This made sense back when this was typically used on descrypt hashes. We might want to update it to support variable lengths now, or to store fast hashes of passwords instead, or even to use per-salt cuckoo filters instead of these ring buffers and hash tables.

solardiz commented 2 years ago

Unfortunately, we've already wasted the --dupe-suppression option name:

--dupe-suppression              suppress all duplicates from wordlist

Normally, consecutive duplicates are ignored when reading a wordlist file.
This switch enables full dupe suppression, using some memory and a little
extra start-up time.  This option implies preload regardless of file size,
see the --mem-file-size option.

Should we reuse this name, enhance this option (make it accept a parameter), or add a new one?

As an option, we can rename the above to --wordlist-dupe-suppression and reuse the name as --dupe-suppression[=SIZE] (if =SIZE is not specified, then a default opportunistic dupe suppression cache size will be used - from john.conf or compile-time). That way, the two kinds of suppression can be used in arbitrary combinations.
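For example, usage could then look roughly like this (hypothetical option names and size unit, just to show the combination):

./john --wordlist=words.lst --rules --wordlist-dupe-suppression --dupe-suppression=256 hashes.txt

which would fully de-dupe the wordlist itself and also opportunistically suppress dupes among the rule outputs, using a 256 (MiB, say) cache.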

The opportunistic dupe suppression shouldn't be limited to wordlist mode, but would also work e.g. in batch mode (for wordlist followed by incremental).

We can also have --dupe-suppression=0 or --no-dupe-suppression if we ever make it the default (perhaps with a small default cache size and only for slow hashes - need to add a format flag, or adapt by actual p/s rate at runtime).

solardiz commented 2 years ago

We could also (later) enhance the option to be --dupe-suppression[=SIZE|FILE], to allow for (re)use of mmap'ed cache files (probably on tmpfs). For an existing file, the size shouldn't need to be specified - however, there's the question of how to specify it for a new file (just create the file with dd? or with pwqfilter? or also have a size+file syntax in john?)

solardiz commented 2 years ago

Also, with the ability to specify an external file, this is not necessarily dupe suppression - it can suppress anything matching the file's content, or anything not matching it. We could even support multiple such files and a Boolean expression around them, but this gets tricky. Also, we could have the file updated (usually reasonable if it's used for dupe suppression) or kept as-is. There are so many possibilities that we probably shouldn't try to come up with a syntax to support them all right away, and should get basic functionality in first.

magnumripper commented 2 years ago

Should we reuse this name, enhance this option (make it accept a parameter), or add a new one?

I think we should reuse this, with the benefit that we can also drop some code from the horrific wordlist.c.

solardiz commented 2 years ago

we can also drop some code from the horrific wordlist.c.

Do you suggest completely dropping the current functionality of --dupe-suppression (not keeping it under some other name)? I thought of keeping it as well, but now that you mention it, this actually makes sense to me.

magnumripper commented 2 years ago

That code would still have some merit (rejecting dupes before sending them through millions of rules), but given the sad state of that file, it's an opportunity to drop some code. But we could decide on it later, renaming the current option to --wordlist-dupe-suppression like you said.

solardiz commented 2 years ago

I am experimenting with a trivial implementation of this now (~100 LOC, hard-coded settings). Using our new default wordlist and best64 rules and --stdout | ./unique, we had:

133327640p 0:00:00:23 100.00% (2022-02-02 11:24) 5697Kp/s srocks
Total lines read: 133327640, unique lines written: 74214355 (55%), no slow passes

With a 4M entry cache (1M x 4-way):

80842273p 0:00:00:23 100.00% (2022-02-02 12:07) 3454Kp/s yssmey
Total lines read: 80842273, unique lines written: 74214355 (91%), no slow passes

64M (16M x 4):

74754241p 0:00:00:23 100.00% (2022-02-02 12:13) 3133Kp/s yssmey
Total lines read: 74754241, unique lines written: 74214355 (99%), no slow passes

The reported p/s drops mostly because it does not count the rejected duplicate candidates. More importantly, there's no real-time increase here - it looks like dupe-checking everything costs roughly the same as the puts() of the now-rejected candidates did. For a rule set producing fewer dupes or/and for hashing faster than puts(), there would be a performance cost.

solardiz commented 2 years ago

And here's 4K (1K x 4), which is much smaller than the wordlist size:

116727220p 0:00:00:20 100.00% (2022-02-02 12:27) 5638Kp/s srocks
Total lines read: 116727220, unique lines written: 74214355 (63%), no slow passes

Curiously, this still provides a significant improvement over not dupe-checking at all (we had 133327640p 0:00:00:23, 55% unique), and somehow there's even a real-time improvement here (with the cache fitting in the CPU cache, perhaps the checks are really quick).

solardiz commented 2 years ago

That code would still have some merit (rejecting dupes before sending them through millions of rules)

Why would there be any dupes in an input wordlist in the first place, especially when the person knows about the problem (they wouldn't use the option otherwise)? I have only one explanation: truncation at loading. However, most relevant truncation (format-specific) applies after rules, not before. So the feature looks pretty useless to me, and I've never used it.

but given the sad state of that file, it's an opportunity to drop some code. But we could decide on it later, renaming current option to --wordlist-dupe-suppression like you said.

For my testing now (and perhaps for initial commit), I am just adding my extra functionality to the same option, on top of its existing functionality. For small wordlists and lots of rules, the overhead of the old functionality is insignificant. For larger wordlists, I suppose we'll need to drop the old functionality or add a separate knob for it.

BTW, I've since tuned things some more (hash function, cache associativity, eviction policy), so that 4M (now as 0.5M x 8) gives:

80367531p 0:00:00:23 100.00% (2022-02-02 19:28) 3388Kp/s yssmey
Total lines read: 80367531, unique lines written: 74214355 (92%), no slow passes

That's only a tiny additional improvement, because the previous version was already lightly tuned.

solardiz commented 2 years ago

tuned things some more (hash function, cache associativity, eviction policy)

I now think I ended up overfitting my test case, especially as it relates to the cache eviction policy. In that case, it's beneficial to hold an initial set of candidates in the cache "forever". Simply adding --rules-skip-nop changes that. So I'm now making changes for the eviction policy to make sense in general, even at the cost of reduced efficiency on my test case. Now at:

81974363p 0:00:00:19 100.00% (2022-02-02 22:41) 4171Kp/s yssmey
Total lines read: 81974363, unique lines written: 74214355 (90%), no slow passes
magnumripper commented 2 years ago

That code would still have some merit (rejecting dupes before sending them through millions of rules)

Why would there be any dupes in an input wordlist in the first place, especially when the person knows about the problem (they wouldn't use the option otherwise)?

The de-dupe code kicks in (automagically) for --loopback mode, and IIRC that was the very reason it was added - the explicit use for wordlists was just a "bonus" but in nearly all situations it'd be more natural to have a unique'd wordlist in the first place.

Given --loopback would normally produce relatively few candidates, dropping it in favor of your new magic should be Good Enough™️ even for the millions of rules case.

magnumripper commented 2 years ago

On another note, do you have any thoughts on having forked nodes somehow share the "database", so one node wouldn't allow a candidate that some other node already used? I believe our current node/fork code often produces more dupes than running a single node, for several reasons.

For MPI it'd be even trickier, if not impossible (except under certain conditions). Perhaps something akin to hashcat's "brain" would work in that case; it's been a long time since I read up on that, and I never tried it.

solardiz commented 2 years ago

The de-dupe code kicks in (automagically) for --loopback mode

Oh, this makes sense now. Maybe we can keep it for that mode only for now, which doesn't require documentation.

any thoughts on having forked nodes somehow sharing the "database" so one node wouldn't allow a candidate that some other node already used?

Yes, I've been thinking of adding a suppress_preinit() that we'd call before fork() if it's expected we'll need the suppressor. That call would pre-allocate memory as MAP_SHARED. Alternatively, if we allow specification of an external file, that could be used with a file on tmpfs. The data structure and opportunistic nature of this approach are such that we can proceed with concurrent access without locking.
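A sketch of that preinit (only the allocation part; the suppress_preinit() name is from above, everything else is hypothetical):

/* Called before fork() when we expect to need the suppressor, so that all
 * forked nodes share (and update) the same cache.  Unsynchronized concurrent
 * updates are acceptable here: the worst case is an occasional missed dupe. */
#include <stddef.h>
#include <sys/mman.h>

static void *suppressor_mem;
static size_t suppressor_size;

int suppress_preinit(size_t size)
{
    suppressor_mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (suppressor_mem == MAP_FAILED) {
        suppressor_mem = NULL;
        return -1;
    }
    suppressor_size = size;
    return 0;
}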

magnumripper commented 2 years ago

The de-dupe code kicks in (automagically) for --loopback mode

Oh, this makes sense now. Maybe we can keep it for that mode only for now, which doesn't require documentation.

That'd be a great first step - perhaps we should do it within your current PR? If not, I presume using --dupe-suppression on a truly huge wordlist will result in memory hogging to the point of triggering the OOM killer. Alas, I'm completely swamped for the next few days so I can't help with that. Should be trivial though.

solardiz commented 2 years ago

I intend to bring the current PR to a merge-able state, and merge it, within the next day or two. Then I'll have to switch to other work.

magnumripper commented 2 years ago

Did you try using OPT_TRISTATE? I think --no-dupe-suppression is much more elegant than --dupe-suppression=0

solardiz commented 2 years ago

Did you try using OPT_TRISTATE? I think --no-dupe-suppression is much more elegant than --dupe-suppression=0

Didn't try - wrongly(?) thought it didn't support also having a parameter?

Anyway, I ran out of time for this, so that would be something you can fix later. ;-)

magnumripper commented 2 years ago

I can't remember the details but here's a comment in getopt.h

/*
 * Spec. for OPT_TRISTATE:
 *
 * For format NULL (implies %d):
 *   option not given      Set options.foobar to -1
 *   --foobar              Set options.foobar to 1
 *   --no-foobar           Set options.foobar to 0
 *   --[no-]foobar=123     OPT_ERROR_PARAM_EXT
 *
 * OPT_BOOL is just a special case of the above, which doesn't touch
 * options.foobar if the option was not given (so will be 0).
 *
 * For format %d:
 *   option not given      Set options.foobar to -1
 *   --foobar              Set options.foobar to 1
 *   --no-foobar           Set options.foobar to 0
 *   --foobar=123          Set options.foobar to 123 (or OPT_ERROR_PARAM_REQ)
 *   --no-foobar=123       OPT_ERROR_PARAM_EXT
 *
 * For format OPT_FMT_STR_ALLOC:
 *   option not given      options.foobar is NULL (is not touched)
 *   --foobar              if OPT_REQ_PARAM: OPT_ERROR_PARAM_REQ,
 *                         otherwise set options.foobar to OPT_TRISTATE_NO_PARAM
 *   --no-foobar           Set options.foobar to OPT_TRISTATE_NEGATED
 *   --foobar=filename     Alloc options.foobar and copy "filename" to it
 *   --no-foobar=filename  OPT_ERROR_PARAM_EXT
 */

So looks usable. I can add it later (perhaps next week).

magnumripper commented 2 years ago

I tried it, it's actually this simple:

diff --git a/src/options.c b/src/options.c
index b54cafb6c..fbd760498 100644
--- a/src/options.c
+++ b/src/options.c
@@ -162,7 +162,7 @@ static struct opt_entry opt_list[] = {
    {"subformat", FLG_ONCE, 0, 0, USUAL_REQ_CLR | FLG_STDOUT | OPT_REQ_PARAM, OPT_FMT_STR_ALLOC, &options.subformat},
    {"list", FLG_ONCE, 0, 0, USUAL_REQ_CLR | FLG_STDOUT | OPT_REQ_PARAM, OPT_FMT_STR_ALLOC, &options.listconf},
    {"mem-file-size", FLG_ONCE, 0, FLG_WORDLIST_CHK, FLG_DUPESUPP | FLG_STDIN_CHK | FLG_PIPE_CHK | OPT_REQ_PARAM, Zu, &options.max_wordfile_memory},
-   {"dupe-suppression", FLG_DUPESUPP, FLG_DUPESUPP, FLG_WORDLIST_CHK, 0, "%d", &options.suppressor_size},
+   {"dupe-suppression", FLG_DUPESUPP, FLG_DUPESUPP, FLG_WORDLIST_CHK, OPT_TRISTATE, "%d", &options.suppressor_size},
 /*
  * --fix-state-delay=N is deprecated and ignored, drop support after releasing 1.9.0-Jumbo-2
  */
@@ -284,7 +284,7 @@ static struct opt_entry opt_list[] = {
 "--rules-skip-nop           Skip any NOP \":\" rules (you already ran w/o rules)\n" \
 "--loopback[=FILE]          Like --wordlist, but extract words from a .pot file\n" \
 "--mem-file-size=SIZE       Size threshold for wordlist preload (default %u MB)\n" \
-"--dupe-suppression[=SIZE]  Opportunistic dupe suppression for wordlist+rules\n" \
+"--[no-]dupe-suppression[=SIZE] Opportunistic dupe suppression for wordlist+rules\n" \
 "--incremental[=MODE]       \"Incremental\" mode [using section MODE]\n" \
 "--incremental-charcount=N  Override CharCount for incremental mode\n" \
 "--external=MODE            External mode or word filter\n" \

Unfortunately the usage blurb line is "too long". If we omit the [no-] from it, tab completion will not work for that.

magnumripper commented 2 years ago

I tried it, it's actually this simple:

Oh, but that would mean just saying --dupe-suppression will infer a size of 1? That probably needs to be handled somewhere else in your code.

solardiz commented 2 years ago

Right now, --dupe-suppression without a parameter means to use the default from john.conf or compile-time default. We should preserve that functionality. Also, the description in doc/OPTIONS should be kept consistent with the option's actual behavior. So I'll leave this move to tri-state out of the current PR.

solardiz commented 2 years ago

will infer a size of 1?

BTW, if we simply replace 1 with DEFAULT_SIZE, that would also affect explicit --dupe-suppression=1, which should mean 1 MiB, not default. With these complications, I think we should just leave things as-is, and not use tri-state there.

magnumripper commented 2 years ago

When would --dupe-suppression be needed at all? I mean, when is it not turned on by default?

solardiz commented 2 years ago

When would --dupe-suppression be needed at all? I mean, when is it not turned on by default?

Specifying this option forces the dupe suppressor to stay enabled even if its efficiency appears low - this can be useful e.g. if more dupes are expected later in the run. In fact, I am repeating here what I just wrote in doc/OPTIONS - please take a look.

magnumripper commented 2 years ago

please take a look

I did, I just have a teflon brain >.<

Anyway, I played around with your code a little and it works like a champ. I also like the long status - we should just add some more info to it later.

solardiz commented 2 years ago

Some ideas currently left unimplemented:

solardiz commented 2 years ago

More related unimplemented ideas:

solardiz commented 2 years ago

The suppressor could be a lot faster if it processed more than one candidate in parallel.

In fact, with a large enough buffer we could use OpenMP in the suppressor.

Observations from trying to use the suppressor for real:

solardiz commented 2 years ago
  • The current mem_calloc_align followed by random writes is initially slow when the allocation size is large - took ~3 minutes for ~100 GB - because this makes the kernel allocate small pages in random order

Actually, no - looking inside mem_calloc_align, it has its own memset, which means a sequential write before we reach our random writes. Then it's presumably some kernel-internal slowdown when the memory is fragmented from prior use, which we might not be able to do much about.

solardiz commented 2 years ago

Another thought I'd like to write down:

The current hash/indexing function uses a 32-bit hash value, which means it has significant biases when N is somewhat close to 2^32 yet not a power of 2. With the current K=8, this probably has the worst impact on efficiency at a size of 192 GiB. To avoid this, and also to support more than 256 GiB without increasing K, we'd need to upgrade to a 64-bit index hash (and re-test/adjust it for uniformity again) and either move away from the current multiplication-based conversion from hash to index to classic modulo division, or (where available) use a 64x64-to-128-bit multiplication intrinsic/instruction (perhaps with a fallback to modulo division where that isn't available, or limiting the size to 256 GiB on those platforms).
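For reference, a sketch of the widened hash-to-index conversion (assuming a compiler-provided 128-bit type, with a modulo fallback):

#include <stdint.h>

/* Map a 64-bit hash to a bucket index in [0, n) */
static inline uint64_t hash_to_index(uint64_t h, uint64_t n)
{
#ifdef __SIZEOF_INT128__
    /* high half of the 64x64 product: no modulo, no bias beyond the hash's own */
    return (uint64_t)(((unsigned __int128)h * n) >> 64);
#else
    return h % n; /* fallback where no 128-bit multiply is available */
#endif
}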

magnumripper commented 2 years ago
  • The current mem_calloc_align followed by random writes is initially slow when the allocation size is large - took ~3 minutes for ~100 GB - because this makes the kernel allocate small pages in random order

Actually, no - looking inside mem_calloc_align, it has its own memset, which means a sequential write before we reach our random writes. Then it's presumably some kernel-internal slowdown when the memory is fragmented from prior use, which we might not be able to do much about.

Perhaps try the same test using mem_calloc() to see what happens? That function uses the real calloc. I presume you'd get page-sized alignment for most large allocations anyway, so as good as MEM_ALIGN_CACHE - but I'm on thin ice here.

magnumripper commented 2 years ago

I presume you'd get page-sized alignment for most large allocations anyway, so as good as MEM_ALIGN_CACHE - but I'm on thin ice here.

Some empirical tests show that my assumption holds under macOS but not under Linux. And anyway in this case I guess the current behavior is what we want.

solardiz commented 2 years ago

in this case I guess the current behavior is what we want.

Yes, I was actually going to try adding the memset before I realized that's what we do already. A possible improvement would be explicit use of huge pages.
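For example (Linux-specific, best-effort sketch only):

#include <stddef.h>
#include <sys/mman.h>

static void *alloc_suppressor(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_HUGEPAGE
    madvise(p, size, MADV_HUGEPAGE); /* just a hint; needs THP enabled */
#endif
    return p;
}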

On another topic, I guess the LockHalf feature is less likely to help (and more likely to hurt) when restoring a session that's already progressed beyond the no-op rule. When starting with the no-op and other simple rules, it locks actually common passwords/patterns in memory. When restoring at some arbitrary point in the rule set, it'd lock an arbitrarily weird set of passwords/patterns. So maybe we should experiment with this and then likely ignore this setting when restoring, or when restoring beyond the first rule, or make the setting a tri-state to have this new behavior optional. We'd also need to document any such change.

# Whether to lock the oldest half of entries as write-once (and only ever
# update the other half).  Empirically, this often provides best results,
# but it can also backfire.  The default is yes.
LockHalf = Y

Edit: Oh, I recalled I did test this with --rules-skip-nop on best64 and it was still beneficial - but perhaps it wouldn't be when much further into a typical rule set?

solardiz commented 2 years ago

As currently implemented and by default (without --dupe-suppression), the suppressor will auto-disable itself after ~67M candidates if it hasn't suppressed anything by that point. However, some wordlists are all-unique and are larger than that, and dupes would only start being produced from the second rule onward. We could prevent the auto-disabling:

  1. Until we went through a few rules (perhaps at least 2). However, that knowledge is currently unavailable to the suppressor.
  2. When the p/s rate is very low (say, below 10k), even if the suppressor's efficiency so far is zero (nothing suppressed). Easy - see the sketch after this list.
  3. Until we've accepted a larger number of candidates (say, 2 billion, to be larger than crackstation.txt, which some people will use? or maybe 2*N*K, so that it's only larger than current when Size is increased? or make it a function of p/s rate?)
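A rough sketch of the check for item 2 (thresholds and names made up, not actual code):

#include <stdint.h>
#include <time.h>

#define AUTODISABLE_COUNT 67000000ULL /* roughly the current ~67M threshold */
#define SLOW_RATE_CUTOFF  10000       /* p/s below which we never give up */

static int suppressor_should_disable(uint64_t seen, uint64_t suppressed,
                                     time_t elapsed)
{
    if (suppressed || seen < AUTODISABLE_COUNT)
        return 0;
    if (elapsed > 0 && seen / (uint64_t)elapsed < SLOW_RATE_CUTOFF)
        return 0; /* slow hash or/and many salts: keep the suppressor on */
    return 1;     /* fast hash and nothing suppressed so far: disable */
}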

We could also add a keypress to manually disable the suppressor, to be used when running on a fast hash.

Edit: BTW, crackstation-human-only.txt is just under 64M, crackstation.txt is just over 1.2B (decimal).

magnumripper commented 2 years ago

Taking p/s into account is a no-brainer, that would be great.

that knowledge is currently unavailable to the suppressor.

Adding a getter function in rules.c would also help in implementing "processing rule x/y" in detailed status.