solardiz closed this issue 2 years ago
An alternative implementation could store hashes/fingerprints instead of actual candidate passwords. As I read it, hashcat's brain stores 64-bit hashes. We could, too, accepting a perhaps sufficiently low rate of false positive matches (and thus of potential false negatives in terms of password cracking, where we'd wrongly skip a non-dupe). This eliminates the need to bother with variable-length entries, and (more importantly) likely lets us store more entries per buffer size. We need to compute some fast hashes anyway, for fast lookup, although we'd need to compute larger hashes to have enough bits for both the index and the stored value.
For short passwords (smaller than our chosen hash storage size), we could store them directly, and allocate and set a bit to indicate that. Then our hashes would be one bit smaller, but we'd have assurance of absolutely no false matches/skips until a certain password length (perhaps 8). This could also speed up processing of such passwords (only need to compute a shorter hash, for the index only).
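A minimal sketch of this tagged-slot idea (the layout and names here are hypothetical, and it assumes a little-endian host for brevity; passwords can't contain NUL bytes, so zero-padding is unambiguous):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical slot layout: bit 0 is the tag.  Tag 1 = bits 1..56 hold a
 * password of up to 7 bytes, zero-padded; tag 0 = the upper 63 bits hold
 * a fast 63-bit hash of a longer password.  Equal slots then never
 * falsely match for passwords of up to 7 bytes. */
static uint64_t fnv1a64(const char *s)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

static uint64_t make_slot(const char *pw)
{
	size_t len = strlen(pw);

	if (len <= 7) {
		uint64_t v = 0;
		memcpy(&v, pw, len); /* short password stored verbatim */
		return (v << 1) | 1; /* tag bit set: exact match only */
	}
	return fnv1a64(pw) << 1; /* longer password: 63-bit hash, tag clear */
}
```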
Even before I opened this issue, I had been thinking of how to use a cuckoo filter for this, and specifically of whether/how to integrate it with passwdqc (`pwqfilter`). I didn't mention this at first because I thought there was a major reason not to go this way: with a cuckoo filter, there's no quick and easy way to replace the oldest entries first. However, thinking of this more, maybe there's not that much need to. Maybe it's fine to replace random entries, or to probabilistically replace entries for less frequently seen candidates, if this enables us to store way more entries in the same size in bytes. With a ring buffer and a hash table, there's maybe a 2x size overhead from storing not only candidates or their hashes, but also next-in-bucket pointers, and from needing the hash table itself. A cuckoo filter avoids that, storing only the fingerprints.
To keep track of how frequently a candidate was produced, we could add a hit count field (a few bits, or just one bit maybe to distinguish 1 vs. 2+). Alternatively, we could simply be adding up to a few (or maybe up to 2) duplicate entries (a classic 2x4 cuckoo filter can hold up to 8 duplicates). With duplicates in there, our random replacements will less likely completely remove detection of a candidate that was already seen more than once.
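The cuckoo-filter approach above can be sketched as a toy illustration (hypothetical parameters and names, not pwqfilter's actual layout): 2 candidate buckets of 4 slots each, 16-bit fingerprints, and random replacement when the filter is too full to relocate:

```c
#include <stdint.h>
#include <stdlib.h>

#define CF_BUCKETS 1024 /* power of two */
#define CF_WAYS 4
#define CF_MAX_KICKS 128

static uint16_t cf[CF_BUCKETS][CF_WAYS]; /* 0 means empty slot */

static uint64_t fnv1a64(const char *s)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

static uint16_t cf_fp(uint64_t h)
{
	uint16_t fp = (uint16_t)(h >> 48);
	return fp ? fp : 1; /* reserve 0 for "empty" */
}

/* Partial-key cuckoo hashing: the alternate bucket is derived from the
 * fingerprint alone, so relocation doesn't need the original key. */
static uint32_t cf_alt(uint32_t i, uint16_t fp)
{
	return (i ^ (uint32_t)(fp * 0x5bd1e995u)) & (CF_BUCKETS - 1);
}

static int cf_lookup(const char *s)
{
	uint64_t h = fnv1a64(s);
	uint16_t fp = cf_fp(h);
	uint32_t i1 = (uint32_t)h & (CF_BUCKETS - 1), i2 = cf_alt(i1, fp);
	int w;

	for (w = 0; w < CF_WAYS; w++)
		if (cf[i1][w] == fp || cf[i2][w] == fp)
			return 1;
	return 0;
}

static void cf_insert(const char *s)
{
	uint64_t h = fnv1a64(s);
	uint16_t fp = cf_fp(h);
	uint32_t i = (uint32_t)h & (CF_BUCKETS - 1);
	int kick, w;

	for (kick = 0; kick < CF_MAX_KICKS; kick++) {
		for (w = 0; w < CF_WAYS; w++)
			if (!cf[i][w]) {
				cf[i][w] = fp;
				return;
			}
		w = rand() % CF_WAYS; /* evict a random slot, relocate it */
		uint16_t victim = cf[i][w];
		cf[i][w] = fp;
		fp = victim;
		i = cf_alt(i, fp);
	}
	/* Filter very full: overwrite a random entry instead of failing,
	 * per the "replace random entries" idea above (lossy on purpose). */
	cf[i][rand() % CF_WAYS] = fp;
}
```

Note how insertion never fails here: at high load it silently drops a random old fingerprint, which for opportunistic dupe suppression just means an occasional missed dupe.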
We can also optionally store the filter to disk (and read it back), or mmap it, or keep it in SysV shm. For example, it'd be possible to first run through the RockYou wordlist, then (even in a separate program invocation) proceed with incremental but skipping candidates that were on the wordlist. We'd need to have optional read-only filter mode where the filter is read-write e.g. during a wordlist(+rules) run, but is then read-only during subsequent e.g. incremental mode run (a mode that does not produce duplicates on its own).
If we use (only) the same filter file format that pwqfilter uses (which implies not adding a count field), it would even be possible to reuse files generated with pwqfilter, or to generate new and more compact files (than what JtR would generate while cracking) with pwqfilter quickly, if, say, the original wordlist run didn't produce/save a filter. It would also be possible to do the reverse: test only candidates that do appear in a filter - e.g., run any cracking mode's overlap with what's in HIBP (file size 2.5 GB, speed of a few million lookups per second) - if that would ever make sense (perhaps it would at least for research).
> this could be a ring buffer and a hash table. [...] need to be progressively removing old entries from the hash table as the ring buffer's contents are about to be overwritten with new entries.
BTW, we have this for single mode (with tiny per-salt buffers) since 1998 or so, see `single_add_key()` and `single_process_buffer()`. It's kind of weird we never implemented the same for wordlist mode. As I recall, the reasons why I treated single mode specially in this respect were that it required such per-salt buffering (separate from what `set_key()` would use) for other reasons anyway (such as to support bitslice DES) and that GECOS data + rules would produce many duplicates avoidable even with a tiny buffer.
Further, with variable-size entries (like `unique` already uses for different string lengths), we need to support a case when adding just one new entry requires removal of several previous entries from the hash table first. Single mode's implementation uses fixed maximum lengths, so it only removes one item from a hash table bucket prior to reusing that item's buffer space. This made sense back when this was typically used on descrypt hashes. We might want to update it to support variable lengths now, or to store fast hashes of passwords instead, or even to use per-salt cuckoo filters instead of these ring buffers and hash tables.
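To illustrate, here's a toy sketch (hypothetical names, not JtR code) of the "store fast hashes instead" variant: a ring buffer of 64-bit fingerprints with a chained hash table, where the oldest fingerprint is unlinked from the table just before its ring slot is overwritten:

```c
#include <stdint.h>

#define RB_SIZE 4     /* tiny, to make eviction easy to demonstrate */
#define RB_BUCKETS 16 /* power of two */

static uint64_t ring[RB_SIZE];
static int bucket[RB_BUCKETS]; /* head index into ring, or -1 */
static int nexti[RB_SIZE];     /* next-in-bucket chain */
static int head, filled;

static void rb_init(void)
{
	int i;

	head = filled = 0;
	for (i = 0; i < RB_BUCKETS; i++)
		bucket[i] = -1;
}

static uint64_t fnv1a64(const char *s)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Remove the entry at ring index idx from its hash table bucket. */
static void unlink_entry(int idx)
{
	int b = (int)(ring[idx] & (RB_BUCKETS - 1));
	int *p = &bucket[b];

	while (*p != idx)
		p = &nexti[*p];
	*p = nexti[idx];
}

/* Returns 1 if pw was seen recently (suppress it), 0 if new (records it). */
static int suppress(const char *pw)
{
	uint64_t fp = fnv1a64(pw);
	int b = (int)(fp & (RB_BUCKETS - 1));
	int i;

	for (i = bucket[b]; i != -1; i = nexti[i])
		if (ring[i] == fp)
			return 1;
	if (filled == RB_SIZE)
		unlink_entry(head); /* oldest entry is about to be overwritten */
	else
		filled++;
	ring[head] = fp;
	nexti[head] = bucket[b];
	bucket[b] = head;
	head = (head + 1) % RB_SIZE;
	return 0;
}
```

With fixed-size fingerprints, one new entry displaces exactly one old entry; the variable-length case from the issue text additionally needs a loop evicting entries until enough buffer space is freed.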
Unfortunately, we've already wasted the `--dupe-suppression` option name:

```
--dupe-suppression         suppress all duplicates from wordlist

Normally, consecutive duplicates are ignored when reading a wordlist file.
This switch enables full dupe suppression, using some memory and a little
extra start-up time. This option implies preload regardless of file size,
see the --mem-file-size option.
```
Should we reuse this name, enhance this option (make it accept a parameter), or add a new one?
As an option, we can rename the above to `--wordlist-dupe-suppression` and reuse the name as `--dupe-suppression[=SIZE]` (if `=SIZE` is not specified, then a default opportunistic dupe suppression cache size will be used - from `john.conf` or compile-time). That way, the two kinds of suppression can be used in arbitrary combinations.
The opportunistic dupe suppression shouldn't be limited to wordlist mode, but would also work e.g. in batch mode (for wordlist followed by incremental).
We can also have `--dupe-suppression=0` or `--no-dupe-suppression` if we ever make it the default (perhaps with a small default cache size and only for slow hashes - need to add a format flag, or adapt by actual p/s rate at runtime).
We could also (later) enhance the option to be `--dupe-suppression[=SIZE|FILE]`, to allow for (re)use of mmap'ed cache files (probably on tmpfs). For an existing file, the size shouldn't need to be specified - however, there's the question of how to specify it for a new file (just create the file with `dd`? or with `pwqfilter`? or also have a size+file syntax in john?)
Also, with ability to specify an external file, this is not necessarily dupe suppression - it can suppress anything matching file content, or anything not matching it. We could even support multiple such files and a Boolean expression around those, but this gets tricky. Also, we could have the file updated (usually reasonable if it's used for dupe suppression) or kept as-is. So many possibilities that we probably shouldn't try to come up with a syntax to support them all right away, and should get basic functionality in first.
> Should we reuse this name, enhance this option (make it accept a parameter), or add a new one?
I think we should reuse this, with the benefit we can also drop some code from the horrific `wordlist.c`.
> we can also drop some code from the horrific `wordlist.c`.
Do you suggest completely dropping the current functionality of `--dupe-suppression` (not keeping it under some other name)? I thought of keeping it as well, but now that you mention it, this actually makes sense to me.
That code would still have some merit (rejecting dupes before sending them through millions of rules), but given the sad state of that file, it's an opportunity to drop some code. We could decide on it later, renaming the current option to `--wordlist-dupe-suppression` like you said.
I am experimenting with a trivial implementation of this now (~100 LOC, hard-coded settings). Using our new default wordlist and `best64` rules and `--stdout | ./unique`, we had:

```
133327640p 0:00:00:23 100.00% (2022-02-02 11:24) 5697Kp/s srocks
Total lines read: 133327640, unique lines written: 74214355 (55%), no slow passes
```
With a 4M entry cache (1M x 4-way):

```
80842273p 0:00:00:23 100.00% (2022-02-02 12:07) 3454Kp/s yssmey
Total lines read: 80842273, unique lines written: 74214355 (91%), no slow passes
```
64M (16M x 4):

```
74754241p 0:00:00:23 100.00% (2022-02-02 12:13) 3133Kp/s yssmey
Total lines read: 74754241, unique lines written: 74214355 (99%), no slow passes
```
The reported p/s reduces mostly because it does not count the rejected duplicate candidates. More importantly, there's no real time increase here - looks like the dupe checking of all candidates costs roughly the same as the `puts()` of the now-rejected candidates did. For a rule set producing fewer dupes or/and for hashing faster than `puts()`, there would be a performance cost.
And here's 4K (1K x 4), which is much smaller than the wordlist size:

```
116727220p 0:00:00:20 100.00% (2022-02-02 12:27) 5638Kp/s srocks
Total lines read: 116727220, unique lines written: 74214355 (63%), no slow passes
```
Curiously, this still does provide significant improvement over not dupe-checking (we had `133327640p 0:00:00:23`, 55% unique), and somehow there's even a real time improvement here (with the cache being in CPU cache, perhaps the checks are really quick).
> That code would still have some merit (rejecting dupes before sending them through millions of rules)
Why would there be any dupes in an input wordlist in the first place, especially when the person knows about the problem (they wouldn't use the option otherwise)? I have only one explanation: truncation at loading. However, most relevant truncation (format-specific) applies after rules, not before. So the feature looks pretty useless to me, and I've never used it.
> but given the sad state of that file, it's an opportunity to drop some code. But we could decide on it later, renaming current option to `--wordlist-dupe-suppression` like you said.
For my testing now (and perhaps for initial commit), I am just adding my extra functionality to the same option, on top of its existing functionality. For small wordlists and lots of rules, the overhead of the old functionality is insignificant. For larger wordlists, I suppose we'll need to drop the old functionality or add a separate knob for it.
BTW, I've since tuned things some more (hash function, cache associativity, eviction policy), so that 4M (now as 0.5M x 8) gives:

```
80367531p 0:00:00:23 100.00% (2022-02-02 19:28) 3388Kp/s yssmey
Total lines read: 80367531, unique lines written: 74214355 (92%), no slow passes
```
That's only a tiny additional improvement because the previous was lightly tuned already.
> tuned things some more (hash function, cache associativity, eviction policy)
I now think I ended up overfitting my test case, especially as it relates to the cache eviction policy. In that case, it's beneficial to hold an initial set of candidates in cache "forever". Simply adding `--rules-skip-nop` changes that. So I'm now making changes for the eviction policy to make sense in general, even at the cost of reduced efficiency on my test case. Now at:

```
81974363p 0:00:00:19 100.00% (2022-02-02 22:41) 4171Kp/s yssmey
Total lines read: 81974363, unique lines written: 74214355 (90%), no slow passes
```
> > That code would still have some merit (rejecting dupes before sending them through millions of rules)
>
> Why would there be any dupes in an input wordlist in the first place, especially when the person knows about the problem (they wouldn't use the option otherwise)?
The de-dupe code kicks in (automagically) for `--loopback` mode, and IIRC that was the very reason it was added - the explicit use for wordlists was just a "bonus", but in nearly all situations it'd be more natural to have a unique'd wordlist in the first place.

Given `--loopback` would normally produce relatively few candidates, dropping it in favor of your new magic should be Good Enough™️ even for the millions-of-rules case.
On another note, do you have any thoughts on having forked nodes somehow sharing the "database" so one node wouldn't allow a candidate that some other node already used? I believe our current node/fork code often produces more dupes than running a single node, for several reasons.

For MPI it'd be even trickier, if not impossible (except under certain conditions). Perhaps something akin to hashcat's "brain" would work in that case; it's been a long time since I read up on that, and I never tried it.
> The de-dupe code kicks in (automagically) for `--loopback` mode
Oh, this makes sense now. Maybe we can keep it for that mode only for now, which doesn't require documentation.
> any thoughts on having forked nodes somehow sharing the "database" so one node wouldn't allow a candidate that some other node already used?
Yes, I've been thinking of adding a `suppress_preinit()` that we'd call before `fork()` if it's expected we'll need the suppressor. That call would pre-allocate memory as `MAP_SHARED`. Alternatively, if we allow specification of an external file, that could be used with a file on tmpfs. The data structure and opportunistic nature of this approach are such that we can proceed with concurrent access without locking.
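A sketch of that `suppress_preinit()` idea (the function body here is mine, not actual JtR code): allocate the region `MAP_SHARED` before `fork()` so all nodes see each other's updates; races only cause an occasional missed or duplicate suppression, which this opportunistic scheme tolerates without locks:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static uint64_t *suppressor_mem; /* hypothetical name */

/* Pre-allocate the suppressor's memory as MAP_SHARED before fork(). */
static int suppress_preinit(size_t size)
{
	suppressor_mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	return suppressor_mem == MAP_FAILED ? -1 : 0;
}

/* Demo: a child's write to the shared region is visible to the parent. */
static uint64_t shared_fork_demo(void)
{
	if (suppress_preinit(4096))
		return 0;

	pid_t pid = fork();
	if (pid == 0) {
		suppressor_mem[0] = 42; /* pretend to record a candidate */
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	return suppressor_mem[0]; /* parent sees the child's write */
}
```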
> > The de-dupe code kicks in (automagically) for `--loopback` mode
>
> Oh, this makes sense now. Maybe we can keep it for that mode only for now, which doesn't require documentation.
That'd be a great first step - perhaps we should do it within your current PR? If not, I presume using `--dupe-suppression` on a truly huge wordlist will result in memory hogging to the point of triggering the OOM killer. Alas, I'm completely swamped for the next few days, so I can't help with that. Should be trivial though.
I intend to bring the current PR to a merge-able state, and merge it, within the next day or two. Then I'll have to switch to other work.
Did you try using `OPT_TRISTATE`? I think `--no-dupe-suppression` is much more elegant than `--dupe-suppression=0`.
> Did you try using `OPT_TRISTATE`? I think `--no-dupe-suppression` is much more elegant than `--dupe-suppression=0`.
Didn't try - wrongly(?) thought it didn't support also having a parameter?
Anyway, I ran out of time for this, so that would be something you can fix later. ;-)
I can't remember the details, but here's a comment in `getopt.h`:
```c
/*
 * Spec. for OPT_TRISTATE:
 *
 * For format NULL (implies %d):
 * option not given        Set options.foobar to -1
 * --foobar                Set options.foobar to 1
 * --no-foobar             Set options.foobar to 0
 * --[no-]foobar=123       OPT_ERROR_PARAM_EXT
 *
 * OPT_BOOL is just a special case of the above, which doesn't touch
 * options.foobar if the option was not given (so will be 0).
 *
 * For format %d:
 * option not given        Set options.foobar to -1
 * --foobar                Set options.foobar to 1
 * --no-foobar             Set options.foobar to 0
 * --foobar=123            Set options.foobar to 123 (or OPT_ERROR_PARAM_REQ)
 * --no-foobar=123         OPT_ERROR_PARAM_EXT
 *
 * For format OPT_FMT_STR_ALLOC:
 * option not given        options.foobar is NULL (is not touched)
 * --foobar                if OPT_REQ_PARAM: OPT_ERROR_PARAM_REQ,
 *                         otherwise set options.foobar to OPT_TRISTATE_NO_PARAM
 * --no-foobar             Set options.foobar to OPT_TRISTATE_NEGATED
 * --foobar=filename       Alloc options.foobar and copy "filename" to it
 * --no-foobar=filename    OPT_ERROR_PARAM_EXT
 */
```
So looks usable. I can add it later (perhaps next week).
I tried it, it's actually this simple:
```diff
diff --git a/src/options.c b/src/options.c
index b54cafb6c..fbd760498 100644
--- a/src/options.c
+++ b/src/options.c
@@ -162,7 +162,7 @@ static struct opt_entry opt_list[] = {
 	{"subformat", FLG_ONCE, 0, 0, USUAL_REQ_CLR | FLG_STDOUT | OPT_REQ_PARAM, OPT_FMT_STR_ALLOC, &options.subformat},
 	{"list", FLG_ONCE, 0, 0, USUAL_REQ_CLR | FLG_STDOUT | OPT_REQ_PARAM, OPT_FMT_STR_ALLOC, &options.listconf},
 	{"mem-file-size", FLG_ONCE, 0, FLG_WORDLIST_CHK, FLG_DUPESUPP | FLG_STDIN_CHK | FLG_PIPE_CHK | OPT_REQ_PARAM, Zu, &options.max_wordfile_memory},
-	{"dupe-suppression", FLG_DUPESUPP, FLG_DUPESUPP, FLG_WORDLIST_CHK, 0, "%d", &options.suppressor_size},
+	{"dupe-suppression", FLG_DUPESUPP, FLG_DUPESUPP, FLG_WORDLIST_CHK, OPT_TRISTATE, "%d", &options.suppressor_size},
 /*
  * --fix-state-delay=N is deprecated and ignored, drop support after releasing 1.9.0-Jumbo-2
  */
@@ -284,7 +284,7 @@ static struct opt_entry opt_list[] = {
 "--rules-skip-nop          Skip any NOP \":\" rules (you already ran w/o rules)\n" \
 "--loopback[=FILE]         Like --wordlist, but extract words from a .pot file\n" \
 "--mem-file-size=SIZE      Size threshold for wordlist preload (default %u MB)\n" \
-"--dupe-suppression[=SIZE] Opportunistic dupe suppression for wordlist+rules\n" \
+"--[no-]dupe-suppression[=SIZE] Opportunistic dupe suppression for wordlist+rules\n" \
 "--incremental[=MODE]      \"Incremental\" mode [using section MODE]\n" \
 "--incremental-charcount=N Override CharCount for incremental mode\n" \
 "--external=MODE           External mode or word filter\n" \
```
Unfortunately the usage blurb line is "too long". If we omit the `[no-]` from it, tab completion will not work for that.
> I tried it, it's actually this simple:
Oh, but that would mean just saying `--dupe-suppression` will infer a size of 1? That probably needs to be handled somewhere else in your code.
Right now, `--dupe-suppression` without a parameter means to use the default from `john.conf` or the compile-time default. We should preserve that functionality. Also, the description in `doc/OPTIONS` should be kept consistent with the option's actual behavior. So I'll leave this move to tri-state out of the current PR.
> will infer a size of 1?
BTW, if we simply replace 1 with `DEFAULT_SIZE`, that would also affect explicit `--dupe-suppression=1`, which should mean 1 MiB, not the default. With these complications, I think we should just leave things as-is, and not use tri-state there.
When would `--dupe-suppression` be needed at all? I mean, when is it not turned on by default?
> When would `--dupe-suppression` be needed at all? I mean, when is it not turned on by default?
Specifying this option forces the dupe suppressor to stay enabled even if its efficiency appears low - this can be useful e.g. if more dupes are expected later in the run. In fact, I am repeating here what I just wrote in `doc/OPTIONS` - please take a look.
> please take a look
I did, I just have a teflon brain >.<
Anyway, I played around with your code a little and it works like a champ. I also like the long status - we should just add some more info to it later.
Some ideas currently left unimplemented:

- `mmap()` (with fallback on systems lacking it), also with explicit huge pages (and retry on failed allocation) - we have such logic for `region_t` in `memory.[ch]` copy-pasted from yescrypt, but it doesn't guarantee the memory reads as zero in all of its allocation types, so we can't reuse that directly (I actually tried and it worked well on my Linux, but I then realized and reverted).
- [...] `MAP_SHARED` region before `fork()`. So we need our own memory allocation function anyway.
- `mmap()` of external files (perhaps located on `tmpfs`).
- `passwdqc`-compatible external filter files. One way to do that would be to have the cuckoo mode described above exactly match the data layout used by `passwdqc`. This would also support 33-bit hashes (yes, one bit better than 32, which is possible due to semi-sort), but unfortunately then we have no good way to postpone recently hit ones from eviction (in the current implementation, we do that by re-ordering them inside the bucket).

More related unimplemented ideas:

- [...] `cracker.c`. However, this doesn't fit well into the rest of our code. We could implement buffering in `suppressor.c`, having `suppressor_process_key` return without having yet called `old_process_key`, and only call the latter in a loop after the buffer fills up. We'd also need to add a `suppressor_flush` and call it at the end of a cracking mode, so that the final non-full buffer is also processed. Unfortunately, this approach would involve extra copying of the candidate passwords.
- [...] `WAIT_SLEEP`. That way, we'd fully eliminate the suppressor's slowdown of medium-speed OpenCL formats.
- The suppressor could be a lot faster if it processed more than one candidate in parallel. In fact, with a large enough buffer we could use OpenMP in the suppressor.
Observations from trying to use the suppressor for real:

- `mem_calloc_align` followed by random writes is initially slow when the allocation size is large - took ~3 minutes for ~100 GB - because this makes the kernel allocate small pages in random order, then juggle them and maybe combine them into huge pages.
- [...] `--rules-skip-nop` because I had already run the wordlist without rules, I might want to pre-populate the suppressor with wordlist entries as-is. For another example, when restoring a session with a very slow hash, I might want to have it run through the previously-completed rules and wordlist entries without actual cracking, to populate the suppressor with those candidates. Support for keeping the suppressor data in a file would also mostly address these needs - even if I didn't previously store a file, I can recreate it quickly by running the desired rules on a dummy fast hash or with `--stdout` mode (maybe we'd want to introduce a suppressor-update-only flag then, to make these even faster).
> The current `mem_calloc_align` followed by random writes is initially slow when the allocation size is large - took ~3 minutes for ~100 GB - because this makes the kernel allocate small pages in random order

Actually, no - looking inside `mem_calloc_align`, it has its own `memset`, which means a sequential write before reaching our random writes. Then it's some kernel-internal slowdown, presumably when the memory is fragmented from prior use, which we might not be able to do much about.
Another thought I'd like to write down:

The current hash/indexing function uses a 32-bit hash value, which means it has significant biases when N is somewhat close to 2^32 yet not a power of 2. With the current K=8, this probably has the worst impact on efficiency at a size of 192 GiB. To avoid this, and also to support more than 256 GiB without increasing K, we'd need to upgrade to a 64-bit index hash (and re-test/adjust it for uniformity again) and either move away from the current multiplication-based conversion from hash to index to classic modulo division, or (where available) use a 64x64 to 128-bit multiplication intrinsic/instruction (perhaps with fallback to modulo division where this isn't available, or limiting the size to 256 GiB on those platforms).
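The 64-bit variant of that multiplication-based conversion could look like this (this multiply-shift mapping is the well-known "fastrange" trick; the fallback is plain modulo division):

```c
#include <stdint.h>

/* Map a 64-bit hash to an index in [0, n) without modulo bias and without
 * a division, via a 64x64->128-bit multiply.  Falls back to modulo where
 * unsigned __int128 is unavailable. */
static inline uint64_t hash_to_index(uint64_t h, uint64_t n)
{
#ifdef __SIZEOF_INT128__
	return (uint64_t)(((unsigned __int128)h * n) >> 64);
#else
	return h % n;
#endif
}
```

GCC and Clang compile the `__int128` path to a single `mul` plus taking the high word on x86-64, so it's cheaper than the division-based fallback.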
> > The current `mem_calloc_align` followed by random writes is initially slow when the allocation size is large - took ~3 minutes for ~100 GB - because this makes the kernel allocate small pages in random order
>
> Actually, no - looking inside `mem_calloc_align`, it has its own `memset`, which means a sequential write before reaching our random writes. Then it's some kernel-internal slowdown, presumably when the memory is fragmented from prior use, which we might not be able to do much about.
Perhaps try the same test using `mem_calloc()` to see what happens? That function uses the real `calloc`. I presume you'd get page-sized alignment for most large allocations anyway, so as good as `MEM_ALIGN_CACHE` - but I'm on thin ice here.
> I presume you'd get page-sized alignment for most large allocations anyway, so as good as `MEM_ALIGN_CACHE` - but I'm on thin ice here.
Some empirical tests show that my assumption holds under macOS but not under Linux. And anyway in this case I guess the current behavior is what we want.
> in this case I guess the current behavior is what we want.
Yes, I was actually going to try adding the `memset` before I realized that's what we do already. A possible improvement would be explicit use of huge pages.
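A sketch of such explicit huge page use (Linux-specific flags; function name is mine, and for the `MAP_HUGETLB` attempt to succeed, callers would round `size` up to the huge page size):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Try explicit huge pages first, retry with normal pages on failure
 * (e.g. no huge pages reserved on the system), then hint transparent
 * huge pages.  MAP_ANONYMOUS memory is guaranteed zero-filled, which
 * mem_calloc_align() currently achieves with its own memset. */
static void *alloc_zeroed_maybe_huge(size_t size)
{
	void *p = MAP_FAILED;

#ifdef MAP_HUGETLB
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
#endif
	if (p == MAP_FAILED) {
		p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return NULL;
#ifdef MADV_HUGEPAGE
		madvise(p, size, MADV_HUGEPAGE); /* best-effort THP hint */
#endif
	}
	return p;
}
```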
On another topic, I guess the LockHalf
feature is less likely to help (and more likely to hurt) when restoring a session that's already progressed beyond the no-op rule. When starting with the no-op and other simple rules, it locks actually common passwords/patterns in memory. When restoring at some arbitrary point in the rule set, it'd lock an arbitrarily weird set of passwords/patterns. So maybe we should experiment with this and then likely ignore this setting when restoring, or when restoring beyond first rule, or make the setting a tri-state to have this new behavior optional. We'd also need to document any such change.
```
# Whether to lock the oldest half of entries as write-once (and only ever
# update the other half). Empirically, this often provides best results,
# but it can also backfire. The default is yes.
LockHalf = Y
```
Edit: Oh, I recalled I did test this with `--rules-skip-nop` on `best64` and it was still beneficial - but perhaps it wouldn't be way further into a typical rule set?
As currently implemented and by default (without `--dupe-suppression`), the suppressor will auto-disable itself after ~67M candidates if it doesn't suppress anything by that point. However, some wordlists are all-unique and are larger than that, and dupes would start being produced by the second rule and on. We could prevent the auto-disabling:

- [...] `crackstation.txt`, which some people will use? or maybe `2*N*K`, so that it's only larger than current when `Size` is increased? or make it a function of p/s rate?)

We could also add a keypress to manually disable the suppressor, to be used when running on a fast hash.
Edit: BTW, `crackstation-human-only.txt` is just under 64M, `crackstation.txt` is just over 1.2B (decimal).
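For illustration, the auto-disable conditions discussed above might combine like this (purely hypothetical names and constants, not the current implementation):

```c
#include <stdint.h>

/* Hypothetical sketch of a less trigger-happy auto-disable check: raise
 * the fixed threshold to a function of cache size (2*N*K fingerprints)
 * and skip auto-disabling entirely when the p/s rate is low enough that
 * the suppressor's overhead doesn't matter. */
static int suppressor_should_disable(uint64_t seen, uint64_t suppressed,
    uint64_t n, uint64_t k, double cands_per_sec, int forced)
{
	if (forced)
		return 0; /* explicit --dupe-suppression: stay enabled */
	if (cands_per_sec < 1e6)
		return 0; /* slow enough that suppression is ~free */

	uint64_t threshold = 2 * n * k;
	if (threshold < 67000000ULL)
		threshold = 67000000ULL; /* keep at least the current ~67M */
	return seen >= threshold && suppressed == 0;
}
```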
Taking p/s into account is a no-brainer, that would be great.
> that knowledge is currently unavailable to the suppressor.

Adding a getter function in `rules.c` would also help in implementing "processing rule x/y" in detailed status.
We currently have full duplicate candidate suppression in the separate `unique` tool, and we also reuse code from there to suppress duplicate input wordlist entries and wordlist rules. However, when we're running through wordlist+rules (maybe after suppression of duplicates within the wordlist and within the rules separately), we only suppress consecutive duplicates (compare against the immediately preceding candidate password). I think we could (optionally) do better (for slow hashes or/and many salts) by maintaining and comparing against a configurable-size buffer of previous candidates.

In terms of implementation, I think this could be a ring buffer and a hash table. In `unique`, we use a non-ring buffer and a hash table. We reset them each time the buffer fills up. I think for this new feature we would instead need to be progressively removing old entries from the hash table as the ring buffer's contents are about to be overwritten with new entries. Further, with variable-size entries (like `unique` already uses for different string lengths), we need to support a case when adding just one new entry requires removal of several previous entries from the hash table first.

We can possibly improve the effectiveness of such dupe suppression (percentage suppressed per buffer size) by maintaining (recent) hit counts on entries, so that frequently or/and recently hit entries would be retained for longer even if they're near the overwrite pointer (were entered into the ring buffer long ago). We can either keep them intact (jump over them in our overwrites) or re-add them right away (if needed to avoid gaps that would result from changes in nearby string lengths).
We can also reuse this logic to implement a hashcat brain alike later. Per my reading of hashcat brain documentation, it cannot maintain a maximum size and thus could eventually run out of storage resources and fail. The alternative I suggest here can instead maintain a maximum size, at the expense of imperfect dupe suppression. So it could make sense as an alternative brain implementation, including for use with hashcat if we implement the same protocol.
In fact, my motivation behind suggesting this feature now is needing to run some slow and lengthy wordlist+rules attacks where the resulting candidate password streams are too large to comfortably pass them through `unique` (or `rling` or `duplicut`) in advance and store the results, or to use hashcat brain. Say, terabytes. I am thinking that with a ring buffer size larger than the input wordlist size (say, a few GB or tens of GB, so still fitting in RAM) I could probably suppress most duplicates without incurring as much storage and processing overhead, and also with the processing overhead comfortably spread over the attack duration. I'd first simulate this on a similar-looking but much smaller wordlist with the same ruleset, comparing the percentage of dupes suppressed by `unique` or the like vs. this opportunistic approach with a ring buffer size scaled to the smaller wordlist's size (e.g., 10x the wordlist size if that's what I'd be able to afford in the target use case). Once confirmed that only acceptably few dupes are left with the opportunistic approach in this scaled-down simulation, I'd proceed to run the real attack.
or the like vs. this opportunistic approach with a ring buffer size scaled to the smaller wordlist's size (e.g., 10x the wordlist size if that's what I'd be able to afford in the target use case). Once confirmed that only acceptably few dupes are left with the opportunistic approach in this scaled-down simulation, I'd proceed to run the real attack.