Open magnumripper opened 5 years ago
Personally, I'd prefer to have any "RFC / discussion" stuff on john-dev or john-users (depending on source code focus vs. not), rather than buried in GitHub issues (which might not last).
For "slow hash" GPU formats, I think we simply need to set min_keys_per_crypt
high enough to utilize the device. And this is where single mode's large rule count and buffering of cracked passwords for testing them against all salts both will help. Of course with a large salt count this will be taking a long while, but that's mostly not a matter of efficiency - it's a matter of the amount of computation to do.
We could improve things for slow hashes combined with large salt counts by supporting having an arbitrary mix of salts and candidate passwords in a single crypt_all()
call, but that's a major redesign not limited to single mode. It'd also help with frequency-sorted wordlists, etc.
Brainstorming the GPU case:
The dimensioning factors are number of salts and max password length. At length 8 and one salt, we can obviously afford using min_kpc = max_kpc
. But at length 125 and 1,000 salts and a very fast format that wants a GWS of 512K, that would be 64 GB of buffers.
So KISS™, how about letting the shared auto-tune code simply know about the above formula, auto-tune max_keys/GWS just like today, then start with min_kpc = max_kpc and halving min_kpc until the buffer size gets sane? Pretty easy. What figure should we aim to go below? Perhaps 1 GB default, with option to override in config?
After 7c82c39, all OpenCL formats use the optimal min. kpc given the limits (15-bit indeces) present in single mode.
I have ensured that --max-len
option does affect the single mode buffers (it already did). This means we can get much better performance for some formats that supports a large max. length, by capping it using eg. --max-len=16
. We could possible warn/hint for this.
Here's single.c
:
/*
* We use "short" for buffered key indices and "unsigned short" for buffered
* key offsets - make sure these don't overflow.
*/
if (key_count > 0x8000)
key_count = 0x8000;
while (key_count > 0xffff / length + 1)
key_count >>= 1;
I haven't dug closer into this at all - but could we not make the buffered key indices unsigned as well?
Edit: I guess that would not help unless max length is 0 which doesn't make sense anyway.
As long as we're stuck with 15-bit indeces and 16-bit offsets, we can't ever get a min. KPC larger than 4096 for a max_length of 16, if I understand things right? For many GPU formats, that wont do it.
I'll try to make it configurable in params.h (defaulting to 32-bit if HAVE_OPENCL) and see where that goes.
That would go for HAVE_ZTEX also, right?
@solardiz I'm not sure yet what SINGLE_HASH_SIZE
will affect. Should this be bumped too if I bump the max. possible keys per crypt / per salt? Would it speed up loading or running, or both?
For ZTEX, we commonly only need a minimum of a few hundred candidate passwords per board (except md5crypt-ztex
and phpass-ztex
, where it's 384 per chip, so 1.5k per board), but this can mean a minimum of a few to tens of thousands for a multi-board cluster. So yes, it's possible the 16-bit offsets would be insufficient for a ZTEX cluster without a reduced max length.
There's also OpenMP on Xeon Phi, including the modern 2nd gen with AVX-512, which we easily support. (We also support 1st gen with MIC, but that's trickier to compile jumbo for.) With ~250 hardware threads and 512 min_keys_per_crypt for bitslice implementations (which we have for a few DES-based formats), that's a minimum of ~130k candidate passwords to fully utilize the hardware. (It's reasonable to run twice fewer threads and still mostly utilize all cores when using single mode, but we wouldn't expect users to know to do that.)
In fact, with OpenMP, SMT, and bitslicing even today's typical Intel's 6-core CPUs with AVX2 need quite a large min_keys_per_crypt of e.g. 6*2*256 = 3072. This number grows further for AMD Ryzen and for multi-socket servers.
So I am wondering whether we should possibly support larger than 16-bit offsets in OpenMP builds in general.
As to SINGLE_HASH_SIZE
, as I recall that hash table is used to quickly eliminate duplicates within a buffer (refuse to add a duplicate). If the buffer is made larger, then yes, I think we also need a larger hash table. This will affect runtime speed, except when cracking very slow hashes.
Other thoughts beyond current work on this issue:
While buffering a large number of single mode passwords is required to fully utilize the hardware, it can result in highly non-optimal successful guess rate because those candidates are ordered from most likely to least likely. It'd be preferable to have a smaller buffer and to use the hardware for attacking multiple salts in parallel - but this doesn't fit into the currently implemented formats interface. This leads me to:
Maybe we could implement optional splitting of salts between nodes (when the number of salts is at least the same, and ideally a multiple of, the number of nodes) and then --node and --fork would achieve an effect similar to what a revised formats API would have. IIRC, this is how scripts bundled with Crack distributed the workload between nodes (and a drawback was they wouldn't work on fewer salts than nodes). Maybe we finally identified a reason to have this mode of operation too.
OK, I'll use 32-bit types and bump SINGLE_HASH_SIZE
to 15 for OpenCL/ZTEX/OpenMP. That's a moderate figure but it seems to eat memory. I think I'll change it to be a run-time decision (using that macro as a maximum) instead.
Anyway this is way cool speeds; WPA-PSK running a thousand salts:
Device 6: GeForce GTX TITAN X
Loaded 1119 password hashes with 1119 different salts (wpapsk-opencl, WPA/WPA2/PMF/PMKID PSK [PBKDF2-SHA1 OpenCL])
Remaining 1067 password hashes with 1067 different salts
Loaded hashes with cost 1 (key version [0:PMKID 1:WPA 2:WPA2 3:802.11w]) varying from 0 to 2
Note: Minimum length forced to 8 by format
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:09 0.01% (ETA: 2018-12-02 08:50) 0g/s 111093p/s 111093c/s 111093C/s GPU:51C util:95% candidates..redacted
0g 0:00:00:37 0.01% (ETA: 2018-12-05 10:12) 0g/s 212810p/s 212810c/s 212810C/s GPU:64C util:96% candidates..redacted
0g 0:00:01:23 0.01% (ETA: 2018-12-07 13:14) 0g/s 219988p/s 219988c/s 219988C/s GPU:74C util:96% candidates..redacted
0g 0:00:03:09 0.03% (ETA: 2018-12-08 22:17) 0g/s 220150p/s 220150c/s 220150C/s GPU:78C util:96% candidates..redacted
Benchmark says 287018 c/s for a single salt. The auto-tune is a bit more restrictive for single mode (currently a 25% boost needed to use higher figure) but I'm not done yet 😎
When tuning SINGLE_HASH_SIZE
we should consider not only memory usage, but also speed. When that hash table no longer fits in a certain CPU cache layer, speed may become worse. 15 probably means 2^15*8 = 256 KiB, which is probably OK, but you could test if e.g. 14 is possibly same speed or faster (when cracking fast hashes).
Cool speeds for single mode on GPU, yes.
I'm thinking we might want to bump SINGLE_WORDS_PAIR_MAX defaults, for faster filling of huge buffers. I was thinking 12 for GPUs and 8 for OpenMP but I don't really base that on any real data. Any suggestion?
I'm OK with adjusting SINGLE_WORDS_PAIR_MAX
based on device type, or at runtime based on the format's min_keys_per_crypt
, although either of these will affect how many passwords actually get cracked, which may confuse people. OTOH, single mode's password cracked are already somewhat unstable, depending on which hashes already had their passwords cracked vs. not at the time single mode was (re)started.
I think this is worth testing on real passwords to come up with reasonable settings.
Also note that having OpenMP support compiled in doesn't imply we're running more than one thread in a given invocation. In fact, some formats are not even OpenMP-capable.
Related: #2425.
I'm experimenting with something like this:
if (log2(single_db->format->params.min_keys_per_crypt) > words_pair_max) {
words_pair_max = log2(single_db->format->params.min_keys_per_crypt);
log_event("- SingleWordsPairMax increased to %d for high KPC (%d)",
log2(single_db->format->params.min_keys_per_crypt),
single_db->format->params.min_keys_per_crypt);
}
This may bump it, depending on actual runtime min_kpc regardless of if it's OpenMP, AVX-512 bitslice or a big bad GPU. And it wont ever end up too dramatic. But yes, I will test various settings against some data sets.
I went with the above for now. It's pretty sane: At a min. KPC of 1..127, it doesn't change the default of 6. At 128..255 it bumps to 7. At moderate GPU figures of eg. 32768..65535, it's 15 and with fast formats without internal mask that wants eg. 1M keys, we're up at 20. If anything, we could use a steeper increase rate.
I only now realized we can mix the types like this:
#if HAVE_OPENCL || HAVE_ZTEX
#define SINGLE_KEYS_TYPE int32_t
#define SINGLE_KEYS_UTYPE uint32_t
#define SINGLE_IDX_MAX (INT32_MAX + 1U)
#define SINGLE_BUF_MAX UINT32_MAX
#elif HAVE_OPENMP
#define SINGLE_KEYS_TYPE int16_t
#define SINGLE_KEYS_UTYPE uint32_t
#define SINGLE_IDX_MAX 0x8000
#define SINGLE_BUF_MAX UINT32_MAX
#else
#define SINGLE_KEYS_TYPE int16_t
#define SINGLE_KEYS_UTYPE uint16_t
#define SINGLE_IDX_MAX 0x8000
#define SINGLE_BUF_MAX 0xffff
#endif
The middle case will limit us to a min_keys of 32768 but that should be enough for the OpenMP case. It halves the total memory use for the buffer, which is important at many salts. Best would be to pick type at run-time: Half-slow OpenCL formats like WPA-PSK runs just fine at 32768 but for faster ones it'd be a bad limit.
Some discussion about the log2
(or not): https://github.com/openwall/john/issues/4110#issuecomment-679109274
Single mode is one of JtR's most important features and AFAIK no other cracker has anything near it. But it's ~very~ relatively slow, mostly (or solely) due to rules processing. The fact GPU formats normally can't be used doesn't help, of course.
I intend to have a look-see & brainstorm for improvements. NOTE: The rules engine is a separate issue #3468.
3
for referencing the third word, put possibly also some others: We have+
now, for "the concatenation of both" but perhaps it's enough to just change that to "for concatenation of all". We also have a-p
reject flag "reject this rule unless word pair commands are currently allowed" so we'd add a-t
flag for "reject this rule unless word triple commands are currently allowed" (note that this would also early-reject when word triples are not opted-in by some option).