Single mode performance

magnumripper commented 5 years ago

Single mode is one of JtR's most important features and AFAIK no other cracker has anything near it. But it's ~very~ relatively slow, mostly (or solely) due to rules processing. The fact GPU formats normally can't be used doesn't help, of course.

I intend to have a look-see & brainstorm for improvements. NOTE: The rules engine is a separate issue #3468.

[x] Perhaps some minor redesign could make it better suited for GPU?
[x] Audit / profile for bottlenecks. Jumbo may well have some bottleneck that John proper doesn't have.
[ ] Perhaps optionally (opt-in) have word triples in addition to word pairs. This could dramatically increase the number of words produced. New rules commands needed would obviously include 3 for referencing the third word, put possibly also some others: We have + now, for "the concatenation of both" but perhaps it's enough to just change that to "for concatenation of all". We also have a -p reject flag "reject this rule unless word pair commands are currently allowed" so we'd add a -t flag for "reject this rule unless word triple commands are currently allowed" (note that this would also early-reject when word triples are not opted-in by some option).

solardiz commented 5 years ago

Personally, I'd prefer to have any "RFC / discussion" stuff on john-dev or john-users (depending on source code focus vs. not), rather than buried in GitHub issues (which might not last).

For "slow hash" GPU formats, I think we simply need to set min_keys_per_crypt high enough to utilize the device. And this is where single mode's large rule count and buffering of cracked passwords for testing them against all salts both will help. Of course with a large salt count this will be taking a long while, but that's mostly not a matter of efficiency - it's a matter of the amount of computation to do.

We could improve things for slow hashes combined with large salt counts by supporting having an arbitrary mix of salts and candidate passwords in a single crypt_all() call, but that's a major redesign not limited to single mode. It'd also help with frequency-sorted wordlists, etc.

magnumripper commented 5 years ago

Brainstorming the GPU case: The dimensioning factors are number of salts and max password length. At length 8 and one salt, we can obviously afford using min_kpc = max_kpc. But at length 125 and 1,000 salts and a very fast format that wants a GWS of 512K, that would be 64 GB of buffers.

So KISS™, how about letting the shared auto-tune code simply know about the above formula, auto-tune max_keys/GWS just like today, then start with min_kpc = max_kpc and halving min_kpc until the buffer size gets sane? Pretty easy. What figure should we aim to go below? Perhaps 1 GB default, with option to override in config?

magnumripper commented 5 years ago

After 7c82c39, all OpenCL formats use the optimal min. kpc given the limits (15-bit indeces) present in single mode.

I have ensured that --max-len option does affect the single mode buffers (it already did). This means we can get much better performance for some formats that supports a large max. length, by capping it using eg. --max-len=16. We could possible warn/hint for this.

magnumripper commented 5 years ago

Here's single.c:

/*
 * We use "short" for buffered key indices and "unsigned short" for buffered
 * key offsets - make sure these don't overflow.
 */
    if (key_count > 0x8000)
        key_count = 0x8000;
    while (key_count > 0xffff / length + 1)
        key_count >>= 1;

I haven't dug closer into this at all - but could we not make the buffered key indices unsigned as well?

Edit: I guess that would not help unless max length is 0 which doesn't make sense anyway.

magnumripper commented 5 years ago

As long as we're stuck with 15-bit indeces and 16-bit offsets, we can't ever get a min. KPC larger than 4096 for a max_length of 16, if I understand things right? For many GPU formats, that wont do it.

I'll try to make it configurable in params.h (defaulting to 32-bit if HAVE_OPENCL) and see where that goes.

magnumripper commented 5 years ago

That would go for HAVE_ZTEX also, right?

magnumripper commented 5 years ago

@solardiz I'm not sure yet what SINGLE_HASH_SIZE will affect. Should this be bumped too if I bump the max. possible keys per crypt / per salt? Would it speed up loading or running, or both?

solardiz commented 5 years ago

For ZTEX, we commonly only need a minimum of a few hundred candidate passwords per board (except md5crypt-ztex and phpass-ztex, where it's 384 per chip, so 1.5k per board), but this can mean a minimum of a few to tens of thousands for a multi-board cluster. So yes, it's possible the 16-bit offsets would be insufficient for a ZTEX cluster without a reduced max length.

There's also OpenMP on Xeon Phi, including the modern 2nd gen with AVX-512, which we easily support. (We also support 1st gen with MIC, but that's trickier to compile jumbo for.) With ~250 hardware threads and 512 min_keys_per_crypt for bitslice implementations (which we have for a few DES-based formats), that's a minimum of ~130k candidate passwords to fully utilize the hardware. (It's reasonable to run twice fewer threads and still mostly utilize all cores when using single mode, but we wouldn't expect users to know to do that.)

In fact, with OpenMP, SMT, and bitslicing even today's typical Intel's 6-core CPUs with AVX2 need quite a large min_keys_per_crypt of e.g. 6*2*256 = 3072. This number grows further for AMD Ryzen and for multi-socket servers.

So I am wondering whether we should possibly support larger than 16-bit offsets in OpenMP builds in general.

As to SINGLE_HASH_SIZE, as I recall that hash table is used to quickly eliminate duplicates within a buffer (refuse to add a duplicate). If the buffer is made larger, then yes, I think we also need a larger hash table. This will affect runtime speed, except when cracking very slow hashes.

Other thoughts beyond current work on this issue:

While buffering a large number of single mode passwords is required to fully utilize the hardware, it can result in highly non-optimal successful guess rate because those candidates are ordered from most likely to least likely. It'd be preferable to have a smaller buffer and to use the hardware for attacking multiple salts in parallel - but this doesn't fit into the currently implemented formats interface. This leads me to:

Maybe we could implement optional splitting of salts between nodes (when the number of salts is at least the same, and ideally a multiple of, the number of nodes) and then --node and --fork would achieve an effect similar to what a revised formats API would have. IIRC, this is how scripts bundled with Crack distributed the workload between nodes (and a drawback was they wouldn't work on fewer salts than nodes). Maybe we finally identified a reason to have this mode of operation too.

magnumripper commented 5 years ago

OK, I'll use 32-bit types and bump SINGLE_HASH_SIZE to 15 for OpenCL/ZTEX/OpenMP. That's a moderate figure but it seems to eat memory. I think I'll change it to be a run-time decision (using that macro as a maximum) instead.

Anyway this is way cool speeds; WPA-PSK running a thousand salts:

Device 6: GeForce GTX TITAN X
Loaded 1119 password hashes with 1119 different salts (wpapsk-opencl, WPA/WPA2/PMF/PMKID PSK [PBKDF2-SHA1 OpenCL])
Remaining 1067 password hashes with 1067 different salts
Loaded hashes with cost 1 (key version [0:PMKID 1:WPA 2:WPA2 3:802.11w]) varying from 0 to 2
Note: Minimum length forced to 8 by format
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:09 0.01% (ETA: 2018-12-02 08:50) 0g/s 111093p/s 111093c/s 111093C/s GPU:51C util:95% candidates..redacted
0g 0:00:00:37 0.01% (ETA: 2018-12-05 10:12) 0g/s 212810p/s 212810c/s 212810C/s GPU:64C util:96% candidates..redacted
0g 0:00:01:23 0.01% (ETA: 2018-12-07 13:14) 0g/s 219988p/s 219988c/s 219988C/s GPU:74C util:96% candidates..redacted
0g 0:00:03:09 0.03% (ETA: 2018-12-08 22:17) 0g/s 220150p/s 220150c/s 220150C/s GPU:78C util:96% candidates..redacted

Benchmark says 287018 c/s for a single salt. The auto-tune is a bit more restrictive for single mode (currently a 25% boost needed to use higher figure) but I'm not done yet 😎

solardiz commented 5 years ago

When tuning SINGLE_HASH_SIZE we should consider not only memory usage, but also speed. When that hash table no longer fits in a certain CPU cache layer, speed may become worse. 15 probably means 2^15*8 = 256 KiB, which is probably OK, but you could test if e.g. 14 is possibly same speed or faster (when cracking fast hashes).

Cool speeds for single mode on GPU, yes.

magnumripper commented 5 years ago

I'm thinking we might want to bump SINGLE_WORDS_PAIR_MAX defaults, for faster filling of huge buffers. I was thinking 12 for GPUs and 8 for OpenMP but I don't really base that on any real data. Any suggestion?

solardiz commented 5 years ago

I'm OK with adjusting SINGLE_WORDS_PAIR_MAX based on device type, or at runtime based on the format's min_keys_per_crypt, although either of these will affect how many passwords actually get cracked, which may confuse people. OTOH, single mode's password cracked are already somewhat unstable, depending on which hashes already had their passwords cracked vs. not at the time single mode was (re)started.

I think this is worth testing on real passwords to come up with reasonable settings.

Also note that having OpenMP support compiled in doesn't imply we're running more than one thread in a given invocation. In fact, some formats are not even OpenMP-capable.

solardiz commented 5 years ago

Related: #2425.

magnumripper commented 5 years ago

I'm experimenting with something like this:

    if (log2(single_db->format->params.min_keys_per_crypt) > words_pair_max) {

        words_pair_max = log2(single_db->format->params.min_keys_per_crypt);

        log_event("- SingleWordsPairMax increased to %d for high KPC (%d)",
                  log2(single_db->format->params.min_keys_per_crypt),
                  single_db->format->params.min_keys_per_crypt);
    }

This may bump it, depending on actual runtime min_kpc regardless of if it's OpenMP, AVX-512 bitslice or a big bad GPU. And it wont ever end up too dramatic. But yes, I will test various settings against some data sets.

magnumripper commented 5 years ago

I went with the above for now. It's pretty sane: At a min. KPC of 1..127, it doesn't change the default of 6. At 128..255 it bumps to 7. At moderate GPU figures of eg. 32768..65535, it's 15 and with fast formats without internal mask that wants eg. 1M keys, we're up at 20. If anything, we could use a steeper increase rate.

magnumripper commented 5 years ago

I only now realized we can mix the types like this:

#if HAVE_OPENCL || HAVE_ZTEX
#define SINGLE_KEYS_TYPE        int32_t
#define SINGLE_KEYS_UTYPE       uint32_t
#define SINGLE_IDX_MAX          (INT32_MAX + 1U)
#define SINGLE_BUF_MAX          UINT32_MAX
#elif HAVE_OPENMP
#define SINGLE_KEYS_TYPE        int16_t
#define SINGLE_KEYS_UTYPE       uint32_t
#define SINGLE_IDX_MAX          0x8000
#define SINGLE_BUF_MAX          UINT32_MAX
#else
#define SINGLE_KEYS_TYPE        int16_t
#define SINGLE_KEYS_UTYPE       uint16_t
#define SINGLE_IDX_MAX          0x8000
#define SINGLE_BUF_MAX          0xffff
#endif

The middle case will limit us to a min_keys of 32768 but that should be enough for the OpenMP case. It halves the total memory use for the buffer, which is important at many salts. Best would be to pick type at run-time: Half-slow OpenCL formats like WPA-PSK runs just fine at 32768 but for faster ones it'd be a bad limit.

magnumripper commented 4 years ago

Some discussion about the log2 (or not): https://github.com/openwall/john/issues/4110#issuecomment-679109274

openwall / john

Single mode performance #3467