openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

idea how to merge nt and nt-long, and implement other multi-limb hashes based on Merkle–Damgård construction #5534

Open AlekseyCherepanov opened 1 month ago

AlekseyCherepanov commented 1 month ago

Hashes based on the Merkle–Damgård construction are computed in blocks (called limbs in john's sources): the input is split into blocks of fixed size, then for each block the compression function is applied to the old state and the block, producing a new state; a constant IV is used as the old state for the first block. (Also, the length is encoded into the last block, but that is not important for this idea.) This applies to raw-md4 and nt, raw-md5, raw-sha1, raw-sha256, and so on.
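For illustration, a minimal sketch of that structure in Python; compress() stands in for the scheme's compression function (hashlib does not expose it) and iv for the constant initial state:

```python
import struct

def md_pad(message: bytes, block_size: int = 64) -> bytes:
    # MD4/MD5-style padding: a 0x80 byte, zeros, then the 64-bit
    # little-endian bit length (the length encoding mentioned above)
    padded = message + b"\x80"
    padded += b"\x00" * (-(len(padded) + 8) % block_size)
    return padded + struct.pack("<Q", 8 * len(message))

def md_hash(message: bytes, compress, iv, block_size: int = 64):
    data = md_pad(message, block_size)
    state = iv  # the constant IV is the "old state" for the first limb
    for off in range(0, len(data), block_size):
        state = compress(state, data[off:off + block_size])  # one limb
    return state
```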

Fast and simple formats support lengths fitting into one block. Such hashes also allow reversing of rounds because the IV is constant. With two or more limbs, it is not possible to reverse rounds (as far as I know).

A simple format supporting lengths above one block might omit reversing, but it would be slower for short candidates.

There might be two formats: currently there are nt and nt-long, and the user chooses the correct one. It is not easy to choose automatically, because the choice depends on the length of candidates, which in some cases is not known beforehand.

Another approach is to allow a binary to be found in db_main by both the reversed hash and the regular hash. It would require changes to the formats interface. It would also enlarge the database, hurting cases like loading HIBP. (It is possible to have separate bitmaps for the variants, so in theory, using additional memory, it might be implemented without additional collisions.)

And I got the idea of a hybrid approach that can be enclosed in a format. Reversing is just a transformation on a full hash: we may apply it to any particular digest's value. Formats already apply it when loading hashes. Moreover, formats apply reversing in cmp_one() on the full hash computed by OpenSSL's scalar hashing. So we may use reversing this way: short candidates would be hashed with the shortened compression function, while longer candidates would be hashed with the full compression function and then reversing would be applied for lookup. This way short candidates would be as fast as with a special format, while longer candidates would be only slightly slower than usual.
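The dispatch could look like the following sketch (purely structural; the helper names are hypothetical, and for NT one block holds up to 27 UTF-16 characters):

```python
# Structural sketch only; the helpers are hypothetical stand-ins for format
# internals. reverse_rounds() is the same pure transformation that is applied
# to binaries at load time and in cmp_one(), so both paths yield values
# comparable against the reversed binaries stored in db_main.
def hybrid_crypt(candidate):
    if len(candidate) <= ONE_BLOCK_MAX:              # e.g. 27 for NT
        # fast path: compression with the final rounds left undone
        return compress_without_final_rounds(candidate)
    # slow path: full multi-limb hash, then reverse the digest's value
    return reverse_rounds(full_hash(candidate))
```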

I think the hybrid approach provides a viable trade-off between speed and usability. It would be even better with the ability to disable reversing at run-time to speed up longer candidates.

125 ASCII chars for NT require 5 blocks of hashing (123 chars require 4 blocks). Everything above 1 block might be packed together for computations. So instead of storing by number of blocks (all 2-block candidates together, all 3-block candidates together, and so on), it is possible to have a mixed sequence of limbs with marked ends. For instance, with vector size 2 it is possible to pack one 4-block candidate together with two 2-block candidates and hash only 4 limbs, extracting 1 end after 2 limbs and another 2 ends after 2 more limbs (see the sketch below). The simpler approach would be to compute 2 limbs for the 2-block candidates and then 4 for the lone 4-block candidate. But the described mixed storage seems overcomplicated. Also, 125 chars for raw-md5 fit into 3 blocks, so mixed storage might be less profitable there.
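A toy scheduler, just to make the accounting concrete (pure bookkeeping on limb counts, no real hashing):

```python
from collections import deque

# Lanes of a width-2 vector are refilled from a queue of candidates given
# as limb counts; a lane's result is extracted whenever its candidate's
# last limb has been hashed.
def schedule(limb_counts, width=2):
    todo = deque(sorted(limb_counts, reverse=True))
    lanes = [todo.popleft() if todo else 0 for _ in range(width)]
    steps, ends = 0, []
    while any(lanes) or todo:
        for i in range(width):
            if lanes[i] == 0 and todo:
                lanes[i] = todo.popleft()   # refill a free lane mid-sequence
        steps += 1                          # one vectorized compression call
        for i in range(width):
            if lanes[i]:
                lanes[i] -= 1
                if lanes[i] == 0:
                    ends.append(steps)      # extract this lane's result

    return steps, ends

# schedule([4, 2, 2]) -> (4, [2, 4, 4]): 4 vector steps, with one end
# extracted after step 2 and two more after step 4, versus 2 + 4 = 6 steps
# without mixing.
```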

With or without reversing, hashing candidates with a mixed number of blocks requires some extra buffering to reduce the number of incomplete vectors. (Better length-based queuing is not supported currently.)

magnumripper commented 1 month ago

IMO our biggest problem is md5crypt (both SIMD and OpenCL). A max length of 15 is seriously limiting. Alas, I fail to understand that code so I simply can't fix it. I have tried many times to wrap my head around it. It seems I can't!

> longer candidates would be hashed with the full compression function and then reversing would be applied for lookup

We do exactly this already, in the NT kernel, and the multi-block kernel is surprisingly fast.

The bigger problem is lock-stepping - like you say, we need the same number of blocks for all candidates in a batch (not necessarily a GWS batch; we might get away with the size of an LWS. A warp? I can't remember). For SIMD, heterogeneous (in terms of block count needed) batches are a much harder problem, as they don't merely run slower but fail completely unless a lot of code is added around it. We actually do that in several formats, but not in any raw ones. It's not rocket science, but it eats performance and is a proper PITA to code.

Back to OpenCL: I think a KISS hybrid NT-opencl could simply set a boolean in set_key() when any candidate in the current batch is over a single block's length. Any such batch would call the slightly slower kernel. That trivial change should take us a long way, but if memory serves, I tried it and somehow the results must have been disappointing enough that I ended up not using it? I can't remember - I need to revisit the old discussions.
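In host-code terms, the idea is roughly this (a toy sketch, not john's actual OpenCL host code; all names here are illustrative):

```python
# Hypothetical sketch of the KISS dispatch: set_key() records whether any
# key in the batch exceeds one block, and crypt_all() picks the kernel.
# ONE_BLOCK_MAX, single_block_kernel and multi_block_kernel are stand-ins.
class HybridNT:
    def __init__(self, mkpc):
        self.keys = [b""] * mkpc
        self.any_long = False            # reset between batches

    def set_key(self, key, index):
        self.keys[index] = key
        self.any_long |= len(key) > ONE_BLOCK_MAX

    def crypt_all(self, count):
        # the whole batch takes the slightly slower path if any key is long
        kernel = multi_block_kernel if self.any_long else single_block_kernel
        return kernel(self.keys[:count])
```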

AlekseyCherepanov commented 1 month ago

Scalar fallbacks might be added to all formats to handle lengthy candidates. Then md5crypt would not skip candidates (though remaining slow for them) and the separate md5crypt-long would become obsolete.

I forgot an available optimization: multi-block candidates starting with the same block can hash it once for all of them. Some modes like --mask='?w?a?a' can benefit from this greatly (especially by keeping the hashed start between crypt_all() calls). But attacks like --mask='?a?a?w' cannot use it.
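A sketch of this reuse using hashlib's midstate copy(), with raw MD5 for illustration:

```python
import hashlib

# With --mask='?w?a?a' the wordlist word comes first, so all candidates
# built from one word share their leading block(s): hash the shared prefix
# once, then fork the saved state per suffix. (Real savings appear only
# when the prefix spans whole blocks; with --mask='?a?a?w' the varying
# characters come first, so nothing is shared.)
def hash_batch(prefix: bytes, suffixes):
    base = hashlib.md5(prefix)      # shared start, hashed once
    digests = []
    for sfx in suffixes:
        h = base.copy()             # cheap copy of the partially hashed state
        h.update(sfx)
        digests.append(h.digest())
    return digests
```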

AlekseyCherepanov commented 1 month ago

I wrote simplified md5crypt pseudocode based on md5crypt-long and my old attempt to implement it in john-devkit. The following is Python-like pseudocode; I did not test it.

```python
import hashlib

def md5_raw(data):
    return hashlib.md5(data).digest()

def md5crypt(pw, s, magic=b"$1$"):  # magic is for standard md5crypt only
    final = md5_raw(pw + s + pw)

    final = md5_raw(pw + magic + s +
        (final * (125 // 16 + 1))[:len(pw)] +  # the "weird" loop
        b"".join(b"\x00" if bit == "1" else pw[:1]  # one byte per bit of
                 for bit in bin(len(pw))[:1:-1]))   # len(pw), LSB first

    for i in range(0, 1000, 2):
        t = b""
        if i % 3: t += s
        if i % 7: t += pw
        final = md5_raw(final + t + pw)
        t = b""
        if (i + 1) % 3: t += s
        if (i + 1) % 7: t += pw
        final = md5_raw(pw + t + final)
    return final
```

The "weird" loop is to fill len(pw) long buffer with the first version of final. I wrote it as over-expansion + truncation by slice.

As far as I understand, the structure of the hashing depends only on the lengths of the candidate and the salt, not on their data. The second md5_raw() appends one byte for each bit of the password's length (a NUL byte for a set bit, the first byte of pw for a clear bit). The main loop adds the salt and/or the password on some iterations, in addition to the guaranteed pw and the current final.

An easy way is to group candidates by exact length. Then all hashing calls would have the same length of data.
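A minimal sketch of such grouping:

```python
from collections import defaultdict

# With a fixed salt, every md5_raw() call inside one bucket then hashes
# messages of identical length, so the whole bucket can run in lock step.
def group_by_length(candidates):
    groups = defaultdict(list)
    for pw in candidates:
        groups[len(pw)].append(pw)
    return groups
```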

The main loop could use bigger groups.

AlekseyCherepanov commented 1 month ago

I find it interesting to think about scalar branching as coding with co-routines. Imagine async/await or Promise + callback for continuation passing, e.g. digest = await md5_raw(...) in JS/Python, or md5_raw(...).then((digest) => {...}) in imaginary Node.js.

While co-routines are possible in C, usually the logic is just inverted to do one iteration over explicit state (as in generate() in external mode). We would need to enqueue hashing and return before the actual hashing happens.

(Curiously, it should be possible to use co-routines in JS or Lua (with LuaJIT) to express ad-hoc dynamic formats, so the scalar logic would be in a scripting language while the hashing would be done in C with SIMD.)

So each candidate would have a state with space for the hash, the length of the data to hash, the data itself, and format-specific state like the current loop's id and the iteration within that loop. The straight-line code is then split into parts that operate on the state and enqueue it for hashing. There should be different queues keyed by the number of blocks in the data. When some queue accumulates a full vector, actual hashing happens. After hashing, the related states are iterated further, like co-routines being resumed. To avoid recursion, an outer loop is needed instead of a direct call on addition to a queue. If all states are waiting for hashing and there are no full vectors, then incomplete vectors should be hashed (probably starting from the most filled, to hopefully fill others).
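A toy sketch of this machinery using Python generators as the co-routines (scalar MD5 stands in for a SIMD limb computation; the queue handling is simplified as noted in the comments):

```python
import hashlib
from collections import defaultdict

VEC = 4  # stand-in for the SIMD vector width

def md5_raw(data):
    return hashlib.md5(data).digest()

# Each candidate's control flow as a generator: every "digest = yield data"
# is the await point where hashing is enqueued. This toy flow has two
# dependent hashes; a real format would be the full algorithm with one
# yield per hash call.
def flow(pw, salt):
    d = yield pw + salt
    d = yield d + pw
    return d

def run(flows):
    queues = defaultdict(list)           # keyed by number of MD5 blocks
    results = {}

    def enqueue(i, gen, data):
        # MD5 pads with at least 9 bytes, so block count is (len+8)//64 + 1
        queues[(len(data) + 8) // 64 + 1].append((i, gen, data))

    for i, gen in enumerate(flows):
        enqueue(i, gen, gen.send(None))  # start each co-routine
    while queues:
        # this toy always drains the fullest queue; a real implementation
        # would prefer full vectors and hash incomplete ones only when all
        # states are stalled, starting from the most filled queue
        k = max(queues, key=lambda q: len(queues[q]))
        batch, rest = queues[k][:VEC], queues[k][VEC:]
        if rest:
            queues[k] = rest
        else:
            del queues[k]
        for i, gen, data in batch:       # stand-in for one SIMD limb call
            try:
                enqueue(i, gen, gen.send(md5_raw(data)))  # resume
            except StopIteration as stop:
                results[i] = stop.value  # final digest for candidate i
    return results

# run([flow(b"pw%d" % i, b"salt") for i in range(8)]) hashes eight
# candidates, grouping equal-block messages into vectors of VEC.
```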

Currently the simd-intrinsics.c interface cannot put results into different vectors. Also, the described approach would unpack vectors after each hashing. More complicated references to results could improve both points. Also, hashing only one block at a time might or might not be desirable (con: it can cause extra repacking for mixed lengths; pro: reuse of first blocks could be implemented on top of it).

Writing in such a style can mix hashing from different iterations of an algorithm, e.g. hashing with different numbers of rounds in one crypt_all() would be possible. But it also means that a separate state is processed for each candidate, while grouping candidates with the same flow structure would use a single state per vector. That might or might not be cheap on a CPU, but GPUs would pay a bigger penalty for it.

magnumripper commented 1 month ago

I think most of your reasoning is valid and indeed interesting - in theory. I'm not sure it would work in real-world formats, because the ones needing it most (any raw hashing in SIMD or OpenCL) are the ones that are most sensitive to any added code. We've seen several times that very minor tweaks that intuitively should be good in fact resulted in regressions.

AlekseyCherepanov commented 1 month ago

> grouping candidates with the same flow structure would use a single state per vector

"Vector" here can be virtual and bigger than hardware SIMD vector. Interleaving is an example source for such vectors used in john.

Another possible use is to wrap each hashing step in the algorithm with its own loop over multiple sets of data with the same flow. Pushing this idea to the extreme, it is possible to hash using OpenCL while keeping the scalar state on the CPU. So we would branch/iterate on the CPU with one state and invoke clEnqueueNDRangeKernel() to hash the intermediate data of all related candidates. The number of candidates should be high to hide the overhead of such calls. I don't have enough experience with OpenCL to tell if it can work for fast hashes. But for slow iterated hashes, this approach would mean a kernel call on each iteration of the big loop, so it should be impractical: the total time in crypt_all() would be much greater than now.
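A sketch of this batched control flow applied to the md5crypt main loop from the pseudocode above (md5_batch() is a hypothetical stand-in for a kernel launch):

```python
import hashlib

# Because md5crypt's flow depends only on the lengths of pw and salt, a
# group of equal-length candidates can share a single scalar control flow:
# the branch decisions run once per step, while every hash call becomes one
# batched call over the whole group. On a GPU, md5_batch() would be one
# clEnqueueNDRangeKernel() per step.
def md5_batch(messages):                 # stand-in for a kernel launch
    return [hashlib.md5(m).digest() for m in messages]

def main_loop_batched(pws, salt, finals):
    for i in range(0, 1000, 2):
        finals = md5_batch([f + (salt if i % 3 else b"")
                              + (pw if i % 7 else b"") + pw
                            for f, pw in zip(finals, pws)])
        finals = md5_batch([pw + (salt if (i + 1) % 3 else b"")
                              + (pw if (i + 1) % 7 else b"") + f
                            for f, pw in zip(finals, pws)])
    return finals
```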

AlekseyCherepanov commented 1 month ago

Hm, I think a format can change its params.max_keys_per_crypt in set_key() and cracker.c would use the new value. At least, reducing the value should be OK, because an increase might overflow the crk_timestamps array (or maybe I just need to investigate the code better).

So a format might be able to use set_key() to pack vectors and hash them, then reduce mkpc to trigger crypt_all() and a possible checkpoint before the full mkpc is reached. Then in clear_keys() the format would restore mkpc to its regular value.

OTOH, with in-format queues for length-driven branching, the format should use this trick only if all queues contain only full vectors. The probability of that is not obvious to me.

And I am even less sure about its usefulness for data-driven branching: for sunmd5 it would mean hashing as soon as one full vector is obtained, then adding more candidates in case of incomplete vectors. So either a few candidates are added and hashing mixes data from different iterations, or a full vector is added to be hashed on its own to catch up with the previous hashing and to complete old vectors opportunistically. Both ways seem more complicated than regular buffering.

AlekseyCherepanov commented 1 month ago

The trick with mkpc might be useful to utilize memory better: a format could allocate one block per candidate (or even less) and just trigger crypt_all() earlier if long candidates consume all of that memory.

AlekseyCherepanov commented 1 month ago

> Another approach is to allow a binary to be found in db_main by both the reversed hash and the regular hash. It would require changes to the formats interface.

Actually it could be done with the existing interface: split() could yield 2 ciphertexts, with marks telling binary() whether to perform reversing or to keep the hash unreversed.