Add SSE support for RAR5 (PBKDF2-HMAC-SHA256)

magnumripper commented 10 years ago

We can probably get another 2-3x boost with SSE but the 2x16 extra iterations needs some custom code. Either we make a 'local' version of it, or add optional "rar5 extras" to the shared code.

jfoug commented 10 years ago

I do not know the algorithm for rar5. Can you do a perl-like pseudo code in a couple lines? I fully understand the hmac and stuff from PBKDF2 and how to make it go 1/2 away (since I authored that speedup when I did mscash2), but not sure what the 2x16 extra iterations means.

The pbkdf2 code has become a bit fragile already, since it builds in so many ways, and sometimes has to build non-sse2 even though it is for an sse build, or build to do both sse and flat. But if we can make some tiny changes, it would be better to use the common code, than to fork off a rar specific. Using #defines is certainly an option, since it is in header files. That way, we can add the rar stuff, and have 0 impact on existing format, which should hopefully keep from breaking anything.

Would the pbkdf2 header files help in non-sse2 mode (again, I do not know rar5 format, so I do not know if the existing code is close to optimal or not).

magnumripper commented 10 years ago

It's normal PBKDF2 but with some extra iterations in the end. First, 32768 iterations result in Key. Then, another 16 iterations (without restarting) result in HashKeyValue and yet another 16 iterations result in PswCheckValue.

So HashKeyValue could be calculated by itself by doing 32784 iterations, or PswCheckValue by doing 32800. We obviously don't want to do it that way though :-)

magnumripper commented 10 years ago

The current code looks like this

static void rar5kdf(unsigned char *Pwd, size_t PwdLength,
            unsigned char *Salt, size_t SaltLength,
            unsigned char *Key, unsigned char *V1, unsigned char *V2, int Count)
{
    unsigned char SaltData[MaxSalt+4];
    unsigned char U1[SHA256_DIGEST_SIZE];
    unsigned char U2[SHA256_DIGEST_SIZE];
    unsigned char Fn[SHA256_DIGEST_SIZE]; // Current function value.
    int i;
    int j;
    int k;
    int CurCount[] = { Count - 1, 16, 16 };
    unsigned char *CurValue[] = { Key, V1, V2 };

    memcpy(SaltData, Salt, Min(SaltLength,MaxSalt));

    SaltData[SaltLength + 0] = 0; // Salt concatenated to 1.
    SaltData[SaltLength + 1] = 0;
    SaltData[SaltLength + 2] = 0;
    SaltData[SaltLength + 3] = 1;

    // First iteration: HMAC of password, salt and block index (1).
    hmac_sha256(Pwd, PwdLength, SaltData, SaltLength + 4, U1);
    memcpy(Fn, U1, sizeof(Fn)); // Function at first iteration.

    for (i = 0; i < 3; i++) // For output key and 2 supplementary values.
    {
        for (j = 0; j < CurCount[i]; j++) {
            // U2 = PRF(P, U1)
            hmac_sha256(Pwd, PwdLength, U1, sizeof(U1), U2);
            memcpy(U1, U2, sizeof(U1));
            for (k = 0; k < sizeof(Fn); k++) // Function ^= U.
                Fn[k] ^= U1[k];
        }
        memcpy(CurValue[i], Fn, SHA256_DIGEST_SIZE);
    }
}

That outer loop is confusing, unrolling it makes it easier to see:

static void rar5kdf(unsigned char *Pwd, size_t PwdLength,
            unsigned char *Salt, size_t SaltLength,
            unsigned char *Key, unsigned char *V1, unsigned char *V2, int Count)
{
    unsigned char SaltData[MaxSalt+4];
    unsigned char U1[SHA256_DIGEST_SIZE];
    unsigned char U2[SHA256_DIGEST_SIZE];
    unsigned char Fn[SHA256_DIGEST_SIZE]; // Current function value.
    int j;
    int k;

    memcpy(SaltData, Salt, Min(SaltLength,MaxSalt));

    SaltData[SaltLength + 0] = 0; // Salt concatenated to 1.
    SaltData[SaltLength + 1] = 0;
    SaltData[SaltLength + 2] = 0;
    SaltData[SaltLength + 3] = 1;

    // First iteration: HMAC of password, salt and block index (1).
    hmac_sha256(Pwd, PwdLength, SaltData, SaltLength + 4, U1);
    memcpy(Fn, U1, sizeof(Fn)); // Function at first iteration.

    for (j = 0; j < Count - 1; j++) {
        // U2 = PRF(P, U1)
        hmac_sha256(Pwd, PwdLength, U1, sizeof(U1), U2);
        memcpy(U1, U2, sizeof(U1));
        for (k = 0; k < sizeof(Fn); k++) // Function ^= U.
            Fn[k] ^= U1[k];
    }
    memcpy(Key, Fn, SHA256_DIGEST_SIZE);

    for (j = 0; j < 16; j++) {
        // U2 = PRF(P, U1)
        hmac_sha256(Pwd, PwdLength, U1, sizeof(U1), U2);
        memcpy(U1, U2, sizeof(U1));
        for (k = 0; k < sizeof(Fn); k++) // Function ^= U.
            Fn[k] ^= U1[k];
    }
    memcpy(V1, Fn, SHA256_DIGEST_SIZE);

    for (j = 0; j < 16; j++) {
        // U2 = PRF(P, U1)
        hmac_sha256(Pwd, PwdLength, U1, sizeof(U1), U2);
        memcpy(U1, U2, sizeof(U1));
        for (k = 0; k < sizeof(Fn); k++) // Function ^= U.
            Fn[k] ^= U1[k];
    }
    memcpy(V2, Fn, SHA256_DIGEST_SIZE);
}

jfoug commented 10 years ago

Is Key, V1 and V2 extracted from the rar file? If so, why the hell do we care about V1/V2 ??? Key is fine, and simply requires PBKDF2 (normal). Skipping the last 32 hmacs will not speed things up (much), BUT it would bring this format back into being just a simple normal hash.

magnumripper commented 10 years ago

No, Key is the AES key so we'd have the same deflate/early rejection problems as with pkzip and rar3. I'm not sure what V1 (HashKeyValue) is for, we don't use it (yet). Some files (hopefully most) supply a password verifier that we can check against V2 (PswCheckValue) so if we have it we can verify the password without decrypting/deflating at all.

So if we have such password verifier and calculate V2, we're back with a normal hash and normal cmp_*() functions. If we don't, we need to decrypt, deflate and calculate CRC or whatever. The latter is not supported at all yet and hopefully there will be little need.

So neither Key nor V1 is used at all right now but we can't skip calculating them anyway.

jfoug commented 10 years ago

Ok, well, I think we should be able to do those last 16 (or 32) iterations inside crypt_all after calling the pbkdf2 code. We may have to make a change or 2 to be able to pass in the output buffer (if it is not being passed in today). But I think we can use the common code without (much) change, then put this extra rar-custom stuff only in the rar code.

I do not have time currently, I am getting ready for a trip with my daughter, and I will not be doing any computer stuff more advanced than checking my smart phone a couple times a day.

magnumripper commented 10 years ago

Meanwhile I'm concentrating on doing an OpenCL version of it. Chances are it will be faster than RAR3 because PBKDF2 is a lot nicer to a GPU than the dreaded RAR3 code (which is password length dependant and impossible to run 32-bit aligned without sacrificing memory use).

magnumripper commented 10 years ago

Hmm wait a minute. The current version of the format never use Key nor V1, so we could just as well calculate only V2 (PswCheckValue) by doing a completely normal PBKDF2 using (count + 32) and be done with it!

magnumripper commented 10 years ago

we could just as well calculate only V2 (PswCheckValue) by doing a completely normal PBKDF2 using (count + 32) and be done with it!

OpenCL format committed, using that approach. Works like a champ.

jfoug commented 10 years ago

I started on rar5 (CPU), without the change to only compute V2. I will probalby back out some of these changes, and only do the V2.

NOTE, there is NO sse2 yet. I simply have started to setup to use pbkdf2_hmac_sha256.h code. It is only doing CPU now. Also, I had to remove some openmp defines (such as multiple keys in non-CPU version).

fd08f82

This code was about 1 / s faster on my test box (non-omp). It went from 34.4 to 35.7 It would probably improve more, by simply doing pbkdf2_sha256(x,x,x,x,count+32,key, 32,0)

To get Key, then V1, V2, I had to expose some additional state information, i.e. the final context value without being xor'd into Key. I do not think I am taking that out of the code. It 'could' be needed elsewhere. It certainly would be needed here, IF we needed all 3 values.

jfoug commented 10 years ago

Here is the entire (CPU) version of the function, if we only capture V2:

static void rar5kdf(unsigned char *Pwd, size_t PwdLength,
            unsigned char *Salt, size_t SaltLength,
            unsigned char *Key, unsigned char *V1, unsigned char *V2, int Count)
{
    pbkdf2_sha256(Pwd, PwdLength, Salt, SaltLength, Count+32, V2, SHA256_DIGEST_SIZE, 0);
}

Getting THAT ported to SSE2 should be trivial. We will have to pass in 4 keys. It probably would be MUCH easier to dump the kdf function, and simply do this within crypt_all like a 'normal' function would do. Then we can do index += idx_granularity for SSE data.

Good stuff.

jfoug commented 10 years ago

$ ../run/john -test=5 -form=rar5 Benchmarking: RAR5 [PBKDF2-SHA256 32/32]... DONE Raw: 35.8 c/s real, 35.8 c/s virtual

$ ../run/john -test=5 -form=rar5 Benchmarking: RAR5 [PBKDF2-SHA256 128/128 SSE4.1 4x]... DONE Raw: 140 c/s real, 141 c/s virtual

OMP busted. I will get that fixed and check this in.

jfoug commented 10 years ago

472626f

This checks in the SSE2 code. NOTE, OMP is broken. It dies on the 2nd loop (index 4). I have looked, but can NOT see a MT issue. Can someone please see if they can find the bug here. The code runs perfectly fine when run in OMP_NUM_THREADS=1 mode.

jfoug commented 10 years ago

With this version, we are OMP compatible also.

3f2d0a6

jfoug commented 10 years ago

$ ../run/john -test=5 -form=rar5 Benchmarking: RAR5 [PBKDF2-SHA256 32/32]... DONE Raw: 35.8 c/s real, 35.8 c/s virtual

$ ../run/john -test=5 -form=rar5 Benchmarking: RAR5 [PBKDF2-SHA256 128/128 SSE4.1 4x]... DONE Raw: 140 c/s real, 141 c/s virtual

$ OMP_NUM_THREADS=8 ../run/john -test -form=rar5 Will run 8 OpenMP threads Benchmarking: RAR5 [PBKDF2-SHA256 128/128 SSE4.1 4x]... (8xOMP) DONE Raw: 574 c/s real, 75.7 c/s virtual

And on my 64 bit VM:

$ OMP_NUM_THREADS=1 ../run/john -test=5 -form=rar5Warning: OpenMP is disabled; a non-OpenMP build may be faster Benchmarking: RAR5 [PBKDF2-SHA256 128/128 SSE4.1 4x]... DONE Raw: 120 c/s real, 120 c/s virtual

$ OMP_NUM_THREADS=4 ../run/john -test=5 -form=rar5 Will run 4 OpenMP threads Benchmarking: RAR5 [PBKDF2-SHA256 128/128 SSE4.1 4x]... (4xOMP) DONE Raw: 399 c/s real, 105 c/s virtual

jfoug commented 10 years ago

I have also tested x86 code, in OMP mode. Works fine.

Is there any way within ./configure to build a non-SSE version? What I did was uncomment the #define PBKDF2_HMAC_SHA256_ALSO_INCLUDE_CTX and then changed the for loop in crypt_all to do a ++ instead of += MAX_CRYPTS_KEY That builds a x86 (still screen shows sse2, but it is not). I know there must be a better way, but i could not get it working. --disable-native-tests still builds sse2 for my 64 bit system.

But as the code stands, it is faster, does SSE2, is OMP 'safe'. I think this issue is done.

magnumripper commented 10 years ago

Good stuff :+1:

openwall / john

Add SSE support for RAR5 (PBKDF2-HMAC-SHA256) #721