openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Dig deeper in pkzip format's one-or-two byte checksums #4571

Closed: magnumripper closed this 3 years ago

magnumripper commented 3 years ago

See #4541 and #4300

I intend to look into this soon. I'm thinking these checksums are likely used somewhere for something other than our early rejection when cracking them - so there should be code somewhere that "does the right thing".

Sadly, zipinfo does not list this checksum at all. It does list the CRC, but the checksum we're interested in here is a separate one that is known without decompressing the whole file (hence our chance to reject early).

As noted in https://github.com/openwall/john/issues/4300#issuecomment-676179682, normal zipinfo/unzip tools don't seem to complain about a 2-byte checksum error. So are they always expecting a 1-byte checksum, or do they know just what we are missing here? Fortunately I believe there's a lot of open source to look at, in addition to fairly extensive documentation about the file format.

solardiz commented 3 years ago

normal zipinfo/unzip tools don't seem to complain about a 2-byte checksum error. So are they always expecting a 1-byte checksum, or do they know just what we are missing here?

It's also possible they disregard this checksum, with unzip only checking the full CRC available post decompression. Just a guess. We definitely need to actually take a look. Also run some tests of those other tools (and with closed source ones, too) using archives with (partially) corrupted checksums.

magnumripper commented 3 years ago

According to APPNOTE.TXT, "Versions of PKZIP prior to 2.0 used a 2 byte CRC check; a 1 byte CRC check is used on versions after 2.0. This can be used to test if the password supplied is correct or not."

On a side note, I'd prefer them saying "on or after 2.0", but they do say "after 2.0", and the phrasing is somewhat different yet equally unclear in older versions of the spec.

So version 2.0 [needed to extract] or later simply means only a 1-byte check is safe. My guess is that @jfoug observed that some newer-version archives did have a 2-byte check and wanted to exploit it for the extra performance, trying to identify them by the efh's. Well, it's rewarding when it works, but not so much when it doesn't.

magnumripper commented 3 years ago

The fix provided in #4574 not only disables 2-byte checks for some archives - it also enables them for others: Prior to this, any archive without any efh fields would get the single-byte check. Maybe the APPNOTE at the time Jim wrote the original code (many years ago!) simply wasn't very clear about the matter.

solardiz commented 3 years ago

Thank you for figuring this out, @magnumripper! I am puzzled, though: PKZIP 2.0 is ancient. As I recall, 2.04g is from 1993. Do you really have (m)any password protected archives created by versions older than that? Or did e.g. this popular version put some older version identifier into the archives (as minimum version needed to extract)?

solardiz commented 3 years ago

a 1 byte CRC check is used on versions after 2.0

Is it possible that by "after" they meant e.g. 2.1+? So the popular PKZIP 2.04g for DOS still used 2-byte? This should be easy to check.

magnumripper commented 3 years ago

I'm pretty sure they mean on or after 2.0, but yes, we're talking "version needed to extract" here (which is also the version we show in the output): even the latest tools may produce such archives, as long as you don't use some feature of a newer date.

From pkware APPNOTE-1.0

After the header is decrypted, the last two bytes in Buffer
should be the high-order word of the CRC for the file being
decrypted, stored in Intel low-byte/high-byte order.  This can
be used to test if the password supplied is correct or not.

From pkware APPNOTE-2.0

After the header is decrypted,  the last 1 or 2 bytes in Buffer
should be the high-order word/byte of the CRC for the file being
decrypted, stored in Intel low-byte/high-byte order.  Versions of
PKZIP prior to 2.0 used a 2 byte CRC check; a 1 byte CRC check is
used on versions after 2.0.  This can be used to test if the password
supplied is correct or not.

From InfoZIP APPNOTE

After the header is decrypted,  the last 1 or 2 bytes in Buffer
should be the high-order word/byte of the CRC for the file being
decrypted, stored in Intel low-byte/high-byte order, or the high-order
byte of the file time if bit 3 of the general purpose bit flag is set.
Versions of PKZIP prior to 2.0 used a 2 byte CRC check; a 1 byte CRC check is
used on versions after 2.0.  This can be used to test if the password
supplied is correct or not.

Now, I think I'll dig just a little bit further into that "or the high-order byte of the file time if bit 3 of the general purpose bit flag is set" clause. We do use the file time for single-byte checks, but I'm not aware of any situation where we use one byte of the CRC (and we never did). I do believe the latest text is the most correct, and that it means to say "either two bytes of CRC or one byte of file time", which is exactly what we do AFAIK.

magnumripper commented 3 years ago

InfoZIP version again. This is based on the PKZIP version but "unofficially corrected and extended by Info-ZIP".

          Bit 3: If this bit is set, the fields crc-32, compressed
                 size and uncompressed size are set to zero in the
                 local header.  The correct values are put in the
                 data descriptor immediately following the compressed
                 data.  (Note: PKZIP version 2.04g for DOS only
                 recognizes this bit for method 8 compression, newer
                 versions of PKZIP recognize this bit for any
                 compression method.)
                [Info-ZIP note: This bit was introduced by PKZIP 2.04 for
                 DOS. In general, this feature can only be reliably used
                 together with compression methods that allow intrinsic
                 detection of the "end-of-compressed-data" condition. From
                 the set of compression methods described in this Zip archive
                 specification, only "deflate" meets this requirement.
                 Especially, the method STORED does not work!
                 The Info-ZIP tools recognize this bit regardless of the
                 compression method; but, they rely on correctly set
                 "compressed size" information in the central directory entry.]

magnumripper commented 3 years ago

From actual source code (unzip 5.52)

    /* If last two bytes of header don't match crc (or file time in the
     * case of an extended local header), back up and just copy. For
     * pkzip 2.0, the check has been reduced to one byte only.
     */
#ifdef ZIP10
    if ((ush)(c0 | (c1<<8)) !=
        (z->flg & 8 ? (ush) z->tim & 0xffff : (ush)(z->crc >> 16))) {
#else
    c0++; /* avoid warning on unused variable */
    if ((ush)c1 != (z->flg & 8 ? (ush) z->tim >> 8 : (ush)(z->crc >> 24))) {
#endif

I find it interesting that there's an #ifdef here as opposed to some check of archive version. I take it unzip(1) version 2.0 and above will only check a single byte regardless of what version was used for creating the archive.

magnumripper commented 3 years ago

OK this is also interesting: Our format will know both the CRC and the timestamp, but it doesn't try to understand which one to use for early reject - it checks both:

            /* if the hash is a 2 byte checksum type, then check that value first */
            /* There is no reason to continue if this byte does not check out.  */
            if (salt->chk_bytes == 2 && C != (e&0xFF) && C != (e2&0xFF))
                goto Failed_Bailout;

            C = PKZ_MULT(*b++,key2);
#if 1
            // https://github.com/openwall/john/issues/467
            // Fixed, JimF.  Added checksum test for crc32 and timestamp.
            if (C != (e>>8) && C != (e2>>8))
                goto Failed_Bailout;
#endif

(e is the CRC, e2 is the timestamp).

This means we can get better early rejection, provided we get this right.

magnumripper commented 3 years ago

The current hash format doesn't include the "version needed to extract" though, so we need to think carefully about how to fix this.

magnumripper commented 3 years ago

The #467 history is interesting in light of my newer insights.

magnumripper commented 3 years ago

Without changing the hash format, I guess zip2john would need to fill in the correct checksum (i.e. either the CRC or the timestamp) in both fields. Or just in the first field, and have the format never bother looking in the second field.

Just in case we don't get this 100% right now (although I think we can) I think it would be nice to change the format to $pkzip3$ and start adding the "version needed to extract" as well as that flag bit 3 (in case it's not there already, not sure yet).

Edit: We could even make it $pkzip%u\$ with %u being "version needed to extract" as-is (so e.g. 10 or 20 for v1.0 or v2.0).

magnumripper commented 3 years ago

Sum-up this far. Version is the one "needed to extract", TS is timestamp, CRC is the CRC-32 and CS is whatever early-reject checksum we have, which is 8 or 16 bits of one of the two former.

Current situation:

This means the current situation is 100% correct in terms of cracking, but suboptimal in terms of early rejection.

To optimize early rejection:

solardiz commented 3 years ago

If we introduce a new zip2john output format, then let's encode more into it just in case. So continue to always include 16 bits for both TS and CRC. Add the version needed to extract. Add the entire flags field, not just bit 3. Add all EFH values just in case. I think these are generally not security-sensitive, so let's have them. (Ideally, we'd avoid having the actual encrypted data in there, but we can't.)

magnumripper commented 3 years ago

Just brainstorming here, I don't really expect anyone to read this.

Current format (from source code comments):

 * filename:$pkzip2$C*B*[DT*MT{CL*UL*CR*OF*OX}*CT*DL*CS*TC*DA]*$/pkzip2$   (new format, with 2 checksums)
 * All numeric and 'binary data' fields are stored in hex.
 *
 * C   is the count of hashes present (the array of items, inside the []  C can be 1 to 8.).
 * B   is number of valid bytes in the checksum (1 or 2).  Unix zip is 2 bytes, all others are 1 (NOTE, some can be 0)
 * ARRAY of data starts here
 *   DT  is a "Data Type enum".  This will be 1 2 or 3.  1 is 'partial'. 2 and 3 are full file data (2 is inline, 3 is load from file).
 *   MT  Magic Type enum.  0 is no 'type'.  255 is 'text'. Other types (like MS Doc, GIF, etc), see source.
 *     NOTE, CL, DL, CRC, OFF are only present if DT != 1
 *     CL  Compressed length of file blob data (includes 12 byte IV).
 *     UL  Uncompressed length of the file.
 *     CR  CRC32 of the 'final' file.
 *     OF  Offset to the PK\x3\x4 record for this file data. If DT == 2, then this will be a 0, as it is not needed, all of the data is already included in the line.
 *     OX  Additional offset (past OF), to get to the zip data within the file.
 *     END OF 'optional' fields.
 *   CT  Compression type  (0 or 8)  0 is stored, 8 is deflated.
 *   DL  Length of the DA data.
 *   CS  2 bytes of checksum data.
 *   TC  2 bytes of checksum data (from timestamp)
 *   DA  This is the 'data'.  It will be hex data if DT == 1 or 2. If DT == 3, then it is a filename (name of the .zip file).
 * END of array item.  There will be C (count) array items.
 * The format string will end with $/pkzip2$

The comment "Unix zip is 2 bytes, all others are 1 (NOTE, some can be 0)" is, as we now know, incorrect. The correct wording is "Pre-v2.0 zip is 2 bytes, all others are 1", and zip2john would never output 0 there.

What we could do is this:

magnumripper commented 3 years ago

No wait, here's the original format that we still support:

 * filename:$pkzip$C*B*[DT*MT{CL*UL*CR*OF*OX}*CT*DL*CS*DA]*$/pkzip$

We could simply revert to outputting that format, just with the correct CS (sometimes half the CRC, sometimes the timestamp). I think this is the way to go.

magnumripper commented 3 years ago

OK, so a problem is that Jim defined the one- or two-byte checking at archive level in the hash input format, while it actually needs to be enumerated for each file in the archive. Duh. We could work around it by using single-byte if any of the files are single-byte, but maybe it's time to move on to a new format.

If we introduce a new zip2john output format, then let's encode more into it just in case. So continue to always include 16 bits for both TS and CRC. Add the version needed to extract. Add the entire flags field, not just bit 3. Add all EFH values just in case. I think these are generally not security-sensitive, so let's have them. (Ideally, we'd avoid having the actual encrypted data in there, but we can't.)

Definitely. While extraneous data could be bad for some formats (making actually equal non-hashes unequal in terms of john.pot reading), that should not be the case here, so if we end up going that route I'll be sure to include everything remotely usable, at the individual-file level when applicable.

I need to sleep on this though. Many good things have been the outcome of a good sleep.

magnumripper commented 3 years ago

Anyway, I have already established that once we have this fixed, we'll end up with better performance in most cases, not worse. Me like.