mholt / archiver

DEPRECATED. Please use mholt/archives instead.
https://github.com/mholt/archives
MIT License
4.45k stars 392 forks source link

DMG/`bzip2` file read error: `bzip2: corrupted input: invalid stream magic` #405

Closed rgmz closed 5 months ago

rgmz commented 5 months ago

What version of the package or command are you using?

v4.0.0-alpha.8

What are you trying to do?

Decompress a .dmg file detected as .bzip2.

$ file -i BitcoinTemplate.dmg
BitcoinTemplate.dmg: application/x-bzip2; charset=binary

Note that there is a warning about "trailing garbage" when decompressing it using the CLI, however, it still succeeds.

$ bzip2 -d BitcoinTemplate.dmg
bzip2: Can't guess original name for BitcoinTemplate.dmg -- using BitcoinTemplate.dmg.out

bzip2: BitcoinTemplate.dmg: trailing garbage after EOF ignored

What steps did you take?

  1. Download https://github.com/cicorias/bitcoin/raw/a4f2c8419f5a8b47c4b48e962b4ff02885164c9b/contrib/BitcoinTemplate.dmg
  2. Run the following test

    func TestDmg(t *testing.T) {
        f, err := os.Open("/tmp/BitcoinTemplate.dmg")
        require.NoError(t, err)
    
        format, f2, err := archiver.Identify("/tmp/BitcoinTemplate.dmg", f)
        require.NoError(t, err)
    
        rc, err := format.(archiver.Decompressor).OpenReader(f2)
        require.NoError(t, err)
    
        _, err = io.ReadAll(rc)
        require.NoError(t, err)
    
        assert.NoError(t, rc.Close())
    }
  3. Receive error from io.ReadAll
     Error:         Received unexpected error:
                     bzip2: corrupted input: invalid stream magic

What did you expect to happen, and what actually happened instead?

I expected the file to be identified and decompressed successfully.

How do you think this should be fixed?

Unsure.

Please link to any related issues, pull requests, and/or discussion

https://github.com/trufflesecurity/trufflehog/issues/2935

Bonus: What do you use archiver for, and do you find it useful?

mholt commented 5 months ago

As the docs say:

If stream is non-nil then the returned io.Reader will always be non-nil and will read from the same point as the reader which was passed in; it should be used in place of the input stream after calling Identify() because it preserves and re-reads the bytes that were already read during the identification process.

You're passing in a stream that gets read, but ignoring the returned stream (_) causing corruption to the reader.

rgmz commented 5 months ago

Hi Matt,

That appears to be a typo on my part from trying to condense the real messy code into a neat reproducer.

The error still occurs when the returned stream isn't ignored. I've updated the example accordingly.

mholt commented 5 months ago

Oh, ok. I will take a look.

mholt commented 5 months ago

I get the same error with that file, but I also do when just using the underlying bzip2 lib directly. I would open the issue with @dsnet -- he's quite smart and can probably tell you whether it's a bug in bzip2 package or possibly this DMG is malformed.

rgmz commented 5 months ago

Will do. Thank you for the insight!

dsnet commented 5 months ago

Hey, thanks for the ping.

It seems to me that the Go stdlib compress/bzip2 package, github.com/dsnet/compress/bzip2 package, and bzip2 CLI tool all agree that the input is faulty, but they all seem to emit the same number of bytes before reporting an error.

This was my test program:

1447936 865689b5f5a000820a8d372f2e09f154 bzip2 data invalid: bad magic value in continuation file
1447936 865689b5f5a000820a8d372f2e09f154 bzip2: corrupted input: invalid stream magic
1447936 865689b5f5a000820a8d372f2e09f154 bzip2: (stdin): trailing garbage after EOF ignored

Strictly speaking, the CLI does not report an error (in terms of a non-zero status code, but does spit out a warning to stderr).

Given that data is at least produced, perhaps this requires archiver to ignore a corrupted input error? That said, I don't see you could possibly distinguish between an actual data corruption error versus some other semantic data that happens to be at the end.

mholt commented 5 months ago

Thanks Joe -- that's a smart idea for troubleshooting.

Hmm, yeah, that is tricky then. The file is kind of invalid then? I am not sure how we are going to distinguish that. For now I'm inclined to leave archiver unchanged in this regard and reject faulty inputs. Unless another possibility is introduced.

Thanks again for chiming in!

rgmz commented 5 months ago

That said, I don't see you could possibly distinguish between an actual data corruption error versus some other semantic data that happens to be at the end.

Funny you should say that. Taking a closer look at the file, it seems like the trailing data is a quirk of .dmg files embedding their PLIST at the end. I found the same thing in older bzip2-based DMG files like transmission-2.61.dmg and newer zlib-based ones like Rectangle0.80.dmg.

$ strings BitcoinTemplate.dmg
...
@@P@
BZh11AY&SYe7
B+"T
@<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
        <key>resource-fork</key>
        <dict>
                <key>blkx</key>
                <array>
                        <dict>
                                <key>Attributes</key>
                                <string>0x0050</string>
... (continues for about 100 more lines)

It seems like DMG files can be a variety of compression formats.

The first noteable fact about the DMG file format is, that there is no DMG file format. DMGs come in a variety of sub-formats, corresponding to the different tools which create them, and their compression schemes. ... The common denominator of most of these is the existence of a 512-byte trailer at the end of the file. This trailer is identifiable by a magic 32-bit value, 0x6B6F6C79, which is "koly" in ASCII.

http://www.newosxbook.com/DMG.html

$ strings Rectangle0.80.dmg | tail -n 2
</plist>
koly

$ strings transmission-2.61.dmg | tail -n 2
</plist>
koly

$ strings BitcoinTemplate.dmg | tail -n 2
</plist>
koly

Edit: Interestingly, some DMG files are structured in a way where there's no compression/archive headers.

# https://www.jetbrains.com/idea/download/?section=mac
$ file -i ideaIU-2024.1.2.dmg
ideaIU-2024.1.2.dmg: application/octet-stream; charset=binary
rgmz commented 5 months ago

Hmm, yeah, that is tricky then. The file is kind of invalid then? I am not sure how we are going to distinguish that. For now I'm inclined to leave archiver unchanged in this regard and reject faulty inputs. Unless another possibility is introduced.

@mholt I agree with rejecting faulty inputs, it wouldn't make sense to change this behaviour in general.

It could make sense to add some special handling for for *.dmg files, such as special logic to ignore the trailing XML + koly during read, or an error saying "DMG files not supported" to prevent confusion. That's obviously up to you.

mholt commented 5 months ago

Sure; but we can only do that if we can seek to the end of the stream -- turns out only a few streams can do that (mainly files and buffers). So we'd only be able to offer a helpful error message in some cases I suppose, but ideally we'd also couple this with a check on the error type rather than relying on string comparison.

rgmz commented 5 months ago

Sure; but we can only do that if we can seek to the end of the stream -- turns out only a few streams can do that (mainly files and buffers).

So possible but not easy or reliable. The filename is probably the easiest/most reliable way to identify a *.dmg file. ¯\(ツ)

but ideally we'd also couple this with a check on the error type rather than relying on string comparison.

Can you elaborate what you mean by this? I'm happy to open a PR addressing this, with a little guidance.

dsnet commented 5 months ago

Can you elaborate what you mean by this?

@mholt is referring to the ability to distinguishing error values without resorting to doing some form of string parsing on error.Error. The github.com/dsnet/compress package has an Error interface with an IsCorrupted boolean method. The error API was designed before the invention the more recent errors.Is APIs, but suspect that they are compatible.

mholt commented 5 months ago

Ah yeah that's what I mean. If there's already a specific error value/type then that's perfect, we can probably roll with that. And just seek when it is a Seeker.