mwotton / lz4hs

Haskell bindings to lz4
BSD 3-Clause "New" or "Revised" License
21 stars 7 forks source link

Decompressed file truncated #5

Closed Reilithion closed 11 years ago

Reilithion commented 11 years ago

When I compress and then decompress some binary data, the decompressed version looks truncated. I've reduced the problem down as far as I can into the following test case:

-- testcase.hs import Data.Maybe import qualified Data.Binary as B import qualified Data.Trie as T

import qualified Data.ByteString.UTF8 as UTF8 import qualified Codec.Compression.LZ4 as LZ4

import qualified Data.ByteString as BS import qualified Data.ByteString.Lazy as LBS

lazyToStrictBS x = BS.concat $ LBS.toChunks x

data GenerationParameters = GenerationParameters { genIterations :: Int , genLength :: Int , genHash :: String , genSite :: String } deriving (Show)

instance B.Binary GenerationParameters where put x = do B.put $ genIterations x B.put $ genLength x B.put $ genHash x B.put $ genSite x get = do i <- B.get l <- B.get h <- B.get s <- B.get return GenerationParameters { genIterations = i, genLength = l, genHash = h, genSite = s }

exampleTrie = let fb = GenerationParameters { genIterations = 1, genLength = 8, genHash = "SHA1", genSite = "facebook.com" } gm = GenerationParameters { genIterations = 9, genLength = 12, genHash = "Tiger", genSite = "gmail.com" } ex = GenerationParameters { genIterations = 3, genLength = 15, genHash = "Whirlpool", genSite = "example.com" } params = [fb, gm, ex] in T.fromList $ map (\gp -> (UTF8.fromString $ genSite gp, gp)) params

main = do B.encodeFile "plain.dat" exampleTrie BS.writeFile "compressed.dat" . fromJust . LZ4.compress . lazyToStrictBS . B.encode $ exampleTrie putStrLn "Wrote compressed.dat" cf <- BS.readFile "compressed.dat" let Just df = LZ4.decompress $ cf BS.writeFile "decompressed.dat" df putStrLn "Wrote decompressed.dat"

-- EOF

And here is what I see after the above runs:

reilithion@cc:/tmp/testcase% diff plain.dat decompressed.dat Binary files plain.dat and decompressed.dat differ reilithion@cc:/tmp/testcase% hexdump -C plain.dat 00000000 02 64 02 01 00 00 00 00 00 00 00 0b 65 78 61 6d |.d..........exam| 00000010 70 6c 65 2e 63 6f 6d 01 00 00 00 00 00 00 00 03 |ple.com.........| 00000020 00 00 00 00 00 00 00 0f 00 00 00 00 00 00 00 09 |................| 00000030 57 68 69 72 6c 70 6f 6f 6c 00 00 00 00 00 00 00 |Whirlpool.......| 00000040 0b 65 78 61 6d 70 6c 65 2e 63 6f 6d 00 02 66 01 |.example.com..f.| 00000050 01 00 00 00 00 00 00 00 0c 66 61 63 65 62 6f 6f |.........faceboo| 00000060 6b 2e 63 6f 6d 01 00 00 00 00 00 00 00 01 00 00 |k.com...........| 00000070 00 00 00 00 00 08 00 00 00 00 00 00 00 04 53 48 |..............SH| 00000080 41 31 00 00 00 00 00 00 00 0c 66 61 63 65 62 6f |A1........facebo| 00000090 6f 6b 2e 63 6f 6d 00 01 00 00 00 00 00 00 00 09 |ok.com..........| 000000a0 67 6d 61 69 6c 2e 63 6f 6d 01 00 00 00 00 00 00 |gmail.com.......| 000000b0 00 09 00 00 00 00 00 00 00 0c 00 00 00 00 00 00 |................| 000000c0 00 05 54 69 67 65 72 00 00 00 00 00 00 00 09 67 |..Tiger........g| 000000d0 6d 61 69 6c 2e 63 6f 6d 00 |mail.com.| 000000d9 reilithion@cc:/tmp/testcase% hexdump -C decompressed.dat 00000000 02 64 02 01 00 00 00 00 00 00 00 0b 65 78 61 6d |.d..........exam| 00000010 70 6c 65 2e 63 6f 6d 01 00 00 00 00 00 00 00 03 |ple.com.........| 00000020 00 00 00 00 00 00 00 0f 00 00 00 00 00 00 00 09 |................| 00000030 57 68 69 72 6c 70 6f 6f 6c 00 00 00 00 00 00 00 |Whirlpool.......| 00000040 0b 65 78 61 6d 70 6c 65 2e 63 6f 6d 00 02 66 01 |.example.com..f.| 00000050 01 00 00 00 00 00 00 00 0c 66 61 63 65 62 6f 6f |.........faceboo| 00000060 6b 2e 63 6f 6d 01 00 00 00 00 00 00 00 01 00 00 |k.com...........| 00000070 00 00 00 00 00 08 00 00 00 00 00 00 00 |.............| 0000007d

mwotton commented 11 years ago

haven't forgotten about this - just a bit occupied over christmas. will have a better look soon.

mwotton commented 11 years ago

(as a note, if you can write these as explicit tests in the testsuite it's a lot easier to get started on fixing)

mwotton commented 11 years ago

ok, have narrowed the bug down to a test case. @thoughtpolice - any idea what's going on? I've added the test to the suite.

thoughtpolice commented 11 years ago

I'm looking at this for the moment, but nothing yet. I went ahead and updated lz4 to a newer revision, but that didn't fix it. I'll try and reproduce a test case in C to narrow down what's going on.

cartazio commented 11 years ago

if i wanted to try to understand this bug, where would i start?

mwotton commented 11 years ago

@cartazio i suspect it's probably a C problem. I'd suggest getting a C case that fails...

thoughtpolice commented 11 years ago

Basically, I'd advise a minimal program that just attempts to decompress the given example buffer from C, and from Haskell. This was about the place I was at a few months ago when I looked into this. I likely still have the C program on my hard drive at home; I'll look when I get off work. The main reason I'd suggest a C program is that it's much more tractable (and far easier) to inspect the C program under gdb or somesuch for bugs, if you can rule out the Haskell side as the culprit (or, alternatively, you want proof it is the Haskell side.)

You can find the invalid buffer here: https://github.com/mwotton/lz4hs/blob/master/tests/Properties.hs#L29 which should be pretty easy to play with. I believe I just dumped the original and truncated values into a binary file, and played around that way.

At the time, I don't believe I had ruled out any lz4 bugs in my analysis - although I like to think those types of guys are a lot better at writing code than me. :) I also only had the resources to investigate for about a day or two at the time. So I speculate I could have possibly missed something in the FFI binding...

I'm on IRC 24/7 on #haskell and related channels under this username; feel free to ping me questions and I'll reply ASAP, or work with you to try and fix this bug over the weekend.

PS. Looking at the output of the encoded Tries above, you can see here the truncation occurs at offset 0x70. In the original compressed file, we have:

00000070 | 00 00 00 00 00 08 00 00 00 00 00 00 00 04 53 48

While in the decompressed file, we have:

00000070 | 00 00 00 00 00 08 00 00 00 00 00 00 00

The bytes 0x04 0x53 0x48 correspond to the ASCII bytes "SHA", which is the beginning of "SHA1", which is one of the components of the constructed trie (in the genHash field.) Why it is truncated here, I'm not sure, but it's a hint that this identifier may be unique in some way. We should try changing all the strings in all the fields with the example, and see if the bug persists. That's an easy way to see something crazy happen.

Finally, note that lz4 has seen many updates since this bug was filed; it's on revision 87+ now, as opposed to r75 which is included in the package. When I updated this package to a newer version, the bug persisted. It's worth trying to update lz4's C code first with the package, and see what happens.

Hope that helps.

- A

cartazio commented 11 years ago

Ok, i've tried using the updated code from lz4, and that bug still holds. Interesting, the same problem happens when I change the test suite example to use compressHC, which on the compress side SHOULD be different code paths.

This means that perhaps the problem is with the decompress code? I'll play with it some more.

Cyan4973 commented 11 years ago

I'm sure interested if you can find a bug in the C routine.

I've had a quick look at the link proposed, and noticed that, just before the truncated part ("SHA1") there is an escape character : \EOT

To give more context : \NUL\NUL\NUL\NUL\EOTSHA1

What is \EOT ?

cartazio commented 11 years ago

@Cyan4973 thats the 004 byte, as a text byte in ascii its rendered as the end of transmission char.

The thing you have to remember is that this is a byte (well byte and nibble) oriented binary data compression format, not a textual format, so any character encoding rendering is incidental. (At least as far as I understand things)

Cyan4973 commented 11 years ago

OK, clear, so the bytes 0x04 0x53 0x48 corresponds to \EOT S H.

any character encoding rendering is incidental

From an LZ4 pespective, sure, but what about the test program itself ?

For example, should you try to pipe "as is" a binary stream to a program under windows, it would get "interpreted" on the fly by windows, doing weird things to Return Carriage & Line Feed characters and such. It will only work as pure binary if the following instruction is given to the program (in C) :

ifdef _WIN32 // Need to set stdin/stdout to binary mode specifically for windows

    _setmode( _fileno( stdout ), _O_BINARY );

endif

That's just an example. I don't know how it works in Haskell. Just a warning that sending a "stream of characters" and expecting them to behave as an "unmodified stream of bytes" might sometimes lead to strange results.

Anyway, in order to progress on the reported error case, i would suggest : 1) to play with the \EOT character, and see if it truncates at other positions 2) to produce a binary file which contains the sequence expressed in the test program. This will allow to test it on a larger set of ports (Java, C#, Python, etc.)

cartazio commented 11 years ago

@Cyan4973 I've done the testing on mac so windows quirks are irrelevant.

likewise, bytestrings in haskell are literally bytearrays, that they can be treated as extended ascii is accidental.

heres the encoding of that test text that @mwotton linked to from the test suite as a sequence of 8bit numbers. (by using Bytestring.unpack on the text )

[2,100,2,1,0,0,0,0,0,0,0,11,101,120,97,109,112,108,101,46,99,111,109,1,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,15,0,0,0,0,0,0,0,9,87,104,105,114,108,112,111,111,108,0,0,0,0,0,0,0,11,101,120,97,109,112,108,101,46,99,111,109,0,2,102,1,1,0,0,0,0,0,0,0,12,102,97,99,101,98,111,111,107,46,99,111,109,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,4,83,72,65,49,0,0,0,0,0,0,0,12,102,97,99,101,98,111,111,107,46,99,111,109,0,1,0,0,0,0,0,0,0,9,103,109,97,105,108,46,99,111,109,1,0,0,0,0,0,0]

Cyan4973 commented 11 years ago

The bytestream compression/decoding works successfully on the provided test suite using the C version.

cartazio commented 11 years ago

@Cyan4973 thanks, thats helpful! I'm going to stare at the haskell decompress wrapper (which has a lot going on) and see if theres something wrong there. https://github.com/mwotton/lz4hs/blob/master/src/Codec/Compression/LZ4.hsc#L116

cartazio commented 11 years ago

I modified decompress as so:

-- | Decompress the input 'ByteString'.
decompress :: S.ByteString -> Maybe S.ByteString
decompress xs
  | S.null xs = Just S.empty
  | otherwise =
      -- Get the length of the uncompressed buffer and do our thing
      either (const Nothing) (unsafePerformIO . go) $ runGet unformat xs
  where go (l, str) =
          U.unsafeUseAsCString str $ \cstr -> do
            out <- SI.createAndTrim l $ \p -> do
                  r <- fromIntegral <$> c_LZ4_uncompress cstr p (fromIntegral l)
                  return $! if l > r then (error $ "uncompressed size differs from expected\n " ++ show str ++ "\n with headers was\n" 
                            ++ show xs) else  if (r <= 0) then 0 else r
            return $! if (S.null out) then Nothing else (Just out)
{-# INLINEABLE decompress #-}

the key bit is checking if l > r , because we should have that l=r in this case because the compress method records the size of the original string in the "header" of our little format.

the show instance for the offending string here give me the message 1) regression test can compress an oddly full-of-NULLs string FAILED uncompressed size differs from expected "R\STXd\STX\SOH\NUL\SOH\NUL\196\vexample.com\DC4\NUL\DC2\ETX\ESC\NUL#\NUL\SI\b\NUL\163\tWhirlpool\DC1\NUL\b5\NULD\NUL\STXf\SOH9\NUL\152\ffacebookN\NUL\EOT\GS\NUL\DC3\b=\NULS\EOTSHA1\f\NUL\t1\NUL\ENQ\NUL\240\STX\tgmail.com\SOH\NUL\NUL\NUL\NUL\NUL\NUL" with headers was "\176\NUL\NUL\NULk\NUL\NUL\NULR\STXd\STX\SOH\NUL\SOH\NUL\196\vexample.com\DC4\NUL\DC2\ETX\ESC\NUL#\NUL\SI\b\NUL\163\tWhirlpool\DC1\NUL\b5\NULD\NUL\STXf\SOH9\NUL\152\ffacebookN\NUL\EOT\GS\NUL\DC3\b=\NULS\EOTSHA1\f\NUL\t1\NUL\ENQ\NUL\240\STX\tgmail.com\SOH\NUL\NUL\NUL\NUL\NUL\NUL"

NOTE this is the compressed string (the with and without headers version...), and the header from compression was "\176\NUL\NUL\NULk\NUL\NUL\NUL"

i'm off for the evening, but maybe this helps give everyone some breadcrumbs

cartazio commented 11 years ago

SOLVED IT

we already know the uncompressed length l, we should be ignoring the r value returned by the uncompress_ffi function.

this definition of uncompress passes the test suite:

-- | Decompress the input 'ByteString'.
decompress :: S.ByteString -> Maybe S.ByteString
decompress xs
  | S.null xs = Just S.empty
  | otherwise =
      -- Get the length of the uncompressed buffer and do our thing
      either (const Nothing) (unsafePerformIO . go) $ runGet unformat xs
  where go (l, str) =
          U.unsafeUseAsCString str $ \cstr -> do
            out <- SI.createAndTrim l $ \p -> do
                  r <- fromIntegral <$> c_LZ4_uncompress cstr p (fromIntegral l)
                  return $! {-if l > r then (error $ "uncompressed size differs from expected\n " ++ show str ++ "\n with headers was\n" 
                            ++ show xs) else -} if (r <= 0) then 0 else  l--  r
            return $! if (S.null out) then Nothing else (Just out)
{-# INLINEABLE decompress #-}

i'll open a pull request so we can discussed this better tomorrow.

thoughtpolice commented 11 years ago

Damn, that bug is obvious in retrospect! The fix of returning l instead of r is indeed completely correct I think.

Good job!

cartazio commented 11 years ago

@thoughtpolice so now the question is why c_lz4_uncompress is returning the wrong Int!

thanks!

Cyan4973 commented 11 years ago

now the question is why c_lz4_uncompress is returning the wrong Int!

The C version of LZ4_uncompress returns the correct int.

we should have that l=r

Not sure what "l" and "r" are, but just in case, here is a short reminder of LZ4_uncompress definition :

(straight from the source) int LZ4_uncompress (const char* source, char* dest, int osize); return : the number of bytes read in the source buffer (in other words, the compressed size)

So, if "l" is the original size, and "r" is the size returned by the function, they cannot be equal.

cartazio commented 11 years ago

huh, so the bindings should never used the result of LZ4_uncompress in the first place aside from checking for an error condition.

the bindings were using that result int as the count of the number of bytes used in the decompressed result array.

glad thats been exorcised!

mwotton commented 11 years ago

me too. pull-request me up, bitte?

cartazio commented 11 years ago

@mwotton consider yourself pull requested

mwotton commented 11 years ago

merged and released on hackage as 0.2.1. cheers everyone.