snoyberg / tar-conduit

Conduit based tar extraction mechanism
MIT License
8 stars 9 forks source link

Bug with untar where sometimes pax headers are not parsed properly #37

Closed bitc closed 7 months ago

bitc commented 7 months ago

Hello, I am trying out the new pax support added in #34

Thank you, it is a great feature and very much appreciated.

I am using the untar function to extract a tar file that contains long pathnames (path=my-long-path pax records).

It works 99% well but occasionally fails to parse the pax header for some files and ends up ignoring the long pathname.

I haven't fully diagnosed this, but the problem seems to be in the parsePax function.

https://github.com/snoyberg/tar-conduit/blob/96b18aadf50c75d2f599b383c35241bc2abb8478/src/Data/Conduit/Tar.hs#L496-L499

This function reads chunks containing a ByteString payload, and then calls paxParser with this ByteString value.

The problem is that sometimes this ByteString value is truncated, for example:

"70 path=very/very/very/very/long/file/path/that/got/tru"

The "70" indicates the length and you can see it got truncated and there is also no trailing "\n" char.

Immediately afterwards my traces show that there is a ChunkPayload containing the rest of the header ("ncated.txt\n").

When paxParser is called with the truncated header, the parse fails and the header is ignored, causing errors with the file.

So this seems to be an error related to the chunking that is done in conduit, and happens on the rare occasion when a pax header happens to cross a ByteString chunk boundary. When I change the conduit input to be a network stream instead of a file, then I think the input chunking is a bit different, and I get errors in a different pax header entry.

bitc commented 7 months ago

I analysed this further and it was as I suspected above. The bug occurs when the incoming flow of ByteStrings through the conduit are short, and we get lots of small ChunkPayloads instead of one contiguous one.

Here is the "untar, pax interchange format" test that I played around with:

https://github.com/snoyberg/tar-conduit/blob/96b18aadf50c75d2f599b383c35241bc2abb8478/tests/Spec.hs#L313-L318

I can make this test fail by "chopping" the conduit input bytes such that they flow in a single byte at a time:

     it "untar, pax interchange format" $ do
         res <- runConduitRes $
                paxExample
+            .| chop
             .| untar process
             .| sinkList
         pure res `shouldReturn` [("filepath", "payload")]

Here is the "chop" function:

chop :: (MonadIO m) => ConduitT ByteString ByteString m ()
chop = do
  next <- await
  case next of
    Nothing -> pure ()
    Just val -> do
      forM_ (splitEvery 1 (S.unpack val)) $ \chunk -> do
        yield (S.pack chunk)
      chop

splitEvery :: Int -> [a] -> [[a]]
splitEvery n = P.takeWhile (not . P.null) . P.map (P.take n) . P.iterate (P.drop n)

With this modification, the test fails with:

pax
  untar, pax interchange format [✘]

Failures:

  tests/Spec.hs:320:5: 
  1) pax untar, pax interchange format
       uncaught exception: TarException
       UnexpectedPayload 513

I have a fix, I will post it here shortly.

bitc commented 7 months ago

Here is the fix I came up with. parsePax should read all of the ChunkPayloads that are available in the input (instead of just the first). It should concatenate all of their payloads together and then call paxParser on the result.

-parsePax :: Monad m => ConduitM TarChunk TarChunk (StateT PaxState m) PaxHeader
-parsePax = await >>= \case
-    Just (ChunkPayload _ b) -> pure $ paxParser b
-    _ -> pure mempty
+parsePax :: Monad m => ConduitM TarChunk TarChunk (StateT PaxState m) PaxHeader
+parsePax = do
+    payloads <- readAllChunkPayloads []
+    let b = S.concat (reverse payloads)
+    pure $ paxParser b
+
+readAllChunkPayloads :: Monad m => [ByteString] -> ConduitT TarChunk o m [ByteString]
+readAllChunkPayloads accum = do
+    next <- await
+    case next of
+        Nothing -> pure accum
+        Just (ChunkPayload _ b) -> readAllChunkPayloads (b:accum)
+        Just other -> do
+            leftover other
+            pure accum

Someone who is better at conduit than me can probably write this in a more elegant way that is more efficient.

This fixes the original issue I had.

It also fixes the test crash with the "chop" modification I described above in https://github.com/snoyberg/tar-conduit/issues/37#issuecomment-1877639073 (note that the test will still fail, but this time for a different reason -- a flaw in the test assertion can that be easily fixed).

mpilgrem commented 7 months ago

@bitc, I am sorry that #34 was not bug-free. I'll work through your analysis but I'm likely not in a position to do that until the coming weekend.

mpilgrem commented 7 months ago

@bitc, I agree fully with your analysis and I have a pull request that reflects it and your solution: #39. My original mistake was not realising that pax extended header data could be provided in more than one sequential chunk.

The only thing I changed was that it seemed to me that the accumulator could be of type ByteString, rather than going via a list of ByteString and S.concat. I am assuming that in the great majority of cases, in practice, the pax extended header data would be provided in only a small number of sequential chunks.

bitc commented 7 months ago

Thank you! The fix in #39 looks perfect and I have tested it in my code and the problem is gone :+1:

Thank you again for this feature, pax support in tar-conduit is very useful and valuable :100: