snoyberg / tar-conduit

Conduit based tar extraction mechanism
MIT License
8 stars 9 forks source link

Fix headerFilePath function #12

Closed sergv closed 6 years ago

sergv commented 6 years ago

I believe that headerFilePath function is currently implemented incorrectly. That is, it should not just concatenate prefix with suffix, but rather it should combine them via /, according to https://github.com/Keruspe/tar-parser.rs/blob/master/tar.specs#L187.

Sure, that may be a random spec on the internet but here's a test case that works OK with GNU tar but not-so-OK with simple program using tar-conduit. Take the https://hackage.haskell.org/package/cabal-debian-4.24.8/cabal-debian-4.24.8.tar.gz archive and compile the following program:

{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE RankNTypes          #-}

module FastTagsBaselinesDebug (main) where

import Control.Monad.Catch (MonadMask)
import Control.Monad.Trans
import Control.Monad.Trans.Resource

import qualified Data.ByteString as BS
import Data.Foldable
import System.Environment
import System.IO as IO

import Data.Conduit
import qualified Data.Conduit.Binary as Conduit
import qualified Data.Conduit.Tar as Conduit
import qualified Data.Conduit.Zlib as Conduit

withBinaryFileResT
  :: MonadResource m
  => FilePath -> IOMode -> (Handle -> m a) -> m a
withBinaryFileResT path mode f = do
  (releaseKey, x) <- allocate (IO.openBinaryFile path mode) IO.hClose
  f x <* release releaseKey

processTar
  :: forall m. (MonadIO m, MonadResource m, MonadMask m)
  => FilePath -> m ()
processTar archiveFile =
  withBinaryFileResT archiveFile ReadMode $ \h ->
    Conduit.sourceHandle h =$=
    Conduit.ungzip =$=
    Conduit.untar $$ Conduit.withEntries handleEntry
  where
    handleEntry :: Conduit.Header -> Consumer BS.ByteString m ()
    handleEntry hdr =
      liftIO $ putStrLn $ show (Conduit.headerFileType hdr) ++ " " ++ Conduit.headerFilePath hdr

main :: IO ()
main = traverse_ (runResourceT . processTar) =<< getArgs

Its output (notice the contents of test-data folder):

$ test cabal-debian-4.24.8.tar.gz  | awk '/a1cb9e4b5241944a3da44e00220b5c31\/creativeprompts.com/'
FTNormal cabal-debian-4.24.8/test-datacreativeprompts/output/debian/cabalInstall/a1cb9e4b5241944a3da44e00220b5c31/creativeprompts.com

However, with tar everything's good:

$ tar tf /home/sergey/projects/haskell/hackage/archives/cabal-debian-4.24.8.tar.gz | awk '/a1cb9e4b5241944a3da44e00220b5c31\/creativeprompts.com/'
cabal-debian-4.24.8/test-data/creativeprompts/output/debian/cabalInstall/a1cb9e4b5241944a3da44e00220b5c31/creativeprompts.com

On a completely unrelated matter, I also noticed that header parsing code could be improved. Please voice your concerns if you believe it should go into separate PR.

snoyberg commented 6 years ago

There's currently a major PR in progress. Can you check if the problem is already addressed by https://github.com/snoyberg/tar-conduit/pull/11? I believe it totally overhauled the headerFilePath bit.

snoyberg commented 6 years ago

Ignore my previous comment, @lehins mentioned this PR to me, and said he would include the fix in his PR.

sergv commented 6 years ago

I think we can close this PR then. But I hope a test for this issue will be added to #11 since this kind of tar file seems to be rather rare. I'll move remaining optimization commit to another PR.

snoyberg commented 6 years ago

OK, closing. I've just merged the other PR to master, so if you'd like to rebase the optimizations onto it, now's a good time. Sorry for the confusion, long-lived PRs do this sometimes.