oxinabox / DataDeps.jl

reproducible data setup for reproducible science

7zip changes the file names when unpacking #134

Open Djoop opened 3 years ago

Djoop commented 3 years ago

I have some code using the unpack function which fails with DataDeps 7.7 (apparently there were some changes to use 7zip on all platforms; I am not sure exactly when the breaking change happened). I have a wrapper for the following dataset: http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz (I am not sure whether this particular file matters or whether this is a generic error). The archive contains a file called "kddcup.data_10_percent" (as can be seen e.g. using gunzip -l …), yet unpack creates a file called kddcup.data_10_percent_corrected (with some other files, the name ended in .corrected).

Unpacking runs without error; however, this is inconvenient because I was expecting it to respect the file names (which was the behavior with previous versions of DataDeps). Or is there a special function to use in order to obtain the path of the unpacked files?

oxinabox commented 3 years ago

Weird. I have never seen that happen before. I think this is an upstream bug in 7zip. Can you see if you can reproduce it with 7zip alone?

As a workaround you can add to the registration block:

post_fetch_method = compressed_filename -> run(`gunzip -l ...`)

Djoop commented 3 years ago

Indeed, it seems to be an upstream bug (actually, I don't know whether the bug is in 7zip or in gunzip…). Here is what I get with the 7zip packaged with my distribution: the same archive yields two different file names with gunzip and 7z.

$ 7z l kddcup.data_10_percent.gz

7-Zip [64] 17.03 : Copyright (c) 1999-2020 Igor Pavlov : 2017-08-28
p7zip Version 17.03 (locale=fr_FR.UTF-8,Utf16=on,HugeFiles=on,64 bits,12 CPUs x64)

Scanning the drive for archives:
1 file, 2144903 bytes (2095 KiB)

Listing archive: kddcup.data_10_percent.gz

--
Path = kddcup.data_10_percent.gz
Type = gzip
Headers Size = 43

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2007-06-08 04:35:37 .....     74889749      2144903  kddcup.data_10_percent_corrected
------------------- ----- ------------ ------------  ------------------------
2007-06-08 04:35:37           74889749      2144903  1 files
------------------------------------------------------------------------------------------------

$ gunzip -l kddcup.data_10_percent.gz
         compressed        uncompressed  ratio uncompressed_name
            2144903            74889749  97.1% kddcup.data_10_percent

I don't know if there is anything special with this archive as I did not create it, yet this is surprising. Thanks for the workaround, I guess it works only if there is a single file in the archive?
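
One plausible explanation, not confirmed anywhere in this thread: the gzip format can record the original file name in its header, while gunzip -l by default reports a name derived from the archive's own file name. If the archive was renamed after compression, a tool that reads the stored name will disagree with gunzip -l. The "Headers Size = 43" line above is at least consistent with this (a 10-byte fixed header plus the 32-character name kddcup.data_10_percent_corrected and a terminating NUL). The sketch below reproduces the effect with GNU gzip; the file names are made up for illustration.

```shell
#!/bin/sh
# Sketch: show that a renamed .gz archive carries two candidate names.
# All file names here are hypothetical; assumes GNU gzip is installed.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Compress a file; gzip records "original_name.txt" in the header by default.
printf 'some data\n' > original_name.txt
gzip original_name.txt                 # produces original_name.txt.gz

# Rename the archive, as a mirror or maintainer might have done.
mv original_name.txt.gz renamed_archive.gz

# Default listing: the name is derived from the archive's file name.
gzip -l renamed_archive.gz

# With -N, the listing reports the name stored in the header instead.
gzip -lN renamed_archive.gz

# Extracting with -N restores the stored name, not "renamed_archive".
gzip -dN renamed_archive.gz
ls
```

Running gunzip -lN on the kddcup archive would show whether it carries such a stored name.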

oxinabox commented 3 years ago

Thanks for the workaround, I guess it works only if there is a single file in the archive?

Well, you can run whatever you want. E.g. tar -xzf ... will handle gzipped tarballs.
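
For a single-file .gz, one way to keep control of the output name is to decompress to standard output and redirect to the name you expect. Note that the -l flag only lists the archive contents and does not extract anything, so a command used for unpacking needs to decompress. The following is a minimal sketch with made-up file names, assuming GNU gzip; it is not taken from DataDeps itself.

```shell
#!/bin/sh
# Sketch: decompress a .gz to an explicitly chosen file name.
# File names are hypothetical; assumes GNU gzip is installed.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Build a renamed archive to stand in for the downloaded file.
printf 'some data\n' > stored_name.txt
gzip stored_name.txt
mv stored_name.txt.gz download.gz

# -d decompresses, -c writes to stdout and leaves download.gz in place.
# The output name is whatever we redirect to, regardless of any name
# stored in the header or derived from the archive's file name.
gzip -dc download.gz > wanted_name.txt
```

The same command line could be run from a post_fetch_method, with the archive path and the desired output name substituted in.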