Closed dreirund closed 5 years ago
Zip archives contain two different ways to describe the content: (1) a per-entry header and (2) a central directory at the end of the zip file. libarchive (and bsdtar by extension) will use the central directory if seeking is possible on the input; otherwise it will fall back to the streaming-only logic. The two descriptions are not necessarily consistent, as you found out in your test case. There isn't really much we can or want to do about this. Note that you can replace wget with a plain cat and it will still show the same behavior.
The short version is that this is an inherent issue with streaming of zip files and something that won't be fixed.
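To illustrate the difference between the two code paths, here is a minimal Python sketch. It is not libarchive's implementation: `PipeLike`, `build_archive`, and `list_entries` are hypothetical names, the seekable path uses Python's `zipfile` module as a stand-in for a central-directory reader, and the streaming path assumes every local header carries its sizes (no data descriptors).

```python
import io
import stat
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"  # signature of a per-entry (local) header


class PipeLike:
    """Hypothetical stand-in for stdin: readable, but not seekable."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def seekable(self):
        return False

    def read(self, n=-1):
        return self._buf.read(n)


def build_archive():
    """Create a small in-memory archive with one executable entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("dir/executable.sh")
        info.create_system = 3  # 3 = Unix
        info.external_attr = (stat.S_IFREG | 0o755) << 16
        zf.writestr(info, "#!/bin/sh\necho hello\n")
    return buf.getvalue()


def list_entries(stream):
    """Return (strategy, [(name, permission bits or None)])."""
    if stream.seekable():
        # Seekable input: the central directory at the end is reachable.
        with zipfile.ZipFile(stream) as zf:
            return "central-directory", [
                (i.filename, stat.S_IMODE(i.external_attr >> 16))
                for i in zf.infolist()
            ]
    # Non-seekable input: walk the per-entry headers front to back.
    entries = []
    while True:
        header = stream.read(30)  # fixed part of a local header
        if len(header) < 30 or header[:4] != LOCAL_SIG:
            break
        fields = struct.unpack("<4s5H3L2H", header)
        compressed_size, name_len, extra_len = fields[7], fields[9], fields[10]
        name = stream.read(name_len).decode("utf-8")
        stream.read(extra_len)        # skip the extra field
        stream.read(compressed_size)  # skip the file data
        entries.append((name, None))  # no permission field to report
    return "streaming", entries


data = build_archive()
seekable_result = list_entries(io.BytesIO(data))
streamed_result = list_entries(PipeLike(data))
```

With the same bytes, the seekable input yields the permission bits while the pipe-like input yields only the name, which mirrors the file-versus-stdin difference reported in this issue.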
According to http://unix.stackexchange.com/questions/487338#487371, this also happens if bsdtar itself created the ZIP archive. Shouldn't libarchive at least create consistent meta-information (per-entry header and central directory carrying the same information), so that archives created by libarchive are extracted correctly by libarchive? Or is this a bug in libarchive, in that it creates ZIP archives with inconsistent information?
Is there any standard for ZIP that says which information (per-entry header or central directory) is to be trusted more?
bsdtar doesn't create the extension by default; it can be requested with --options zip:experimental.
Because ISO files are not streamable in a meaningful way in most situations. File attributes, on the other hand, are often enough absent from zip files.
As Joerg pointed out, there are basic limitations with some of the formats we deal with:
As a workaround, libarchive's Zip support includes an experimental extension (developed in conjunction with the Info-Zip maintainers) that puts more complete metadata with each entry. I hope to enable this by default at some point.
In theory, the streaming Zip reader could read the full metadata when it does get to the end and update all the files. This would require some careful rework of the Zip reader and probably changes to the logic that writes files to disk. In essence, every file would get "written to disk" twice: Once with full data and partial metadata, again with full metadata and no data.
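The two-pass idea can be sketched in Python as follows. This is an illustration only, not how libarchive works today: `first_pass_write_data` and `second_pass_fix_modes` are hypothetical helpers, `zipfile` stands in for the streaming reader in the first pass, and the demo assumes a POSIX file system and a trusted archive.

```python
import io
import os
import stat
import tempfile
import zipfile


def build_archive():
    """Create a small in-memory archive with one executable entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("dir/executable.sh")
        info.create_system = 3  # 3 = Unix
        info.external_attr = (stat.S_IFREG | 0o755) << 16
        zf.writestr(info, "#!/bin/sh\necho hello\n")
    return buf.getvalue()


def first_pass_write_data(archive_bytes, dest):
    """Pass 1: write full data with default (partial) metadata."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            path = os.path.join(dest, info.filename)  # trusted names only
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as out:
                out.write(zf.read(info))


def second_pass_fix_modes(archive_bytes, dest):
    """Pass 2: apply central-directory modes; no file data is rewritten."""
    fixed = []
    root = os.path.realpath(dest)
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            mode = stat.S_IMODE(info.external_attr >> 16)
            path = os.path.realpath(os.path.join(root, info.filename))
            # Skip names that would escape the destination directory.
            if not path.startswith(root + os.sep):
                continue
            if mode and os.path.isfile(path):
                os.chmod(path, mode)
                fixed.append(info.filename)
    return fixed


with tempfile.TemporaryDirectory() as dest:
    data = build_archive()
    first_pass_write_data(data, dest)
    target = os.path.join(dest, "dir", "executable.sh")
    before = bool(os.stat(target).st_mode & 0o111)  # executable after pass 1?
    fixed = second_pass_fix_modes(data, dest)
    after = bool(os.stat(target).st_mode & 0o111)   # executable after pass 2?
```

The second pass touches only metadata, which is why it needs the files from the first pass to still be addressable on disk once the central directory has been read.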
Is there any standard for ZIP that says which information (per-entry header or central directory) is to be trusted more?
The Zip standard is here: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
If you study this carefully, you'll notice that the file permissions are only stored in the central directory. All other metadata should be the same. The zip:experimental option adds an extension to the per-entry header which duplicates the file permissions that are present in the central directory.
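As a rough illustration of where the permissions live, the following Python sketch parses the fixed-size part of both header types from a tiny archive written by Python's `zipfile` module. The field names are my own labels for the layout in APPNOTE.TXT, `parse` is a hypothetical helper, and the sketch assumes an archive without a trailing comment so that the end-of-central-directory record is exactly the last 22 bytes.

```python
import io
import stat
import struct
import zipfile

# Fixed-size portions of the two header types (labels are illustrative).
LOCAL_FIELDS = (
    "signature", "version_needed", "flags", "compression", "mod_time",
    "mod_date", "crc32", "compressed_size", "uncompressed_size",
    "name_length", "extra_length",
)
CENTRAL_FIELDS = (
    "signature", "version_made_by", "version_needed", "flags", "compression",
    "mod_time", "mod_date", "crc32", "compressed_size", "uncompressed_size",
    "name_length", "extra_length", "comment_length", "disk_number_start",
    "internal_attributes", "external_attributes", "local_header_offset",
)


def parse(data, offset, fmt, names):
    """Unpack one fixed-size header at `offset` into a name -> value dict."""
    size = struct.calcsize(fmt)
    return dict(zip(names, struct.unpack(fmt, data[offset:offset + size])))


# Build a one-entry archive whose entry is marked executable.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    info = zipfile.ZipInfo("executable.sh")
    info.create_system = 3  # 3 = Unix
    info.external_attr = (stat.S_IFREG | 0o755) << 16
    zf.writestr(info, "#!/bin/sh\n")
data = buf.getvalue()

# The per-entry header sits at the very start of the archive.
local = parse(data, 0, "<4s5H3L2H", LOCAL_FIELDS)

# The end-of-central-directory record points at the central directory.
eocd = struct.unpack("<4s4H2LH", data[-22:])
central = parse(data, eocd[6], "<4s6H3L5H2L", CENTRAL_FIELDS)

# Unix permissions are carried in the upper 16 bits of the external attributes.
mode = central["external_attributes"] >> 16
```

The per-entry header has no field corresponding to `external_attributes`, so a reader that only sees per-entry headers has nowhere to take the permission bits from unless an extra-field extension supplies them.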
- Libarchive's Zip reader will seek to obtain full metadata if it can; otherwise it will use the partial metadata.
@kientzle quick question -- when libarchive is streaming a .zip file and just using the partial metadata, how does it deal with the possibility mentioned here that some files might not actually be listed in the central directory and thus should not be extracted, as well as the possibility that there is extra data between file chunks or before the first file chunk? Does it just assume that the zip file isn't one of these special cases, or does it try to read the central directory at the end to somehow correct what has already been extracted?
In theory, libarchive could stream Zip archives by extracting all the entries, then reading the central directory and using that information to edit the data on disk. It does not currently do this. As a result, it cannot fully handle some of the pathological cases you describe while performing a streaming extraction.
Libarchive does have error-recovery logic that can, to a limited extent, deal with garbage data appearing in the archive (between entries or before the first entry). You can see the details starting around line 3146 of the read_header function here: https://github.com/libarchive/libarchive/blob/5bb998d869979140156bce59c0ff8f9063a25581/libarchive/archive_read_support_format_zip.c#L3102
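The general idea of resynchronizing on garbage can be sketched in Python as below. This is not a transcription of libarchive's read_header logic; `resync` is a hypothetical helper that simply scans forward for the next per-entry header signature.

```python
import io
import zipfile

LOCAL_SIG = b"PK\x03\x04"  # signature of a per-entry (local) header


def resync(data, start=0):
    """Return the offset of the next local-header signature, or -1."""
    return data.find(LOCAL_SIG, start)


# Build a valid archive, then put garbage in front of the first entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
archive = buf.getvalue()

garbage = b"\x00garbage before the first entry\xff"
damaged = garbage + archive

offset = resync(damaged)  # where parsing could resume
```

A scan like this can only be a heuristic: the same four bytes may occur by chance inside file data, so a real reader has to validate whatever header it finds at the candidate offset.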
I found that the command bsdtar from the package libarchive (under Arch Linux, at least) throws away the executable bits of files in .zip archives when reading from stdin, but not when working directly on the file.
On .tar archives it preserves the executable bit also when reading from stdin.
bsdtar --version: bsdtar 3.3.3 - libarchive 3.3.3 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 liblz4/1.8.2 libzstd/1.3.5.
Test case:
I made a test archive http://felics.kettenbruch.de/files/archive_executable_bit_test/archive_exevutable_bit_test.zip which contains two files in a subdirectory. One file is executable, the other not.
Extracting directly:
shows
The executable bit for executable.sh is present here.
Reading from stdin:
shows
The executable bit for executable.sh is thrown away here.
.tar archive:
As a comparison, for a .tar archive, the executable bit in the archive is honoured also when reading from stdin:
shows
Expected behaviour: