libarchive / libarchive

Multi-format archive and compression library
http://www.libarchive.org

Executable bit in ZIP archives gets thrown away when reading from stdin. #1106

Closed (dreirund closed this issue 5 years ago)

dreirund commented 5 years ago

I found that bsdtar from the libarchive package (at least on Arch Linux) throws away the executable bit of files in .zip archives when reading from stdin, but not when working on the archive file directly.

For .tar archives, it preserves the executable bit even when reading from stdin.

bsdtar --version: bsdtar 3.3.3 - libarchive 3.3.3 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 liblz4/1.8.2 libzstd/1.3.5.

Test case:

I made a test archive, http://felics.kettenbruch.de/files/archive_executable_bit_test/archive_exevutable_bit_test.zip, which contains two files in a subdirectory. One file is executable, the other is not.

Extracting directly:

wget -q -O archive_exevutable_bit_test.zip http://felics.kettenbruch.de/files/archive_executable_bit_test/archive_exevutable_bit_test.zip
bsdtar -x -f archive_exevutable_bit_test.zip
ls -nl archive_exevutable_bit_test/*

shows

-rwxr-xr-x 1 1001 1001 35 Dec 11 13:38 archive_exevutable_bit_test/executable.sh
-rw-r--r-- 1 1001 1001 33 Dec 11 13:39 archive_exevutable_bit_test/non-executable.txt

The executable bit for executable.sh is present here.

Reading from stdin:

wget -q -O - http://felics.kettenbruch.de/files/archive_executable_bit_test/archive_exevutable_bit_test.zip | bsdtar -x -f -
ls -nl archive_exevutable_bit_test/*

shows

-rw-r--r-- 1 1001 1001 35 Dec 11 13:38 archive_exevutable_bit_test/executable.sh
-rw-r--r-- 1 1001 1001 33 Dec 11 13:39 archive_exevutable_bit_test/non-executable.txt

The executable bit for executable.sh is thrown away here.

.tar-archive:

As a comparison, for a .tar archive the executable bit in the archive is honoured even when reading from stdin:

wget -q -O - http://felics.kettenbruch.de/files/archive_executable_bit_test/archive_exevutable_bit_test.tar | bsdtar -x -f -
ls -nl archive_exevutable_bit_test/*

shows

-rwxr-xr-x 1 1001 1001 35 Dec 11 13:38 archive_exevutable_bit_test/executable.sh
-rw-r--r-- 1 1001 1001 33 Dec 11 13:39 archive_exevutable_bit_test/non-executable.txt

Expected behaviour:

jsonn commented 5 years ago

Zip archives contain two different ways of describing their content: (1) a per-entry header and (2) a central directory at the end of the zip file. libarchive (and bsdtar by extension) will use the central directory if seeking is possible on the input; otherwise it falls back to the streaming-only logic. The two descriptions are not necessarily consistent, as you found out in your test case. There isn't really much we can or want to do about this. Note that you can replace wget with a plain cat and it will still show the same behavior.

The short version is that this is an inherent issue with streaming of zip files and something that won't be fixed.
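
To illustrate the two code paths described above, here is a minimal Python sketch. It is not libarchive code: `Pipe`, `list_modes` and `build_sample` are hypothetical names, and the streaming branch assumes that entry sizes are stored in the per-entry headers (no data descriptors). With seekable input it asks Python's `zipfile` module for the central directory; with non-seekable input it can only walk the per-entry headers, where no permission field is available.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"
# Local file header: signature, version, flags, method, time, date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4s5H3I2H")


class Pipe:
    """Hypothetical stand-in for stdin: readable, but not seekable."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def read(self, size):
        return self._inner.read(size)

    def seekable(self):
        return False


def list_modes(stream):
    """Map entry name to Unix mode, or to None when the mode is unknown."""
    if stream.seekable():
        # Seekable input: the central directory at the end is available.
        with zipfile.ZipFile(stream) as zf:
            return {i.filename: i.external_attr >> 16 for i in zf.infolist()}
    # Streaming input: only the per-entry headers can be read, in order.
    # Assumes sizes are stored in the header (no data descriptors).
    modes = {}
    while True:
        raw = stream.read(LOCAL_HEADER.size)
        if len(raw) < LOCAL_HEADER.size or raw[:4] != LOCAL_SIG:
            break  # reached the central directory or the end of input
        fields = LOCAL_HEADER.unpack(raw)
        compressed_size, name_len, extra_len = fields[7], fields[9], fields[10]
        name = stream.read(name_len).decode("utf-8")
        stream.read(extra_len + compressed_size)  # skip extra field and data
        modes[name] = None  # the standard local header has no permission field
    return modes


def build_sample():
    """Build a two-entry archive in memory, one entry marked executable."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, mode in (("executable.sh", 0o100755),
                           ("non-executable.txt", 0o100644)):
            info = zipfile.ZipInfo(name)
            info.external_attr = mode << 16  # Unix mode in the upper 16 bits
            zf.writestr(info, "echo hi\n")
    return buf.getvalue()
```

Calling `list_modes(io.BytesIO(build_sample()))` yields the stored modes, while `list_modes(Pipe(build_sample()))` lists the same names with `None` for every mode, which mirrors the difference between the two bsdtar invocations in the report.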

dreirund commented 5 years ago

According to http://unix.stackexchange.com/questions/487338#487371, this also happens if bsdtar itself created the ZIP archive. Shouldn't libarchive at least create consistent meta-information (per-entry header and central directory carrying the same information), so that archives created by libarchive are extracted correctly by libarchive? Maybe this is then a bug in libarchive, in that it creates ZIP archives with inconsistent information?

Is there any standard for ZIP that says which information (per-entry header or central directory) is more trustworthy?

jsonn commented 5 years ago

bsdtar doesn't write the extension that carries this metadata in the per-entry header by default; it can be requested with --options zip:experimental.

dreirund commented 5 years ago

Why does libarchive's bsdtar throw an error on ISO files when it cannot seek in the input, but not when it runs into this kind of inconsistency in ZIP files? Shouldn't it then throw an error for ZIP files, too?

jsonn commented 5 years ago

Because ISO files are not streamable in most situations in a meaningful way. File attributes, on the other hand, are often enough absent in zip files.

kientzle commented 5 years ago

As Joerg pointed out, there are basic limitations with some of the formats we deal with:

As a workaround, libarchive's Zip support includes an experimental extension (developed in conjunction with the Info-Zip maintainers) that puts more complete metadata with each entry. I hope to enable this by default at some point.

In theory, the streaming Zip reader could read the full metadata when it does get to the end and update all the files. This would require some careful rework of the Zip reader and probably changes to the logic that writes files to disk. In essence, every file would get "written to disk" twice: Once with full data and partial metadata, again with full metadata and no data.
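
The two-pass idea can be sketched in Python as follows. This is only an illustration of the approach described above, not how libarchive works: all function names are hypothetical, data descriptors are not supported, only stored and deflated entries are handled, and paths are not sanitised. The reader never seeks. It writes each file as its per-entry header arrives, then keeps reading into the central directory and applies the permissions it finds there.

```python
import io
import os
import struct
import tempfile
import zipfile
import zlib

LOCAL_REST = struct.Struct("<5H3I2H")      # local header after its signature
CENTRAL_REST = struct.Struct("<6H3I5H2I")  # central record after its signature


def extract_entry(stream, dest):
    """Pass 1: write the file data; permissions are not known yet."""
    fields = LOCAL_REST.unpack(stream.read(LOCAL_REST.size))
    flags, method, compressed_size = fields[1], fields[2], fields[6]
    name_len, extra_len = fields[8], fields[9]
    if flags & 0x08:
        raise NotImplementedError("data descriptors are outside this sketch")
    name = stream.read(name_len).decode("utf-8")
    stream.read(extra_len)
    payload = stream.read(compressed_size)
    if method == 8:  # deflate, stored as a raw stream
        payload = zlib.decompress(payload, -15)
    elif method != 0:  # 0 means stored
        raise NotImplementedError("unsupported compression method")
    # No path sanitising here: do not run this on untrusted archives.
    with open(os.path.join(dest, name), "wb") as out:
        out.write(payload)


def apply_mode(stream, dest):
    """Pass 2: take the mode from a central directory record and chmod."""
    fields = CENTRAL_REST.unpack(stream.read(CENTRAL_REST.size))
    made_by, external_attr = fields[0], fields[14]
    name_len, extra_len, comment_len = fields[9], fields[10], fields[11]
    name = stream.read(name_len).decode("utf-8")
    stream.read(extra_len + comment_len)
    mode = (external_attr >> 16) & 0o777
    if made_by >> 8 == 3 and mode:  # host system 3 means Unix
        os.chmod(os.path.join(dest, name), mode)


def stream_extract(stream, dest):
    """Read the archive front to back, without seeking."""
    while True:
        signature = stream.read(4)
        if signature == b"PK\x03\x04":
            extract_entry(stream, dest)
        elif signature == b"PK\x01\x02":
            apply_mode(stream, dest)
        else:
            break  # end-of-central-directory record or end of input


def demo():
    """Extract a one-entry sample archive and return the resulting mode."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("executable.sh")
        info.create_system = 3  # mark the entry as created on Unix
        info.external_attr = 0o100755 << 16
        zf.writestr(info, "echo hi\n", compress_type=zipfile.ZIP_DEFLATED)
    with tempfile.TemporaryDirectory() as dest:
        stream_extract(io.BytesIO(buf.getvalue()), dest)
        return os.stat(os.path.join(dest, "executable.sh")).st_mode & 0o777
```

On a POSIX system, `demo()` returns the mode recorded in the central directory. The window between the two passes, during which files exist on disk with default permissions, is the cost of this design.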

kientzle commented 5 years ago

"Is there any standard for ZIP that says which information (per-entry header or central directory) is more trustworthy?"

The Zip standard is here: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

If you study this carefully, you'll notice that the file permissions are only stored in the central directory. All other metadata should be the same. The zip:experimental adds an extension to the per-entry header which duplicates the file permissions that are present in the central directory.
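
As a rough illustration of where the permissions live, the following Python sketch reads the central directory records by hand and decodes the "external file attributes" field. The function names are hypothetical, and the parser makes simplifying assumptions: no Zip64, no archive comment containing the end-of-central-directory signature, and no data prepended to the archive. By widespread Unix convention, the upper 16 bits of that field hold the file mode when the "version made by" host byte is 3 (Unix).

```python
import io
import stat
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
EOCD = struct.Struct("<4s4H2IH")        # end-of-central-directory record
CENTRAL = struct.Struct("<4s6H3I5H2I")  # central directory file header


def central_directory_modes(data):
    """Return {name: ls-style mode string} read from the central directory."""
    eocd_pos = data.rfind(EOCD_SIG)
    if eocd_pos < 0:
        raise ValueError("no end-of-central-directory record found")
    fields = EOCD.unpack_from(data, eocd_pos)
    entry_count, cd_offset = fields[4], fields[6]

    modes = {}
    pos = cd_offset
    for _ in range(entry_count):
        record = CENTRAL.unpack_from(data, pos)
        made_by, external_attr = record[1], record[15]
        name_len, extra_len, comment_len = record[10], record[11], record[12]
        name_start = pos + CENTRAL.size
        name = data[name_start:name_start + name_len].decode("utf-8")
        if made_by >> 8 == 3:  # host system 3 means Unix
            modes[name] = stat.filemode(external_attr >> 16)
        else:
            modes[name] = None  # attributes follow another host's convention
        pos = name_start + name_len + extra_len + comment_len
    return modes


def build_sample():
    """Build a two-entry archive in memory, one entry marked executable."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, mode in (("executable.sh", 0o100755),
                           ("non-executable.txt", 0o100644)):
            info = zipfile.ZipInfo(name)
            info.create_system = 3  # mark the entry as created on Unix
            info.external_attr = mode << 16
            zf.writestr(info, "echo hi\n")
    return buf.getvalue()
```

`central_directory_modes(build_sample())` returns `-rwxr-xr-x` for `executable.sh` and `-rw-r--r--` for `non-executable.txt`. The standard per-entry header has no corresponding field, so a reader that never reaches the central directory cannot recover these values.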

epicfaace commented 3 years ago

  • Libarchive's Zip reader will seek to obtain full metadata if it can; otherwise it will use the partial metadata.

@kientzle quick question -- when libarchive is streaming a .zip file and using only the partial metadata, how does it deal with the possibility mentioned here that some files might not actually be listed in the central directory and thus should not be extracted, and with the possibility that there is extra data between file chunks or before the first file chunk? Does it just assume that the zip file is not one of these special cases, or does it try to read the central directory at the end to somehow correct what has already been extracted?

kientzle commented 3 years ago

In theory, libarchive could stream Zip archives by extracting all the entries, then reading the central directory and using that information to edit the data on disk. It does not currently do this. As a result, it cannot fully handle some of the pathological cases you describe while performing a streaming extraction.

Libarchive does have error-recovery logic that can to a limited extent deal with garbage data appearing in the archive (between entries or before the first entry). You can see the details starting around line 3146 of the read_header function here: https://github.com/libarchive/libarchive/blob/5bb998d869979140156bce59c0ff8f9063a25581/libarchive/archive_read_support_format_zip.c#L3102
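
To give a feel for what such recovery can look like, here is a small Python sketch that scans a buffer for the next plausible per-entry header. It does not reproduce libarchive's actual logic: `find_next_entry`, `KNOWN_METHODS` and the plausibility checks are assumptions made for this illustration only.

```python
import struct

LOCAL_SIG = b"PK\x03\x04"
# Local file header: signature, version, flags, method, time, date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4s5H3I2H")
KNOWN_METHODS = (0, 8)  # assumption: accept only stored and deflate


def find_next_entry(buf, start=0):
    """Return the offset of the next plausible local header, or -1."""
    pos = buf.find(LOCAL_SIG, start)
    while pos != -1:
        if len(buf) - pos >= LOCAL_HEADER.size:
            fields = LOCAL_HEADER.unpack_from(buf, pos)
            method, name_len, extra_len = fields[3], fields[9], fields[10]
            header_end = pos + LOCAL_HEADER.size + name_len + extra_len
            # A bare signature can occur in garbage or file data, so also
            # require a known method and a name that fits in the buffer.
            if method in KNOWN_METHODS and name_len > 0 and header_end <= len(buf):
                return pos
        pos = buf.find(LOCAL_SIG, pos + 1)
    return -1
```

Given garbage bytes followed by a real entry, the function skips both the garbage and any stray signature inside it that fails the checks. Heuristics like these can still be fooled, which is consistent with the "limited extent" caveat above.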