tsolomko / SWCompression

A Swift framework for working with compression, archives and containers.
MIT License

Symlinks in tarballs are created with absolute paths #37

Closed kalkwarf closed 1 year ago

kalkwarf commented 1 year ago

Extracting a tarball containing symlinks results in symlinks with absolute paths. For example:

lrwxr-xr-x   1 me  staff    44B Oct 20 13:52 linked.zip@ -> /Users/me/symlink-test/subdirectory/file.zip
drwxr-xr-x   3 me  staff    96B Oct 20 13:52 subdirectory/

This makes the directory non-portable, as relocating symlink-test will break the target path.

The solution for this is to see if the source and destination share a prefix, and if so, rewrite the destination to be relative. This will give a relative link, like so:

lrwxr-xr-x   1 me  staff    21B Oct 20 14:40 linked.zip@ -> subdirectory/file.zip
drwxr-xr-x   4 me  staff   128B Oct 20 13:55 subdirectory/

This also matches the results when using the Mac's Archive Utility to extract the files.
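The prefix check described above can be sketched roughly as follows. This is only an illustration, not the fix mentioned in this thread: the helper name `relativeLinkDestination` is hypothetical, it assumes both paths are absolute and already normalized (no `.` or `..` components), and it compares whole path components rather than raw string prefixes.

```swift
import Foundation

/// Hypothetical helper: expresses `target` relative to the directory that
/// contains `link`. Returns nil when either path is not absolute or when the
/// two paths share no leading component, in which case the caller would keep
/// the absolute destination.
func relativeLinkDestination(forLinkAt link: String, pointingTo target: String) -> String? {
    guard link.hasPrefix("/"), target.hasPrefix("/") else { return nil }

    // Components of the link's parent directory, and all components of the target.
    let linkDir = Array(link.split(separator: "/").dropLast())
    let targetParts = target.split(separator: "/")

    // Count the shared leading components.
    var shared = 0
    while shared < linkDir.count, shared < targetParts.count,
          linkDir[shared] == targetParts[shared] {
        shared += 1
    }
    guard shared > 0 else { return nil }

    // Climb out of the unshared part of the link's directory, then descend to the target.
    let ups = Array(repeating: "..", count: linkDir.count - shared)
    let downs = targetParts[shared...].map(String.init)
    let parts = ups + downs
    return parts.isEmpty ? "." : parts.joined(separator: "/")
}

// The paths from the listing above.
let rewritten = relativeLinkDestination(
    forLinkAt: "/Users/me/symlink-test/linked.zip",
    pointingTo: "/Users/me/symlink-test/subdirectory/file.zip")
print(rewritten ?? "keep absolute path")
```

For the example in this issue the helper yields `subdirectory/file.zip`, since the link's directory and the target share the `/Users/me/symlink-test` prefix.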

I have a fix coded up, but need to write some tests before I can open a PR.

tsolomko commented 1 year ago

Hi,

I am a bit confused about what kind of issue you are describing here. SWCompression APIs are not aware of such concepts as relative or absolute paths. They just faithfully present the contents of the supplied archive in a convenient form (this includes TarEntryInfo.linkName), and it's up to the user to interpret these values as either kind of path.

I suppose you're talking about the usage of these APIs in swcomp, specifically, this line here. If that's the case, then I should emphasize once again that the swcomp command-line tool is not intended for general use. It serves only two purposes: as a demonstration of how SWCompression APIs can be used (and not how they should be used), and as an "internal" testing facility. To illustrate this, note that "hard" links are completely ignored in swcomp and rejected as an "unknown" entry type.

On the subject of the actual issue, I do not know which behavior (the current one with absolute paths or the proposed one with relative paths) is more correct. I've skimmed through the various TAR-related material listed in the README to refresh my knowledge of the subject and, to my understanding, the absolute/relative path issue is not specified.

I am interested in any other references that may explicitly specify the proper way of handling links, and no, the Mac's Archive Utility's current behavior does not count, as it may change at any point and I do not fully trust them with the correctness of their implementation.

kalkwarf commented 1 year ago

Sorry, yes, I was referring to swcomp as I was using it as a reference while working on my own project.

The archive that started me down this path is at: https://github.com/Homebrew/brew/tarball/master

Looking at the GitHub repo, I can see that the original symlinks were relative: https://github.com/Homebrew/brew/blob/master/Library/Homebrew/test/support/fixtures/bottles/testball_bottle-0.1.x86_64_linux.bottle.tar.gz

but upon extraction with swcomp, they are written as absolute links.

Dumping the TAR's table of contents, I can see the destination was recorded as relative:

lrwxrwxrwx  0 root   root        0 Sep 20 12:36 Homebrew-brew-a6aab4b/Library/Homebrew/test/support/fixtures/bottles/testball_bottle-0.1.aarch64_linux.bottle.tar.gz -> testball_bottle-0.1.yosemite.bottle.tar.gz

While I can't find any documentation that discusses this, it seems like relative is defined in the archive itself. 🤷
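If the archive already records a relative destination, one way to keep it relative is to hand the stored link name to the filesystem unchanged instead of resolving it against the extraction directory first. The following is a minimal sketch using only Foundation. The scratch directory is an assumption for illustration, the `linkName` constant stands in for the value read from the archive, and this is not necessarily how swcomp does it.

```swift
import Foundation

let fm = FileManager.default

// Scratch directory standing in for the extraction destination (an assumption).
let root = fm.temporaryDirectory
    .appendingPathComponent("symlink-demo-\(UUID().uuidString)")
try fm.createDirectory(at: root, withIntermediateDirectories: true)

// Stand-in for the link name stored in the archive, taken from the listing above.
let linkName = "testball_bottle-0.1.yosemite.bottle.tar.gz"
let linkPath = root
    .appendingPathComponent("testball_bottle-0.1.aarch64_linux.bottle.tar.gz")
    .path

// Pass the stored destination through verbatim; nothing is prepended to it.
try fm.createSymbolicLink(atPath: linkPath, withDestinationPath: linkName)

// Read the link back to see what was actually written to disk.
let stored = try fm.destinationOfSymbolicLink(atPath: linkPath)
print(stored)

try fm.removeItem(at: root)
```

A symlink destination is stored as an opaque string, so a relative name written this way keeps working when the containing directory is moved.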

tsolomko commented 1 year ago

After thinking a bit more about this and some experimentation, I am inclined to agree that the current behavior is not ideal. It has been fixed in f191db24948393143f5d62b860475aa708bb02e2.

While working on this I've uncovered a couple more issues with the current TAR implementation:

  1. PAX extended header records whose values contain newline characters were parsed incorrectly, resulting in an error being thrown.
  2. PAX extended header records whose values are not UTF-8 strings were also causing an error to be thrown. The existence of such records is quite surprising, considering that one of the PAX specifications specifically says that "the value field shall be encoded using UTF-8". Apparently, the Apple-supplied TAR implementation on macOS does not respect this and just writes binary data in these records.

All of these were fixed in 4.8.3.
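For context, POSIX defines each PAX extended header record as `"%d %s=%s\n"`, where the leading decimal number is the length of the whole record, including the length field itself and the trailing newline. One way to tolerate both cases listed above is to slice records by that length instead of splitting on newlines, and to keep values as raw bytes. The sketch below illustrates that idea only. `PaxRecord` and `parsePaxRecords` are hypothetical names, and this is not a copy of the SWCompression implementation.

```swift
import Foundation

/// One parsed record. The value stays as raw bytes because it may not be UTF-8.
struct PaxRecord {
    let key: String
    let value: Data
}

/// Hypothetical parser: returns nil if the input is malformed.
func parsePaxRecords(_ bytes: [UInt8]) -> [PaxRecord]? {
    var records: [PaxRecord] = []
    var start = 0
    while start < bytes.count {
        // The decimal length field ends at the first space.
        guard let space = bytes[start...].firstIndex(of: 0x20),
              let lengthText = String(bytes: bytes[start..<space], encoding: .ascii),
              let length = Int(lengthText),
              length > space - start + 1,
              start + length <= bytes.count,
              bytes[start + length - 1] == 0x0A
        else { return nil }

        // Everything between the space and the final newline is "key=value".
        // Newlines inside the value are ordinary bytes here.
        let body = bytes[(space + 1)..<(start + length - 1)]
        guard let equals = body.firstIndex(of: 0x3D),
              let key = String(bytes: body[..<equals], encoding: .utf8)
        else { return nil }

        records.append(PaxRecord(key: key, value: Data(body[(equals + 1)...])))
        start += length
    }
    return records
}

// A record whose value contains a newline: 12 bytes in total.
let sample = Array("12 path=a\nb\n".utf8)
if let records = parsePaxRecords(sample) {
    for record in records {
        // Decode lazily and fall back to a byte count for binary values.
        let text = String(data: record.value, encoding: .utf8)
            ?? "<\(record.value.count) bytes, not UTF-8>"
        print(record.key, text)
    }
}
```

Deferring the UTF-8 decoding to the caller means a binary value in one record does not prevent the rest of the header from being read.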

In addition, I have also discovered that the Apple-supplied TAR implementation on macOS actually reverses the direction of hard links. Basically, if you have a "link" hardlink that links to a "file" then in the resulting TAR archive it will be the other way around, "file" will be a hardlink to a "link"...

It is possible that I am wrong and am missing something here, but generally speaking this is why I do not consider Apple's Archive Utility a reference implementation (GNU Tar behaves the expected way).

P.S. The last two issues I decided to report to Apple as FB11712450 and FB11712441.