Closed edsko closed 10 months ago
Yes, they do break! When a package is created, entries (or headers, I don't know the proper terminology) look like gibberish. In fact, tar
doesn't think it's a proper tar archive at all. I've spent about an hour looking for where my app was doing something wrong, but it turns out that the library has bugs.
Here is an example:
~/Downloads $ tar -xvf foo.tar
/usr/bin/tar: This does not look like a tar archive
/usr/bin/tar: Skipping to next header
/usr/bin/tar: Exiting with failure status due to previous errors
And with help of Emacs I can see:
-rw-r--r-- 0/0 82492644 01 >65 E@0=8 :>@>;O!.flac
where there is something unprintable. This must be fixed ASAP.
@dcoutts, is a PR desirable, or can you fix it yourself?
@mrkkrp this isn't a new problem right? It's never done unicode.
Yes, a comprehensive fix would be welcome, but this isn't easy. It still has to work with arbitrary unix files which are not necessarily unicode.
@dcoutts, I didn't know it's not supposed to work with Unicode. But well, it's 2016 and Unicode is everywhere. There are a lot of countries that use non-Latin scripts, so once you choose to work with tar archives in Haskell and have to deal with a non-Latin script, you have this problem.
Oh, OK. Can you describe why exactly Unicode is so hard? All the tools for ByteString encoding/decoding are available, and UTF-8 is the same as ASCII as long as the text contains no non-ASCII characters.
Also, where should I look if I want to properly fix this? (I now either need to fix it or call an external tar application instead, which is not very pretty.)
Can't we just use utf8-string, for example, and replace some calls to pack/unpack from Data.ByteString.Char8 with calls to fromString/toString from Data.ByteString.UTF8? That should work for file paths without Unicode characters as well as for those with Unicode characters in them. Am I missing something important here?
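To make the suggestion concrete, here is a small sketch (not the library's actual code; it assumes utf8-string as a dependency, and the file name is hypothetical):

```haskell
import qualified Data.ByteString.Char8 as Char8
import qualified Data.ByteString.UTF8  as UTF8  -- from the utf8-string package

main :: IO ()
main = do
  -- a hypothetical non-Latin file name, similar to the one in the report
  let path = "01 Песня.flac"
  -- Char8.pack keeps only the low 8 bits of each Char, so the name is corrupted:
  print (UTF8.toString (Char8.pack path) == path)       -- False
  -- an encode/decode round trip through Data.ByteString.UTF8 is lossless:
  print (UTF8.toString (UTF8.fromString path) == path)  -- True
```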
How do you know the paths are UTF8 encoded, and not something else?
I don't see any problems here. We're talking about FilePath, which is a synonym for String, a list of Chars. Every Char is not a byte but something that can already represent any Unicode value.
Now take UTF-8: it's designed to be backward compatible with ASCII. This means that a ByteString representing a UTF-8-encoded string is the same as a ByteString representing an ASCII string (one byte per character, which is how it currently works, as I understand it). So no regression will happen if we switch; with respect to this limited collection of characters, things will stay exactly the same.
For non-ASCII characters, however, it's not possible to use only one byte per character, so there will be a difference and Unicode paths will be represented by longer ByteStrings. But I don't see any problem here either: just put that sequence of bytes into the string table and extract it afterwards, decoding it as a UTF-8 string.
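The backward-compatibility claim is easy to check; a sketch using utf8-string (the path literal is just for illustration):

```haskell
import qualified Data.ByteString       as BS
import qualified Data.ByteString.Char8 as Char8
import qualified Data.ByteString.UTF8  as UTF8  -- from utf8-string

main :: IO ()
main = do
  -- for a pure ASCII path, the UTF-8 bytes and the Char8 bytes coincide:
  let ascii = "src/Main.hs"
  print (UTF8.fromString ascii == Char8.pack ascii)  -- True
  -- a non-ASCII character needs more than one byte under UTF-8:
  print (BS.length (UTF8.fromString "©"))           -- 2
```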
I see the following problems, however:
Anyway, this change is a must, because otherwise the use of this library is very limited (as the Haskell community's tool, it ends up in applications that are not used only by programmers).
Even if you deal with the Latin alphabet only, there are various characters that can appear in paths, like the quotes “” (note that they are different from "", which cannot appear in paths on Windows even though they are in the ASCII range; “”, on the other hand, can, and they are proper punctuation anyway). There are copyright signs © and a lot of other punctuation outside the ASCII range.
I can imagine you don't use these things in names of source files, but this doesn't mean other (possibly non-technical) people don't put Unicode in names of files, and they may be direct users of some Haskell program that uses this library.
Hmm, TAR officially doesn't support non-ASCII characters. Too bad, but I think I've seen tar archives that contain paths with Unicode in them. Strange; I'll need to read more about workarounds and how it's generally done.
Anyway, since the tar specification explicitly specifies the ASCII range, and UTF-8 and ASCII agree in that range, I think the idea with UTF-8 should be perfectly OK.
I'm waiting for @dcoutts's opinion. Perhaps I should just use a more modern archive format. It's unbelievable that it doesn't support anything but ASCII, what a flaw…
So if it's the specification that's broken, then I suggest we close the issue, because this library implements the specification well. I'll just switch to zip; it will also be more familiar for my non-techy users. Sorry for the prolonged discussion.
I'm not opposed to following whatever convention other tar impls use when it comes to unicode. But note that it isn't a trivial matter of sticking in a few to/fromUTF8 calls (remember that not all unix files are unicode but all windows/osx ones are). See for example https://docs.python.org/2/library/tarfile.html#tar-unicode
I think a good time to tackle this problem is when we add pax support (issue #1). The POSIX pax standard explicitly supports file name encodings, and UTF-8 in particular.
I'm surprised by the discussion here. There is a very simple solution which is unambiguously the right thing to do: use withFilePath from System.Posix.Internals (in base) to encode a FilePath into the OS-specific encoding, and then blast that straight into the tarball. The point is that people expect tar to work like an invocation of the tar program on the filesystem would work, and the convention is that you just preserve the raw encoding of the data directly.
EDIT: OK, I'll retract this. If you followed my suggestion, then if you used tar on Windows, all of the files would be blasted into the tarball using UTF-16 encoding. That will totally do the right thing on Windows (Unicode will be supported properly) and also totally miss the point if you were hoping to pass the tarball on to someone else. Ouch.
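For reference, the "OS-specific encoding" idea above can also be expressed portably through GHC's filesystem encoding; a minimal sketch (the function name is mine, not part of tar):

```haskell
import qualified Data.ByteString as BS
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (getFileSystemEncoding)

-- Encode a FilePath with whatever encoding GHC uses for the filesystem,
-- mirroring what an invocation of the tar program would preserve.
encodeFsPath :: FilePath -> IO BS.ByteString
encodeFsPath fp = do
  enc <- getFileSystemEncoding
  GHC.withCStringLen enc fp BS.packCStringLen
```

On a typical UTF-8 locale this yields the UTF-8 bytes of the path, but the result is whatever the runtime's filesystem encoding happens to be, which is exactly the portability problem discussed in this thread.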
Can't we make this case an error instead of silently accepting? Current behaviour causes problems for users:
https://github.com/haskell/cabal/issues/3758 https://github.com/commercialhaskell/stack/pull/2557
I support erroring. The truncation from Char8.pack is basically never right, IMO.
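A sketch of what erroring could look like (the function name is mine, not the library's API):

```haskell
import Data.Char (isAscii)

-- Reject non-ASCII paths up front instead of silently truncating
-- them with Char8.pack later.
checkAsciiPath :: FilePath -> Either String FilePath
checkAsciiPath fp
  | all isAscii fp = Right fp
  | otherwise      = Left ("path contains non-ASCII characters: " ++ fp)
```

For example, `checkAsciiPath "foo.txt"` is `Right "foo.txt"`, while a path like `"Песня.flac"` produces a `Left`.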
Also, is there an interface for passing tar direct ByteString encodings of the desired file paths? That would at least let end users decide what encoding they want.
What's the status of this? The current implementation is breaking filenames. All file paths should be ByteString (aka RawFilePath). This is a low-level library; if someone wants to add a String or Text interface on top, that's fine.
EDIT: afaics GNU tar specifies:
The name, linkname, magic, uname, and gname are null-terminated character strings. All other fields are zero-filled octal numbers in ASCII.
But this probably isn't portable to macOS and Windows...
EDIT2: I think I'll create a tar-bytestring fork that is specifically targeted at POSIX platforms. At least that fixes half of the problem.
This is what tar-conduit does: https://github.com/snoyberg/tar-conduit/blob/81283887aaa9771c0f2db53cb4e86700da4c2d9e/src/Data/Conduit/Tar/Types.hs#L151
It encodes and decodes as UTF-8. I'd say that's a pretty good bet. For unpacking, we could provide a version that allows setting the encoding... or we make use of something like https://hackage.haskell.org/package/charsetdetect-ae
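In the spirit of what tar-conduit does, a sketch with the text package (the helper names are mine):

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)

-- Encode a path as UTF-8 when packing.
encodePath :: FilePath -> BS.ByteString
encodePath = TE.encodeUtf8 . T.pack

-- Decode leniently when unpacking: invalid bytes become U+FFFD
-- instead of throwing.
decodePath :: BS.ByteString -> FilePath
decodePath = T.unpack . TE.decodeUtf8With lenientDecode
```

The lenient decode matters: archives produced by other tools may carry arbitrary non-UTF-8 bytes, and unpacking should not crash on them.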
I pushed 423e6af, prohibiting non-ASCII file names. At the very least, we should not silently corrupt Unicode data. A strategic solution would be to migrate to PosixPath and leave encoding questions to clients.
A strategic solution would be to migrate to PosixPath and leave encoding questions to clients.
There are some non-trivial parts there, because although the tar spec demands Unix semantics, the library also works on Windows (see toTarPath). Since we currently use the FilePath representation, we don't have to convert the filenames between the platforms (just the separators are changed). With OsPath, it seems we would need a way to convert between PosixPath and WindowsPath. So we kinda have to assume UTF-8 here too, at least on Windows?
Yes, I'd assume UTF-8 on Windows.
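A sketch of that conversion under the UTF-8 assumption, going through a Unicode String via the decodeUtf/encodeUtf functions in the filepath package's System.OsPath modules (separator rewriting is deliberately left out):

```haskell
import qualified System.OsPath.Posix   as Posix
import qualified System.OsPath.Windows as Windows

-- Reinterpret a WindowsPath as a PosixPath: assume the Windows side is
-- well-formed UTF-16 and encode the result as UTF-8. Note that path
-- separators are not rewritten here; a full conversion would also need
-- to map '\\' to '/'.
windowsToPosix :: Windows.WindowsPath -> Maybe Posix.PosixPath
windowsToPosix wp = Windows.decodeUtf wp >>= Posix.encodeUtf
```

The Maybe result surfaces the failure case: a WindowsPath containing unpaired surrogates has no Unicode interpretation, so the conversion cannot succeed.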
Would it be possible for Codec.Archive.Tar.Entry to export the data constructor of TarPath?
I've written something for Stack that works around fromTarPath using BS.Char8.unpack (Stack needs that to be (T.unpack . T.decodeUtf8Lenient)), but the code needs access to the data constructor.
EDIT: In the interim, I've realised I can convert the FilePath back into a ByteString, and start again:
fromTarPath :: TarPath -> FilePath
fromTarPath = T.unpack . T.decodeUtf8Lenient . BS.Char8.pack . Tar.fromTarPath
@mpilgrem I recommend against T.unpack . T.decodeUtf8Lenient . BS.Char8.pack . Tar.fromTarPath: if tar ever learns to support Unicode so that Tar.fromTarPath returns a Unicode-enabled String, then BS.Char8.pack allows a seemingly innocent path, without any dots or slashes, to be converted into something like ../../Windows/System32/Kernel.dll and corrupt your system files.
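A concrete illustration of that warning (the codepoints are chosen for the example):

```haskell
import qualified Data.ByteString.Char8 as Char8

main :: IO ()
main = do
  -- '\x012E' (Į) and '\x012F' (į) look like innocent letters, but
  -- Char8.pack keeps only the low byte: 0x2E is '.' and 0x2F is '/'.
  let sneaky = "\x012E\x012E\x012F\x012E\x012E\x012F"
  print (Char8.pack sneaky)  -- "../../"
```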
@Bodigrim, thanks for the warning. My second attempt below makes use of isUTF8Encoded from the utf8-string package:
fromTarPath :: TarPath -> FilePath
fromTarPath tp =
  if isUTF8Encoded rawFilePath
    then T.unpack $ T.decodeUtf8Lenient $ BS.Char8.pack rawFilePath
    else
      -- A future version of Tar.fromTarPath may itself assume that
      -- 'TarPath' is UTF8-encoded.
      rawFilePath
  where
    rawFilePath = Tar.fromTarPath tp
Unicode filenames should work now, after aa683b0. I switched TarPath to PosixString; since it's not exposed, this is not a breaking change.
Happened to see this; I don't know if it actually matters or not. But StringTable.construct calls ByteString.Char8.pack, which throws away a lot of information. Paths with Unicode characters will probably break?