Closed edsko closed 10 months ago
Yes, they do break! When a package is created, entries (or headers, I don't know the proper terminology) look like gibberish. In fact, tar
doesn't think it's a proper tar archive at all. I've spent about an hour looking for where my app was doing something wrong, but it turns out that the library has bugs.
Here is an example:
~/Downloads $ tar -xvf foo.tar
/usr/bin/tar: This does not look like a tar archive
/usr/bin/tar: Skipping to next header
/usr/bin/tar: Exiting with failure status due to previous errors
And with help of Emacs I can see:
-rw-r--r-- 0/0 82492644 01 >65 E@0=8 :>@>;O!.flac
where there is something unprintable. This must be fixed ASAP.
@dcoutts, is a PR desirable, or can you fix it yourself?
@mrkkrp this isn't a new problem right? It's never done unicode.
Yes, a comprehensive fix would be welcome, but this isn't easy. It still has to work with arbitrary unix files which are not necessarily unicode.
@dcoutts, I didn't know it's not supposed to work with Unicode. But well, it's 2016 and Unicode is everywhere. There are a lot of countries that use non-Latin scripts, so once you choose to work with tar archives in Haskell and have to deal with a non-Latin script, you have this problem.
Oh, OK. Can you describe why exactly Unicode is so hard? All the tools for ByteString encoding/decoding are available, and UTF-8 is the same as ASCII as long as the text contains no non-ASCII characters.
Also, where should I look if I want to properly fix this? (I now either need to fix it or call an external tar application instead, which is not very pretty.)
Can't we just use utf8-string, for example, and replace some calls to pack/unpack from Data.ByteString.Char8 with calls to fromString/toString from Data.ByteString.UTF8? That should work for file paths without Unicode characters as well as for those with Unicode characters in them. Am I missing something important here?
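To make the suggestion concrete, here is a small sketch (not the library's actual code; it assumes utf8-string as a dependency, and the file name is hypothetical):

```haskell
import qualified Data.ByteString.Char8 as Char8
import qualified Data.ByteString.UTF8  as UTF8  -- from the utf8-string package

main :: IO ()
main = do
  -- a hypothetical non-Latin file name, similar to the one in the report
  let path = "01 Песня.flac"
  -- Char8.pack keeps only the low 8 bits of each Char, so the name is corrupted:
  print (UTF8.toString (Char8.pack path) == path)       -- False
  -- an encode/decode round trip through Data.ByteString.UTF8 is lossless:
  print (UTF8.toString (UTF8.fromString path) == path)  -- True
```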
How do you know the paths are UTF8 encoded, and not something else?
I don't see any problems here. We're talking about FilePath, which is a synonym for String, a list of Chars. Every Char is not a byte but something that can already represent any Unicode value.
Now take UTF-8: it's designed to be backward compatible with ASCII. This means that a ByteString representing a UTF-8-encoded string is the same as a ByteString representing an ASCII string (one byte per character, which is how it currently works, as I understand it). So no regression will happen if we switch; with respect to this limited collection of characters, things will stay exactly the same.
For non-ASCII characters, however, it's not possible to use only one byte per character, so there will be a difference and Unicode paths will be represented by longer ByteStrings. But I don't see any problem here either: just put that sequence of bytes into the string table and extract it afterwards, decoding it as a UTF-8 string.
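The backward-compatibility claim is easy to check; a sketch using utf8-string (the path literal is just for illustration):

```haskell
import qualified Data.ByteString       as BS
import qualified Data.ByteString.Char8 as Char8
import qualified Data.ByteString.UTF8  as UTF8  -- from utf8-string

main :: IO ()
main = do
  -- for a pure ASCII path, the UTF-8 bytes and the Char8 bytes coincide:
  let ascii = "src/Main.hs"
  print (UTF8.fromString ascii == Char8.pack ascii)  -- True
  -- a non-ASCII character needs more than one byte under UTF-8:
  print (BS.length (UTF8.fromString "©"))           -- 2
```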
I see the following problems, however:
Anyway, this change is a must, because otherwise the use of this library is very limited (as the Haskell community's tool, it ends up in applications that are not used only by programmers).
Even if you deal with the Latin alphabet only, there are various characters that can appear in paths, like the quotes “” (note that they are different from "", which cannot appear in paths on Windows even though they are in the ASCII range; “”, on the other hand, can, and they are proper punctuation anyway). There are copyright signs © and a lot of other punctuation outside the ASCII range.
I can imagine you don't use these things in names of source files, but this doesn't mean other (possibly non-technical) people don't put Unicode in names of files, and they may be direct users of some Haskell program that uses this library.
Hmm, TAR officially doesn't support non-ASCII characters. Too bad, but I think I've seen tar archives that contain paths with Unicode in them. Strange; I'll need to read more about workarounds and how it's generally done.
Anyway, since the tar specification explicitly specifies the ASCII range, and UTF-8 and ASCII agree in that range, I think the idea with UTF-8 should be perfectly OK.
I'm waiting for @dcoutts's opinion. Perhaps I should just use a more modern archive format. It's unbelievable that it doesn't support anything but ASCII, what a flaw…
So if it's the specification that's broken, then I suggest we close the issue, because this library implements the specification well. I'll just switch to zip; it will also be more familiar for my non-techy users. Sorry for the prolonged discussion.
I'm not opposed to following whatever convention other tar impls use when it comes to unicode. But note that it isn't a trivial matter of sticking in a few to/fromUTF8 calls (remember that not all unix files are unicode but all windows/osx ones are). See for example https://docs.python.org/2/library/tarfile.html#tar-unicode
I think a good time to tackle this problem is when we add pax support (issue #1). The POSIX pax standard explicitly supports file name encodings, and UTF-8 in particular.
I'm surprised by the discussion here. There is a very simple solution which is unambiguously the right thing to do: use withFilePath from System.Posix.Internals (in base) to encode a FilePath into the OS-specific encoding, and then blast that straight into the tarball. The point is that people expect tar to work like an invocation of the tar program on the filesystem would work, and the convention is that you just preserve the raw encoding of the data directly.
EDIT: OK, I'll retract this. If you followed my suggestion, then if you used tar on Windows, all of the files would be blasted into the tarball using UTF-16 encoding. That will totally do the right thing on Windows (Unicode will be supported properly) and also totally miss the point if you were hoping to pass the tarball on to someone else. Ouch.
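For reference, the "OS-specific encoding" idea above can also be expressed portably through GHC's filesystem encoding; a minimal sketch (the function name is mine, not part of tar):

```haskell
import qualified Data.ByteString as BS
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (getFileSystemEncoding)

-- Encode a FilePath with whatever encoding GHC uses for the filesystem,
-- mirroring what an invocation of the tar program would preserve.
encodeFsPath :: FilePath -> IO BS.ByteString
encodeFsPath fp = do
  enc <- getFileSystemEncoding
  GHC.withCStringLen enc fp BS.packCStringLen
```

On a typical UTF-8 locale this yields the UTF-8 bytes of the path, but the result is whatever the runtime's filesystem encoding happens to be, which is exactly the portability problem discussed in this thread.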
Can't we make this case an error instead of silently accepting? Current behaviour causes problems for users:
https://github.com/haskell/cabal/issues/3758 https://github.com/commercialhaskell/stack/pull/2557
I support erroring. The truncation from Char8.pack is basically never right, IMO.
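A sketch of what erroring could look like (the function name is mine, not the library's API):

```haskell
import Data.Char (isAscii)

-- Reject non-ASCII paths up front instead of silently truncating
-- them with Char8.pack later.
checkAsciiPath :: FilePath -> Either String FilePath
checkAsciiPath fp
  | all isAscii fp = Right fp
  | otherwise      = Left ("path contains non-ASCII characters: " ++ fp)
```

For example, `checkAsciiPath "foo.txt"` is `Right "foo.txt"`, while a path like `"Песня.flac"` produces a `Left`.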
Also, is there an interface for passing tar direct ByteString encodings of the desired file paths? That would at least let end users decide what encoding they want.
What's the status of this? The current implementation is breaking filenames. All file paths should be ByteString (aka RawFilePath). This is a low-level library; if someone wants to add a String or Text interface on top, that's fine.
EDIT: afaics GNU tar specifies:
The name, linkname, magic, uname, and gname are null-terminated character strings. All other fields are zero-filled octal numbers in ASCII.
But this probably isn't portable to macOS and Windows...
EDIT2: I think I'll create a tar-bytestring fork that is specifically targeted at POSIX platforms. At least that fixes half of the problem.
This is what tar-conduit does: https://github.com/snoyberg/tar-conduit/blob/81283887aaa9771c0f2db53cb4e86700da4c2d9e/src/Data/Conduit/Tar/Types.hs#L151
It encodes and decodes as UTF-8. I'd say that's a pretty good bet. For unpacking, we could provide a version that allows setting the encoding... or we make use of something like https://hackage.haskell.org/package/charsetdetect-ae
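In the spirit of what tar-conduit does, a sketch with the text package (the helper names are mine):

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)

-- Encode a path as UTF-8 when packing.
encodePath :: FilePath -> BS.ByteString
encodePath = TE.encodeUtf8 . T.pack

-- Decode leniently when unpacking: invalid bytes become U+FFFD
-- instead of throwing.
decodePath :: BS.ByteString -> FilePath
decodePath = T.unpack . TE.decodeUtf8With lenientDecode
```

The lenient decode matters: archives produced by other tools may carry arbitrary non-UTF-8 bytes, and unpacking should not crash on them.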
I pushed 423e6af, prohibiting non-ASCII file names. At the very least, we should not silently corrupt Unicode data. A strategic solution would be to migrate to PosixPath and leave encoding questions to clients.
A strategic solution would be to migrate to PosixPath and leave encoding questions to clients.
There are some non-trivial parts there, because although the tar spec demands Unix semantics, the library also works on Windows (see toTarPath). Since we currently use the FilePath representation, we don't have to convert the filenames between the platforms (just the separators are changed). With OsPath, it seems we would need a way to convert between PosixPath and WindowsPath. So we kinda have to assume UTF-8 here too, at least on Windows?
Yes, I'd assume UTF-8 on Windows.
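A sketch of that conversion under the UTF-8 assumption, going through a Unicode String via the decodeUtf/encodeUtf functions in the filepath package's System.OsPath modules (separator rewriting is deliberately left out):

```haskell
import qualified System.OsPath.Posix   as Posix
import qualified System.OsPath.Windows as Windows

-- Reinterpret a WindowsPath as a PosixPath: assume the Windows side is
-- well-formed UTF-16 and encode the result as UTF-8. Note that path
-- separators are not rewritten here; a full conversion would also need
-- to map '\\' to '/'.
windowsToPosix :: Windows.WindowsPath -> Maybe Posix.PosixPath
windowsToPosix wp = Windows.decodeUtf wp >>= Posix.encodeUtf
```

The Maybe result surfaces the failure case: a WindowsPath containing unpaired surrogates has no Unicode interpretation, so the conversion cannot succeed.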
Would it be possible for Codec.Archive.Tar.Entry to export the data constructor of TarPath?
I've written something for Stack that works around fromTarPath using BS.Char8.unpack (Stack needs that to be (T.unpack . T.decodeUtf8Lenient)), but the code needs access to the data constructor.
EDIT: In the interim, I've realised I can convert the FilePath back into a ByteString, and start again:
fromTarPath :: TarPath -> FilePath
fromTarPath = T.unpack . T.decodeUtf8Lenient . BS.Char8.pack . Tar.fromTarPath
@mpilgrem I recommend against T.unpack . T.decodeUtf8Lenient . BS.Char8.pack . Tar.fromTarPath: if tar ever learns to support Unicode so that Tar.fromTarPath returns a Unicode-enabled String, then BS.Char8.pack allows a seemingly innocent path, without any dots or slashes, to be converted into something like ../../Windows/System32/Kernel.dll and corrupt your system files.
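A concrete illustration of that warning (the codepoints are chosen for the example):

```haskell
import qualified Data.ByteString.Char8 as Char8

main :: IO ()
main = do
  -- '\x012E' (Į) and '\x012F' (į) look like innocent letters, but
  -- Char8.pack keeps only the low byte: 0x2E is '.' and 0x2F is '/'.
  let sneaky = "\x012E\x012E\x012F\x012E\x012E\x012F"
  print (Char8.pack sneaky)  -- "../../"
```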
@Bodigrim, thanks for the warning. My second attempt below makes use of isUTF8Encoded from the utf8-string package:
fromTarPath :: TarPath -> FilePath
fromTarPath tp =
  if isUTF8Encoded rawFilePath
    then T.unpack $ T.decodeUtf8Lenient $ BS.Char8.pack rawFilePath
    else
      -- A future version of Tar.fromTarPath may itself assume that
      -- 'TarPath' is UTF8-encoded.
      rawFilePath
  where
    rawFilePath = Tar.fromTarPath tp
Unicode filenames should work now, after aa683b0. I switched TarPath to PosixString; since it's not exposed, this is not a breaking change.
Happened to see this; I don't know if it actually matters or not. But StringTable.construct calls ByteString.Char8.pack, which throws away a lot of information. Paths with Unicode characters will probably break?