Closed GoogleCodeExporter closed 9 years ago
What was the locale setting when you created the archive on FreeBSD?
Can you upload a small example of such an archive?
Original comment by kientzle@gmail.com
on 1 Feb 2011 at 4:36
Here two examples attached:
1) tar.gz - created on FreeBSD with charset koi8-r
2) zip - created on Windows with OEM codepage 866
More detail about archives and examples of incorrect charsets.
1) Here steps to create archive on freebsd 8.1
========================================
> setenv | egrep "CHAR|LANG"
LANG=ru_RU.KOI8-R
MM_CHARSET=KOI8-R
> ll tmp
total 6
-rw-r--r-- 1 rs user 6 31 янв 20:30 ПРИВЕТ_upcase.txt
-rw-r--r-- 1 rs user 6 31 янв 20:26 hello.txt
-rw-r--r-- 1 rs user 6 31 янв 20:43 привет_lowcase.txt
> bsdtar cyvf tmp.tar.gz tmp
a tmp
a tmp/привет_lowcase.txt
a tmp/hello.txt
a tmp/ПРИВЕТ_upcase.txt
> bsdtar tf tmp.tar.gz
tmp/
tmp/привет_lowcase.txt
tmp/hello.txt
tmp/ПРИВЕТ_upcase.txt
========================================
bsdtar used on freebsd is from freebsd release.
On windows with manually build of bsdtar (trunk / revision 2953):
========================================
E:\devel\_oss\archivepp>bsdtar tf tmp.tar.gz
tmp/
tmp/?OE?AO_lowcase.txt
tmp/hello.txt
tmp/?oe?ao_upcase.txt
========================================
I don't know about, PaxHeader folder in tar archive, but if it is meta
information then it is incorrect. The files is utf-8 encoded data, and it has
field "path" which has names converted to utf-8 from wrong codepage.
File tmp.tar.gz from this steps is attached as charset_koi8r.tar.gz
2) Zip archive created by next commands:
========================================
>dir tmp
01.02.2011 12:12 <DIR> .
01.02.2011 12:12 <DIR> ..
31.01.2011 20:26 6 hello.txt
31.01.2011 20:43 6 привет_lowcase.txt
31.01.2011 20:30 6 ПРИВЕТ_upcase.txt
>"C:\Program Files\7-Zip\7z.exe" a -r tmp.zip tmp
7-Zip [64] 9.13 beta Copyright (c) 1999-2010 Igor Pavlov 2010-04-15
Scanning
Creating archive tmp.zip
Compressing tmp\hello.txt
Compressing tmp\привет_lowcase.txt
Compressing tmp\ПРИВЕТ_upcase.txt
Everything is Ok
>"C:\Program Files\7-Zip\7z.exe" l tmp.zip
7-Zip [64] 9.13 beta Copyright (c) 1999-2010 Igor Pavlov 2010-04-15
Listing archive: tmp.zip
--
Path = tmp.zip
Type = zip
Physical Size = 608
Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2011-02-01 12:12:25 D.... 0 0 tmp
2011-01-31 20:26:54 ....A 6 6 tmp\hello.txt
2011-01-31 20:43:21 ....A 6 6
tmp\привет_lowcase.txt
2011-01-31 20:30:21 ....A 6 6 tmp\ПРИВЕТ_upcase.txt
------------------- ----- ------------ ------------ ------------------------
18 18 3 files, 1 folders
>bsdtar tf tmp.zip
tmp/
tmp/hello.txt
tmp/ЇаЁў?в_lowcase.txt
tmp/??\210':'_upcase.txt
========================================
The tmp.zip file from this example is attached as charset_cp866.zip.
File created from GUI (not command line) and by WinRar and internally by
Explorer (as compressed ZIP folder) are use codepage 866 too.
As You see, bsdtar use codepage 1251 (ACP, not OEMCP) and print wrong file
listing.
Original comment by repa...@gmail.com
on 1 Feb 2011 at 9:29
Attachments:
BTW, in issue 33 (about C++ wrapper) exists link to objective-C code. In this
code author use narrow char version of functions (for example
archive_entry_pathname), and then tie it with charset defined from outside.
Original comment by repa...@gmail.com
on 1 Feb 2011 at 9:35
Addition about PaxHeader folder in tar archive.
On FreeBSD mbstowcs/wctomb does not use current charset defined in locale to
make conversion between multibyte and wchar. The right way is to use iconv (or
something like this).
So, as minimum I think it is need to modify archive_strappend_w_mbs and
archive_wstrcpy_mbs (which make conversion) to use iconv instead of standard
functions.
As maximum i think it is need API to define functions for conversion to/from
unicode for each "archive *".
Also it can help solve problem of archives comes from different OS with
different codepages. For example with archives from unix which opened on
windows.
Original comment by repa...@gmail.com
on 7 Feb 2011 at 12:52
I have done a few experiments and I agree that we should be using iconv in
archive_string.c.
This is a big change, though, so I want to explain the implications in case I
miss something.
First, I do not want to call iconv_open every time we want to convert one small
string, so the iconv conversion descriptor needs to be stored somewhere and
reused. It cannot be stored globally without breaking threaded programs. So I
think it should be stored in "struct archive". This is good---it means we can
add an interface later to select the codepage for each archive *. But it
requires a struct archive to be passed into every place where a string
conversion might happen.
Most of those are in archive_entry. So I think we have to change
archive_entry_new() to be:
struct archive_entry *archive_entry_new(struct archive *);
This is a big change.
Until now, archive_entry objects have existed separately from archive objects;
they could be used completely on their own. With this change, every
archive_entry object must be created in the context of a particular archive
object. The case that I'm not sure about: What should happen when an entry
is copied between archives with different codepages? (For example, when
copying from a tar archive to an archive_write_disk handle?)
Original comment by kientzle@gmail.com
on 5 Mar 2011 at 6:11
The following is my current draft for doing Unicode -> MBS conversions using
iconv.
A few details:
* If a NULL archive object is passed in, a default iconv object is opened and closed for each call.
* If a non-NULL archive object is passed in, the iconv object will be cached there for re-use.
* If a non-NULL archive object is passed in, the default MBS character encoding will be read from there.
I think this provides all of the use cases we would like to support. The first
is suitable for "detached" archive_entry objects, the last will allow us to
select character encodings on a per-archive basis.
/*
* Translates a wide character string into current locale character set
* and appends to the archive_string. Note: returns NULL if conversion
* fails.
*
* This version uses the POSIX iconv() function.
*/
struct archive_string *
archive_strappend_w_mbs(struct archive *a, struct archive_string *as, const
wchar_t *w)
{
const char *p;
char buff[256];
size_t remaining;
/* XXX Determine the actual name to use here instead of "UNICODE" XXX */
/* For GNU iconv, we probably want UCS-2-INTERNAL if wchar_t is 16 bits
* and UCS-4-INTERNAL if wchar_t is 32 bits. */
if (a == NULL)
cd = iconv_open("", "UNICODE");
else if (a->unicode_to_current == (iconv_t)(0))
cd = iconv_open(a->current_code, "UNICODE");
else {
/* Use the cached conversion descriptor after resetting it. */
cd = a->unicode_to_current;
iconv(cd, NULL, NULL, NULL, NULL);
}
/* XXX if (cd == (iconv_t)(-1)) { handle error } XXX */
p = (const char *)w;
remaining = wcslen(w) * sizeof(wchar_t);
for (;;) {
size_t avail = sizeof(buff);
char *outp = buff;
size_t result = iconv(cd, &p, &remaining, &outp, &avail);
if (avail < sizeof(buff))
archive_strncat(as, buff, sizeof(buff) - avail);
if (result >= 0) {
break; /* Conversion completed. */
} else if (errno == EILSEQ) {
/* Skip the illegal input wchar. */
archive_append_char(as, '?');
p += sizeof(wchar_t);
remaining -= sizeof(wchar_t);
} else if (errno == E2BIG) {
/* Flush the output and do some more. */
continue;
} else if (errno == EINVAL) {
/* Final character is invalid. */
archive_append_char(as, '?');
break;
}
}
/* Dispose of the conversion descriptor or cache it. */
if (a == NULL)
iconv_close(cd);
else if (a->unicode_to_current == (iconv_t)(0))
a->unicode_to_current = cd;
return (as);
}
Original comment by kientzle@gmail.com
on 5 Mar 2011 at 8:25
libarchive/trunk r2992 and r2993 change the charset conversion routines over to
use iconv() when it's available.
This should fix the problem you saw on FreeBSD with KOI8-R filenames being
incorrectly stored in pax archives (though I've so far only tested these
changes on Mac OS). I think I found a way to avoid breaking source
compatibility; archive_entry_new() is unchanged and I've added
archive_entry_new2(struct archive *) to create entries that are bound to a
particular archive handle. Internally, the iconv() conversion descriptors are
stored in the archive handle to be reused for performance, and I've outlined
support for allowing the character encoding to be set on a per-archive-handle
basis.
There's still some code cleanup and testing that needs to be done; please let
me know if you see any further problems.
With this change, bsdtar should correctly store KOI8-R filenames from FreeBSD.
After we've finished testing that, I'll take a closer look at extraction on
Windows.
Original comment by kientzle@gmail.com
on 7 Mar 2011 at 6:04
Build on FreeBSD8 with ports/convert/libiconv.
Build error at using iconv() by following message:
libarchive/archive_string.c: In function 'archive_wstring_append_from_mbs':
libarchive/archive_string.c:532: warning: passing argument 2 of 'libiconv' from
incompatible pointer type
libarchive/archive_string.c: In function 'archive_string_append_from_unicode_to_
mbs':
libarchive/archive_string.c:667: warning: passing argument 2 of 'libiconv' from
incompatible pointer type
There are different definitions for iconv() function; one is using 'const' to
the second argument and other is not.
As far I know, Solaris and NetBSD use 'const' but many Linux distro using GNU
libiconv
do not.
Original comment by ggcueroad@gmail.com
on 9 Mar 2011 at 4:48
Tim, your changes cannot work with eucJP when using archvie_entry_pathname_w()
on FreeBSD because of issue 66.
setlocale(LC_ALL, "ja_JP.eucJP");
archive_entry_set_pathname(e, "表.txt");// "\xC9\xBD.txt"
then archive_entry_pathname_w(e) returns
Dump of archive_entry_pathname_w(e)
0000_68_88_00 00 2e 00 00 00 74 00 00 00 78 00 00 00 h.......t...x...
0010 74 00 00 00 t...
but result of performing mbstowcs(ws, "表.txt") is
Dump of ws
0000_bd_c9_00 00 2e 00 00 00 74 00 00 00 78 00 00 00 ........t...x...
0010 74 00 00 00 t...
An attached patch is to test eucJP in test_pax_filename_encoding_freebsd.c.
Original comment by ggcueroad@gmail.com
on 10 Mar 2011 at 3:17
Attachments:
I have always intended that archive_entry_pathname_w() return Unicode.
The use of mbstowcs() was wrong, and causes bsdtar to generate incorrect
pax-format archives; POSIX specifies that the filenames in pax archives are
stored in Unicode.
The archive_entry_pathname_w() here is correct; 表 is \u8868
Original comment by kientzle@gmail.com
on 10 Mar 2011 at 4:26
r3007 adds more checks to CMakeLists.txt.
This should now work properly on FreeBSD (and other platforms with POSIX iconv)
as well as on platforms with non-POSIX iconv() and which require libiconv.
Still missing autoconf machinery...
Original comment by kientzle@gmail.com
on 11 Mar 2011 at 6:26
Must the applications using libarchive treat wchar_t as Unicode even if
__STDC_ISO_10646__
is not defined ?
On application side, to get wchar_t string is normally use of mbstowcs() or
related functions.
What will be happen when using archive_entry_copy_pathname_w() in ja_JP.eucJP
locale ?
On FreeBSD the archive_entry_pathname_w() with iconv returns Unicode as you say
but the function without iconv will return non Unicode string made with
wcrtomb/wctomb,
that is quite different behavior.
And I have some concern about UTF-8-MAC, which is Normalization Form D(NFD).
Some platforms, which are X Window system, Windows XP or older Windows
platform, cannot
correctly display UTF-8-MAC strings because they expect that UTF-8 is
Normalization Form C (NFC) not NDF.
I think it is better that the way to get UTF-8 is converted from MBS with iconv
than
converted from wchar_t string in assuming wchar_t as Unicode.
Original comment by ggcueroad@gmail.com
on 11 Mar 2011 at 3:32
"I think it is better that the way to get UTF-8 is converted from MBS with
iconv..."
Yes, I completely agree. I want to make this change.
"What will be happen when using archive_entry_copy_pathname_w() in ja_JP.eucJP
locale ?"
How do people use this? This is needed on Windows to use the Unicode file
APIs, but I don't know about other systems.
Right now, we're using these internally for Pax, Joliet, and CAB filenames, but
those are all Unicode, so our Pax, Joliet, and CAB support is all somewhat
broken right now.
We could add archive_entry_copy_pathname_unicode() and leave the _w() forms as
locale-dependent wchar_t. But I don't know if the non-Unicode _w forms are
useful; we can't really test them, because the results are both locale- and
system-dependent.
Original comment by kientzle@gmail.com
on 11 Mar 2011 at 4:44
Here's another way to think about this:
The reason for libarchive to support character-set conversions is to properly
support the various archive formats.
Pax uses UTF-8 for filenames
Joliet (and CAB?) use Unicode (technically UTF-16)
Almost everything else uses locale-dependent MBS.
So we must support those three. It's not clear to me why we need to support
locale-dependent WCS. mbstowcs() is readily available for people who need to
work with WCS.
I'm happy to rename the existing _w functions to _unicode if that makes things
more clear.
Original comment by kientzle@gmail.com
on 11 Mar 2011 at 5:09
Some people use UTF-8 but other people still use eucJP, who say that the
applications,
which they usually use, do not work properly in UTF-8. They are not only
FreeBSD users,
NetBSD and Solaris users choose eucJP by the same reason or from force habit.
About filenames in Japan. There are many character-set.
For example, Windows platform uses CP932 aka MS-SJIS,
Linux platforms use UTF-8, MAC OS X uses UTF-8-MAC though locale name is UTF-8,
other POSIX platforms can use eucJP, UTF-8 and SJIS.
And thus I would like to support 'charset' option to specify a filename
character-set
as "bsdtar -tvf file.cab --options cab:charset=CP932".
CAB can store filenames with UTF-8 but UTF-8 is not used, so CAB usually uses
local-dependent MBS for stored filenames. It means POSIX platforms may be not
able to
display the filenames even if the platforms use UTF-8 locale. This is an
another reason
why 'charset' option is needed. I think it is useful that at least CAB, LHA and
ZIP
format reader would support the option.
I think we should worry about what WCS format is when UTF-8 is completely
converted to/from
MBS with iconv; WCS is no longer needed to get UTF-8. Therefore, I do not think
that renaming _w
functions is needed.
Original comment by ggcueroad@gmail.com
on 11 Mar 2011 at 7:28
"... I would like to support 'charset' option to specify a filename
character-set
as "bsdtar -tvf file.cab --options cab:charset=CP932""
The original bug report suggested something like this, and I want to understand
what this means.
The point of specifying the archive encoding for "bsdtar -x" is so that the
filename can be converted from that character set to the character set used for
filenames. So if you're using MS-SJIS for filenames and the archive stores
filenames in CP932, we need to translate those filenames from CP932 to MS-SJIS
so we can write them to disk. Is this right?
So it sounds like we should be storing a character encoding with every
filename. (The code I have just committed does this by storing a character
encoding for the archive; is that sufficient?) We then need a way to convert
those filenames to the character encoding used by disk files.
So we really need something like:
/* The archive is storing filenames in CP932 */
archive_set_charset(a, "CP932);
/* Get an entry from archive_read_next_header() */
/* Now we want to open the file using MS-SJIS, so... */
filename = archive_entry_pathname_in_charset(entry, "MS-SJIS")
If you needed UTF-8, you could ask for that here.
Original comment by kientzle@gmail.com
on 12 Mar 2011 at 7:33
MS-SJIS is Microsoft SJIS, an alias of CP932, which is extended by Microsoft
and OEM vendors of
MS-DOS/Windows, from SJIS(Shift_JIS character-set).
We need to translate filenames from CP932 stored on Windows platform to UTF-8
or eucJP
,which most of POSIX like system use, or, as I mentioned filenames on MAC OS X,
translate
filenames from UTF-8-MAC(NFD) to UTF-8(NFC) or other character-set such as
eucJP.
> /* The archive is storing filenames in CP932 */
> archive_set_charset(a, "CP932);
This is exactly the API I need!
> /* Now we want to open the file using MS-SJIS, so... */
> filename = archive_entry_pathname_in_charset(entry, "MS-SJIS")
This would be useful.
For an ISO Writer, to get UTF-16 characters, do you think it can be like this:
filename_utf16 = (uint16_t *)archive_entry_pathname_in_charset(entry, "UTF-16LE") ?
Since the writer has to store filenames with multiple-character-set into an
ISO-image.
I hope that all of converting character-set code in format-readers and writers
would be removed.
Original comment by ggcueroad@gmail.com
on 13 Mar 2011 at 2:22
Since libarchive now has full iconv integration for handling filenames in
various character sets, I believe this issue is now Fixed.
If you think otherwise, let me know what you think is still missing.
Original comment by kientzle@gmail.com
on 26 Nov 2011 at 9:58
Original issue reported on code.google.com by
repa...@gmail.com
on 31 Jan 2011 at 6:08