pombreda / libarchive

Automatically exported from code.google.com/p/libarchive
Other
0 stars 0 forks source link

Wrong locale defaults for windows #132

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What version are you using?
trunk / revision 2953

On what operating system?
Windows 7

How did you build?  (cmake, configure, or pre-packaged binary)
cmake

What compiler or development environment (please include version)?
Visual Studio 2010

I have two comments.

1)  In line 464 of archive_string.c (
http://code.google.com/p/libarchive/source/browse/trunk/libarchive/archive_strin
g.c#464 ) used ACP code page, but ZIP and TAR (and may be other) archives 
created in Windows with using CP_OEMCP (OEM) character set for filenames (at 
least for russian it is true - ACP defines codepage 1251, but in archive names 
are in 866 codepage).

2) It is incorrect idea to use _system_ default locale to convert 
mbstring<->wcstring. Because if I have archive with russian filenames from 
FreeBSD it filenames is in koi8-r, and if I can't define charset for archive - 
I can't get proper names. 
The other problem - if i build libarchive as dll, then dll have it's own locale 
and I can't change it from program at all.

So I think it is need mechanism to change locale for string conversion for 
library (as minimum) or for archive (optimal).

At this time no one archiver that I tested (7zip, WinRar, bsdtar from 
libarchive) not extracted russian names correct from tar.bz2 archive created on 
FreeBSD 8.1.

Original issue reported on code.google.com by repa...@gmail.com on 31 Jan 2011 at 6:08

GoogleCodeExporter commented 9 years ago
What was the locale setting when you created the archive on FreeBSD?

Can you upload a small example of such an archive?

Original comment by kientzle@gmail.com on 1 Feb 2011 at 4:36

GoogleCodeExporter commented 9 years ago
Here two examples attached:
1) tar.gz - created on FreeBSD with charset koi8-r
2) zip - created on Windows with OEM codepage 866

More detail about archives and examples of incorrect charsets.

1) Here steps to create archive on freebsd 8.1
========================================
> setenv | egrep "CHAR|LANG"
LANG=ru_RU.KOI8-R
MM_CHARSET=KOI8-R

> ll tmp
total 6
-rw-r--r--  1 rs  user  6 31 янв 20:30 ПРИВЕТ_upcase.txt
-rw-r--r--  1 rs  user  6 31 янв 20:26 hello.txt
-rw-r--r--  1 rs  user  6 31 янв 20:43 привет_lowcase.txt

> bsdtar cyvf tmp.tar.gz tmp
a tmp
a tmp/привет_lowcase.txt
a tmp/hello.txt
a tmp/ПРИВЕТ_upcase.txt

> bsdtar tf tmp.tar.gz
tmp/
tmp/привет_lowcase.txt
tmp/hello.txt
tmp/ПРИВЕТ_upcase.txt

========================================
bsdtar used on freebsd is from freebsd release.

On windows with manually build of bsdtar (trunk / revision 2953):
========================================
E:\devel\_oss\archivepp>bsdtar tf tmp.tar.gz
tmp/
tmp/?OE?AO_lowcase.txt
tmp/hello.txt
tmp/?oe?ao_upcase.txt
========================================
I don't know about, PaxHeader folder in tar archive, but if it is meta 
information then it is incorrect. The files is utf-8 encoded data, and it has 
field "path" which has names converted to utf-8 from wrong codepage.

File tmp.tar.gz from this steps is attached as charset_koi8r.tar.gz

2) Zip archive created by next commands:
========================================
>dir tmp
01.02.2011  12:12    <DIR>          .
01.02.2011  12:12    <DIR>          ..
31.01.2011  20:26                 6 hello.txt
31.01.2011  20:43                 6 привет_lowcase.txt
31.01.2011  20:30                 6 ПРИВЕТ_upcase.txt

>"C:\Program Files\7-Zip\7z.exe" a -r tmp.zip tmp

7-Zip [64] 9.13 beta  Copyright (c) 1999-2010 Igor Pavlov  2010-04-15
Scanning

Creating archive tmp.zip

Compressing  tmp\hello.txt
Compressing  tmp\привет_lowcase.txt
Compressing  tmp\ПРИВЕТ_upcase.txt

Everything is Ok

>"C:\Program Files\7-Zip\7z.exe" l tmp.zip

7-Zip [64] 9.13 beta  Copyright (c) 1999-2010 Igor Pavlov  2010-04-15

Listing archive: tmp.zip

--
Path = tmp.zip
Type = zip
Physical Size = 608

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2011-02-01 12:12:25 D....            0            0  tmp
2011-01-31 20:26:54 ....A            6            6  tmp\hello.txt
2011-01-31 20:43:21 ....A            6            6  
tmp\привет_lowcase.txt
2011-01-31 20:30:21 ....A            6            6  tmp\ПРИВЕТ_upcase.txt
------------------- ----- ------------ ------------  ------------------------
                                    18           18  3 files, 1 folders

>bsdtar tf tmp.zip
tmp/
tmp/hello.txt
tmp/ЇаЁў?в_lowcase.txt
tmp/??\210':'_upcase.txt

========================================

The tmp.zip file from this example is attached as charset_cp866.zip.
File created from GUI (not command line) and by WinRar and internally by 
Explorer (as compressed ZIP folder) are use codepage 866 too.

As You see, bsdtar use codepage 1251 (ACP, not OEMCP) and print wrong file 
listing.

Original comment by repa...@gmail.com on 1 Feb 2011 at 9:29

Attachments:

GoogleCodeExporter commented 9 years ago
BTW, in issue 33 (about C++ wrapper) exists link to objective-C code. In this 
code author use narrow char version of functions (for example 
archive_entry_pathname), and then tie it with charset defined from outside.

Original comment by repa...@gmail.com on 1 Feb 2011 at 9:35

GoogleCodeExporter commented 9 years ago
Addition about PaxHeader folder in tar archive.

On FreeBSD mbstowcs/wctomb does not use current charset defined in locale to 
make conversion between multibyte and wchar. The right way is to use iconv (or 
something like this). 

So, as minimum I think it is need to modify archive_strappend_w_mbs and 
archive_wstrcpy_mbs (which make conversion) to use iconv instead of standard 
functions.
As maximum i think it is need API to define functions for conversion to/from 
unicode for each "archive *".
Also it can help solve problem of archives comes from different OS with 
different  codepages. For example with archives from unix which opened on 
windows.

Original comment by repa...@gmail.com on 7 Feb 2011 at 12:52

GoogleCodeExporter commented 9 years ago
I have done a few experiments and I agree that we should be using iconv in 
archive_string.c.

This is a big change, though, so I want to explain the implications in case I 
miss something.

First, I do not want to call iconv_open every time we want to convert one small 
string, so the iconv conversion descriptor needs to be stored somewhere and 
reused.  It cannot be stored globally without breaking threaded programs.  So I 
think it should be stored in "struct archive".  This is good---it means we can 
add an interface later to select the codepage for each archive *.  But it 
requires a struct archive to be passed into every place where a string 
conversion might happen.

Most of those are in archive_entry.  So I think we have to change 
archive_entry_new() to be:

 struct archive_entry *archive_entry_new(struct archive *);

This is a big change.

Until now, archive_entry objects have existed separately from archive objects; 
they could be used completely on their own.  With this change, every 
archive_entry object must be created in the context of a particular archive 
object.   The case that I'm not sure about:  What should happen when an entry 
is copied between archives with different codepages?  (For example, when 
copying from a tar archive to an archive_write_disk handle?)

Original comment by kientzle@gmail.com on 5 Mar 2011 at 6:11

GoogleCodeExporter commented 9 years ago
The following is my current draft for doing Unicode -> MBS conversions using 
iconv.

A few details:
   * If a NULL archive object is passed in, a default iconv object is opened and closed for each call.
   * If a non-NULL archive object is passed in, the iconv object will be cached there for re-use.
   * If a non-NULL archive object is passed in, the default MBS character encoding will be read from there.
I think this provides all of the use cases we would like to support.  The first 
is suitable for "detached" archive_entry objects, the last will allow us to 
select character encodings on a per-archive basis.

/*
 * Translates a wide character string into current locale character set
 * and appends to the archive_string.  Note: returns NULL if conversion
 * fails.
 *
 * This version uses the POSIX iconv() function.
 */
struct archive_string *
archive_strappend_w_mbs(struct archive *a, struct archive_string *as, const 
wchar_t *w)
{
    const char *p;
    char buff[256];
    size_t remaining;

    /* XXX Determine the actual name to use here instead of "UNICODE" XXX */
    /* For GNU iconv, we probably want UCS-2-INTERNAL if wchar_t is 16 bits
     * and UCS-4-INTERNAL if wchar_t is 32 bits. */
    if (a == NULL)
        cd = iconv_open("", "UNICODE");
    else if (a->unicode_to_current == (iconv_t)(0))
        cd = iconv_open(a->current_code, "UNICODE");
    else {
        /* Use the cached conversion descriptor after resetting it. */
        cd = a->unicode_to_current;
        iconv(cd, NULL, NULL, NULL, NULL);
    }
    /* XXX if (cd == (iconv_t)(-1)) { handle error } XXX */

    p = (const char *)w;
    remaining = wcslen(w) * sizeof(wchar_t);
    for (;;) {
        size_t avail = sizeof(buff);
        char *outp = buff;
        size_t result = iconv(cd, &p, &remaining, &outp, &avail);

        if (avail < sizeof(buff))
            archive_strncat(as, buff, sizeof(buff) - avail);
        if (result >= 0) {
            break; /* Conversion completed. */
        } else if (errno == EILSEQ) {
            /* Skip the illegal input wchar. */
            archive_append_char(as, '?');
            p += sizeof(wchar_t);
            remaining -= sizeof(wchar_t);
        } else if (errno == E2BIG) {
            /* Flush the output and do some more. */
            continue;
        } else if (errno == EINVAL) {
            /* Final character is invalid. */
            archive_append_char(as, '?');
            break;
        }
    }
    /* Dispose of the conversion descriptor or cache it. */
    if (a == NULL)
        iconv_close(cd);
    else if (a->unicode_to_current == (iconv_t)(0))
        a->unicode_to_current = cd;
    return (as);
}

Original comment by kientzle@gmail.com on 5 Mar 2011 at 8:25

GoogleCodeExporter commented 9 years ago
libarchive/trunk r2992 and r2993 change the charset conversion routines over to 
use iconv() when it's available.

This should fix the problem you saw on FreeBSD with KOI8-R filenames being 
incorrectly stored in pax archives (though I've so far only tested these 
changes on Mac OS).  I think I found a way to avoid breaking source 
compatibility; archive_entry_new() is unchanged and I've added 
archive_entry_new2(struct archive *) to create entries that are bound to a 
particular archive handle.  Internally, the iconv() conversion descriptors are 
stored in the archive handle to be reused for performance, and I've outlined 
support for allowing the character encoding to be set on a per-archive-handle 
basis.

There's still some code cleanup and testing that needs to be done; please let 
me know if you see any further problems.

With this change, bsdtar should correctly store KOI8-R filenames from FreeBSD.  
After we've finished testing that, I'll take a closer look at extraction on 
Windows.

Original comment by kientzle@gmail.com on 7 Mar 2011 at 6:04

GoogleCodeExporter commented 9 years ago
Build on FreeBSD8 with ports/convert/libiconv.
Build error at using iconv() by following message:
libarchive/archive_string.c: In function 'archive_wstring_append_from_mbs':
libarchive/archive_string.c:532: warning: passing argument 2 of 'libiconv' from
incompatible pointer type
libarchive/archive_string.c: In function 'archive_string_append_from_unicode_to_
mbs':
libarchive/archive_string.c:667: warning: passing argument 2 of 'libiconv' from
incompatible pointer type

There are different definitions for iconv() function; one is using 'const' to
the second argument and other is not.
As far I know, Solaris and NetBSD use 'const' but many Linux distro using GNU 
libiconv
do not.

Original comment by ggcueroad@gmail.com on 9 Mar 2011 at 4:48

GoogleCodeExporter commented 9 years ago
Tim, your changes cannot work with eucJP when using archvie_entry_pathname_w() 
on FreeBSD because of issue 66.

setlocale(LC_ALL, "ja_JP.eucJP");

archive_entry_set_pathname(e, "表.txt");// "\xC9\xBD.txt"

then archive_entry_pathname_w(e) returns 
      Dump of archive_entry_pathname_w(e)
0000_68_88_00 00 2e 00 00 00 74 00 00 00 78 00 00 00 h.......t...x...
0010 74 00 00 00                                     t...

but result of performing mbstowcs(ws, "表.txt") is 
      Dump of ws
0000_bd_c9_00 00 2e 00 00 00 74 00 00 00 78 00 00 00 ........t...x...
0010 74 00 00 00                                     t...

An attached patch is to test eucJP in test_pax_filename_encoding_freebsd.c.

Original comment by ggcueroad@gmail.com on 10 Mar 2011 at 3:17

Attachments:

GoogleCodeExporter commented 9 years ago
I have always intended that archive_entry_pathname_w() return Unicode.

The use of mbstowcs() was wrong, and causes bsdtar to generate incorrect 
pax-format archives; POSIX specifies that the filenames in pax archives are 
stored in Unicode.

The archive_entry_pathname_w() here is correct; 表 is \u8868

Original comment by kientzle@gmail.com on 10 Mar 2011 at 4:26

GoogleCodeExporter commented 9 years ago
r3007 adds more checks to CMakeLists.txt.

This should now work properly on FreeBSD (and other platforms with POSIX iconv) 
as well as on platforms with non-POSIX iconv() and which require libiconv.

Still missing autoconf machinery...

Original comment by kientzle@gmail.com on 11 Mar 2011 at 6:26

GoogleCodeExporter commented 9 years ago
Must the applications using libarchive treat wchar_t as Unicode even if 
__STDC_ISO_10646__
is not defined ?
On application side, to get wchar_t string is normally use of mbstowcs() or 
related functions.
What will be happen when using archive_entry_copy_pathname_w() in ja_JP.eucJP 
locale ?

On FreeBSD the archive_entry_pathname_w() with iconv returns Unicode as you say
but the function without iconv will return non Unicode string made with 
wcrtomb/wctomb,
that is quite different behavior.

And I have some concern about UTF-8-MAC, which is Normalization Form D(NFD).
Some platforms, which are X Window system, Windows XP or older Windows 
platform, cannot
correctly display UTF-8-MAC strings because they expect that UTF-8 is 
Normalization Form C (NFC) not NDF.

I think it is better that the way to get UTF-8 is converted from MBS with iconv 
than
converted from wchar_t string in assuming wchar_t as Unicode.

Original comment by ggcueroad@gmail.com on 11 Mar 2011 at 3:32

GoogleCodeExporter commented 9 years ago
"I think it is better that the way to get UTF-8 is converted from MBS with 
iconv..."

Yes, I completely agree.  I want to make this change.

"What will be happen when using archive_entry_copy_pathname_w() in ja_JP.eucJP 
locale ?"

How do people use this?  This is needed on Windows to use the Unicode file 
APIs, but I don't know about other systems.

Right now, we're using these internally for Pax, Joliet, and CAB filenames, but 
those are all Unicode, so our Pax, Joliet, and CAB support is all somewhat 
broken right now.

We could add archive_entry_copy_pathname_unicode() and leave the _w() forms as 
locale-dependent wchar_t.  But I don't know if the non-Unicode _w forms are 
useful; we can't really test them, because the results are both locale- and 
system-dependent.

Original comment by kientzle@gmail.com on 11 Mar 2011 at 4:44

GoogleCodeExporter commented 9 years ago
Here's another way to think about this:

The reason for libarchive to support character-set conversions is to properly 
support the various archive formats.

  Pax uses UTF-8 for filenames
  Joliet (and CAB?) use Unicode (technically UTF-16)
  Almost everything else uses locale-dependent MBS.

So we must support those three.  It's not clear to me why we need to support 
locale-dependent WCS.  mbstowcs() is readily available for people who need to 
work with WCS.

I'm happy to rename the existing _w functions to _unicode if that makes things 
more clear.

Original comment by kientzle@gmail.com on 11 Mar 2011 at 5:09

GoogleCodeExporter commented 9 years ago
Some people use UTF-8 but other people still use eucJP, who say that the 
applications,
which they usually use, do not work properly in UTF-8. They are not only 
FreeBSD users,
NetBSD and Solaris users choose eucJP by the same reason or from force habit.

About filenames in Japan. There are many character-set.
For example, Windows platform uses CP932 aka MS-SJIS,
Linux platforms use UTF-8, MAC OS X uses UTF-8-MAC though locale name is UTF-8,
other POSIX platforms can use eucJP, UTF-8 and SJIS.
And thus I would like to support 'charset' option to specify a filename 
character-set
as "bsdtar -tvf file.cab --options cab:charset=CP932".

CAB can store filenames with UTF-8 but UTF-8 is not used, so CAB usually uses
local-dependent MBS for stored filenames. It means POSIX platforms may be not 
able to
display the filenames even if the platforms use UTF-8 locale. This is an 
another reason
why 'charset' option is needed. I think it is useful that at least CAB, LHA and 
ZIP
format reader would support the option.

I think we should worry about what WCS format is when UTF-8 is completely 
converted to/from
MBS with iconv; WCS is no longer needed to get UTF-8. Therefore, I do not think 
that renaming _w
functions is needed.

Original comment by ggcueroad@gmail.com on 11 Mar 2011 at 7:28

GoogleCodeExporter commented 9 years ago
"... I would like to support 'charset' option to specify a filename 
character-set
as "bsdtar -tvf file.cab --options cab:charset=CP932""

The original bug report suggested something like this, and I want to understand 
what this means.

The point of specifying the archive encoding for "bsdtar -x" is so that the 
filename can be converted from that character set to the character set used for 
filenames.  So if you're using MS-SJIS for filenames and the archive stores 
filenames in CP932, we need to translate those filenames from CP932 to MS-SJIS 
so we can write them to disk.  Is this right?

So it sounds like we should be storing a character encoding with every 
filename.  (The code I have just committed does this by storing a character 
encoding for the archive; is that sufficient?)  We then need a way to convert 
those filenames to the character encoding used by disk files.

So we really need something like:
   /* The archive is storing filenames in CP932 */
   archive_set_charset(a, "CP932);

   /* Get an entry from archive_read_next_header() */

   /* Now we want to open the file using MS-SJIS, so... */
   filename = archive_entry_pathname_in_charset(entry, "MS-SJIS")

If you needed UTF-8, you could ask for that here.

Original comment by kientzle@gmail.com on 12 Mar 2011 at 7:33

GoogleCodeExporter commented 9 years ago
MS-SJIS is Microsoft SJIS, an alias of CP932, which is extended by Microsoft 
and OEM vendors of
MS-DOS/Windows, from SJIS(Shift_JIS character-set).
We need to translate filenames from CP932 stored on Windows platform to UTF-8 
or eucJP
,which most of POSIX like system use, or, as I mentioned filenames on MAC OS X, 
translate
filenames from UTF-8-MAC(NFD) to UTF-8(NFC) or other character-set such as 
eucJP.

>   /* The archive is storing filenames in CP932 */
>   archive_set_charset(a, "CP932);
This is exactly the API I need!

>   /* Now we want to open the file using MS-SJIS, so... */
>   filename = archive_entry_pathname_in_charset(entry, "MS-SJIS")
This would be useful.
For an ISO Writer, to get UTF-16 characters, do you think it can be like this:
  filename_utf16 = (uint16_t *)archive_entry_pathname_in_charset(entry, "UTF-16LE") ?
Since the writer has to store filenames with multiple-character-set into an 
ISO-image.

I hope that all of converting character-set code in format-readers and writers 
would be removed.

Original comment by ggcueroad@gmail.com on 13 Mar 2011 at 2:22

GoogleCodeExporter commented 9 years ago
Since libarchive now has full iconv integration for handling filenames in 
various character sets, I believe this issue is now Fixed.

If you think otherwise, let me know what you think is still missing. 

Original comment by kientzle@gmail.com on 26 Nov 2011 at 9:58