rrthomas / recode

Charset converter tool and library
GNU General Public License v3.0
130 stars 12 forks source link

ISO-10646-UCS-2 gives wrong endianess in 3.7 #6

Closed RalphCorderoy closed 6 years ago

RalphCorderoy commented 6 years ago

My familiar recode ..dump is broken with 3.7.

$ printf '\ua3' | recode -v ..dump
Request: UTF-8..:iconv:..ISO-10646-UCS-2..dump-with-names
Shrunk to: UTF-8..ISO-10646-UCS-2..dump-with-names
UCS2   Mne   Description

A300         syllabe yi nzup
$

The UCS2 column should be 00A3. It seems the problem is the byte order of ISO-10646-UCS-2 output. With 3.6:

$ printf A | recode UTF-8..ISO-10646-UCS-2 | sed -n l
\000A$

3.7:

$ printf A | recode UTF-8..ISO-10646-UCS-2 | sed -n l
A\000$

The Info manual for both versions says

By default, when producing an 'UCS-2' file, Recode always outputs the
high order byte before the low order byte.

That's my expectation, and dump's, and 3.6 does it. 3.7 doesn't and this means dump gives bogus results.

rrthomas commented 6 years ago

Thanks very much for the precise report; I can reproduce your results. I'll look into it now.

rrthomas commented 6 years ago

This is a problem with the use of iconv. While I look into it further, there's a workaround: use -x: to recode to prevent it using iconv.

rrthomas commented 6 years ago

If I just use iconv:

iconv -f UTF-8 -t UCS2 a.txt |sed -n l
A\000$

On the other hand:

iconv -f UTF-8 -t UCS2BE a.txt |sed -n l
\000A$

So it seems iconv has a different default from recode.

I'm not sure that this qualifies as a bug in iconv (though you're welcome to file a bug report against glibc if you think it is). Moreover, I don't think there's much I can do, in general, about iconv implementations; nor do I think recode should try to second-guess iconv.

So I think the best thing is to recommend the use of -x:, and document the perils of using iconv more clearly. (Again, for now I'm loth to switch iconv off by default.)

If you have any suggestions for a better solution, please make it!

Maelan commented 6 years ago

Well, wouldn’t it be possible to explicitely ask iconv for BE?

If the iconv’s default endianness is unsure, I would say that relying on it definitely is a bug on recode’s side. At least, as a user, I shouldn’t have to pass recode ..dump some obscure option to alter its internal behaviour, in order to get it to work.

I tried to investigate the code base for this issue. So I figured out that the function for ..dump is produce_full_dump, and that it reads characters with get_ucs2. get_ucs2 assumes big-endianness by default (here). But I am not familiar enough with the code base to tell what the input of produce_full_dump comes from. :-(


By the way, using UCS-2 internally seems wrong to me, but probably there has been discussion about it already, hasn’t it?

# what about U+01D49E ?

# using iconv:

$ echo -n '𝒞' |recode .. |utf8dump
    𝒞   U+1D49E     .F0.9D.92.9E
$ echo -n '𝒞' |recode utf8..utf8 |utf8dump
    𝒞   U+1D49E     .F0.9D.92.9E
$ echo -n '𝒞' |recode utf8..
recode: Untranslatable input in step `UTF-8..ANSI_X3.4-1968'
$ echo -n '𝒞' |recode ..utf8
recode: Invalid input in step `ANSI_X3.4-1968..UTF-8'
$ echo -n '𝒞' |recode utf8..utf16
recode: Untranslatable input in step `UTF-8..UTF-16'
$ echo -n '𝒞' |recode ..utf16
recode: Invalid input in step `ANSI_X3.4-1968..UTF-16'
$  echo -n '𝒞' |recode utf8..dump
recode: Ambiguous output in step `ISO-10646-UCS-2..dump-with-names'
$  echo -n '𝒞' |recode ..dump
recode: Ambiguous output in step `ISO-10646-UCS-2..dump-with-names'

# with iconv disabled:

$  echo -n '𝒞' |recode -x: .. |utf8dump
    𝒞   U+1D49E     .F0.9D.92.9E
$  echo -n '𝒞' |recode -x: utf8..utf8 |utf8dump
    𝒞   U+1D49E     .F0.9D.92.9E
$  echo -n '𝒞' |recode -x: utf8..
recode: Ambiguous output in step `ISO-10646-UCS-2..ANSI_X3.4-1968'
$  echo -n '𝒞' |recode -x: ..utf8
recode: Ambiguous output in step `ISO-10646-UCS-2..UTF-8'
$  echo -n '𝒞' |recode -x: utf8..utf16 |hexdump -C
00000000  fe ff d8 35 dc 9e                                 |...5..|
# ↑ surprisingly correct
$  echo -n '𝒞' |recode -x: ..utf16
recode: Ambiguous output in step `ISO-10646-UCS-2..UTF-16'
$  echo -n '𝒞' |recode -x: utf8..dump
recode: Ambiguous output in step `ISO-10646-UCS-2..dump-with-names'
$  echo -n '𝒞' |recode -x: ..dump
recode: Ambiguous output in step `ISO-10646-UCS-2..dump-with-names'
rrthomas commented 6 years ago

Thanks for looking into this further. I am inclining towards thinking that recode should by default not use iconv unless it could not otherwise perform the given recoding. At least that would mean that recode's behavior is more consistent.

RalphCorderoy commented 6 years ago

I don't understand why this isn't considered a recode 3.7 bug given it worked in 3.6. If glibc's iconv had altered a major default such as this then it would have broken more than just recode. Could it be that the tocode passed to iconv_open(3) is now calculated differently? I have trouble building 3.6 so can't compare, but...

$ git describe
v3.7
$ git di
diff --git src/iconv.c src/iconv.c
index 1e4bd4a..cac3115 100644
--- src/iconv.c
+++ src/iconv.c
@@ -169,8 +169,13 @@ wrapped_transform (iconv_t conversion, RECODE_SUBTASK subtask)
 bool
 transform_with_iconv (RECODE_SUBTASK subtask)
 {
+  char *s;
   RECODE_CONST_STEP step = subtask->step;
-  iconv_t conversion = iconv_open (step->after->iconv_name,
+  if (!strcmp(step->after->iconv_name, "UCS2"))
+    s = "UCS-2BE";
+  else
+    s = step->after->iconv_name;
+  iconv_t conversion = iconv_open (s,
                                    step->before->iconv_name);
   bool status;

$
$ printf A | src/recode -v ..dump
Request: UTF-8..:iconv:..ISO-10646-UCS-2..dump-with-names
Shrunk to: UTF-8..ISO-10646-UCS-2..dump-with-names
UCS2   Mne   Description

0041   A     latin capital letter a
$
rrthomas commented 6 years ago

@RalphCorderoy I don't view this as a bug because 3.6 behaved differently, but both behaviours correspond to recode's documentation. I am not suggesting that iconv has changed its behaviour.

As I said above, I don't like the idea of changing iconv defaults silently; I'd rather that when you use recode via iconv, it uses iconv as-is unless you tell it otherwise.

Maelan commented 6 years ago

I don’t get how “both behaviours correspond to recode's documentation”. The obvious claim for the dump feature is that it dumps the UCS codes it has been fed with. This belief is apparently supported by the documentation:

5.7 Fully interpreted UCS dump

Another device may be used to get fully interpreted dumps of an 'UCS-2' stream of characters, with one 'UCS-2' character displayed on a full output line. Each line receives the RFC 1345 mnemonic for the character if it exists, the 'UCS-2' value of the character, and a descriptive comment for that character. […] The current implementation allows for dumping charsets other than 'UCS-2'. […]

Hence, recode X..dump supposedly receives as input a stream of characters in some encoding X; and produces the same stream of characters in a special “dump” format.

This is how version 3.7 behaves:

$  echo -n ab… |recode utf8..dump
UCS2   Mne   Description

6100      
6200      
2620         tête de mort

In the example above, recode’s input clearly is the UTF8-encoded sequence of code points U+0061 (latin small letter a), U+0062 (latin small letter b), U+2026 (midline horizontal ellipsis). recode’s output clearly is the dump-formatted sequence of code points U+6100, U+6200 (both CJK ideographs with apparently no name in the universal repertoire), U+2620 (skull and crossbones — don’t ask me why it uses the French name). The output sequence differs from the input sequence. Therefore, this is plainly wrong with respect to the documentation.

For reference, here is the output with version 3.6:

$  echo -n ab |recode utf8..dump
UCS2   Mne   Description

0061   a     latin small letter a
0062   b     latin small letter b
2026   .3    midline horizontal ellipsis

More fun with 3.7:

$ echo -n ab… |recode utf8..ucs2 | recode ucs2..utf8
愀戀☠

If I understand you correctly, you are suggesting that the solution is to document that recode’s output may be wrong because of a blind internal use of some other tool — which happens to be called iconv. So that there is no guarantee for the user that recode works. To me, that’s against the principle of abstraction.

rrthomas commented 6 years ago

As the documentation also says, effectively all bets are off if you build with iconv, because its behavior is not fully specified.

As you say, this goes against the principle of abstraction. See #8 for the solution.

RalphCorderoy commented 6 years ago

Hi @rrthomas,

I don't view this as a bug because 3.6 behaved differently, but both behaviours correspond to recode's documentation.

When I opened this issue I pointed out that 3.6's behaviour did match the documentation and 3.7's doesn't: The Info manual for both versions says

By default, when producing an 'UCS-2' file, Recode always outputs the
high order byte before the low order byte.

dump is expecting UCS-2 to match the documentation too, not just me, but

3.6$ printf A | recode UTF-8..UCS-2 | sed -n l
\000A$
3.7$ printf A | recode UTF-8..UCS-2 | sed -n l
A\000$

I am not suggesting that iconv has changed its behaviour.

OK. So I don't understand your comment about 'all bets being off' if iconv is used because it doesn't fully specify its behaviour. If the behaviour hasn't changed then lack of specification doesn't cause this \000A v. A\000 problem?

As I said above, I don't like the idea of changing iconv defaults silently; I'd rather that when you use recode via iconv, it uses iconv as-is unless you tell it otherwise.

I would also favour using iconv by default if it's available rather than having a second overlapping implementation where they may not agree. Can you fill in some history for those of us that use Francois's version for years and don't know what changed when. It might help clarify. Was recode using iconv when Francois released it? I find 3.6 suggests iconv is being used when it produces the correct result above.

3.6$ printf A | recode -v UTF-8..UCS-2 | sed -n l
Request: UTF-8..:libiconv:..ISO-10646-UCS-2
Shrunk to: UTF-8..ISO-10646-UCS-2
\000A$

Has the ./configure default for whether iconv is used if present changed between 3.6 and 3.7?

It seems odd that the resolution to this is to raise issues on all the major distros requesting iconv be disabled so 3.7 gives compatible output with 3.6.

rrthomas commented 6 years ago

I am sorry, I was unclear. When I said that the behaviour matched the documentation, I meant that it is documented that using iconv can change the behaviour. If you disable iconv, you get the documented behaviour for UCS-2.

I don't know whether iconv's behaviour has changed recently or not. I suspect it has not. I further suspect that the reason iconv's behaviour changed from 3.6 to 3.7 is that it is using a different set of steps to perform the conversion from before.

I'm afraid I don't know the history. François made a lot of changes between 3.6 and his death, and I then made many further changes, both simplifying the code and applying various patches from others. Further, I did this work months ago, so I've already forgotten a lot. The easiest way to find out what changed would be to git bisect between 3.6 and 3.7 with the tests you did above. I'd be very grateful if you found out, as it's not something I have time to do.

Ditto for the situation with the configure default.

I'm not suggesting using an overlapping implementation. In 3.7 I removed all the recode implementations I could of conversions that iconv provides, except where I was unable to prove (in one or two obscure cases) which was correct.

Since you seem not to agree with my idea of changing the default to not use iconv, and since it would indeed be good if this change were avoidable, finding the actual cause of this problem would be a big help.

RalphCorderoy commented 6 years ago

I did already try a bisect between git tags v3.6 and v3.7, but unfortunately couldn't get v3.6 to build due to its old dependencies on auxiliary. I've tried digging about in v3.7, and have some more data, but no answer.

From ASCII works because a BOM is introduced internally that's later picked up inside get_ucs2(). Same happens from `latin1'.

$ printf A | src/recode -v -v ascii..dump
scan_charset(): alias->name: 'ASCII'
scan_charset(): alias->symbol->name: 'ANSI_X3.4-1968'
scan_charset(): alias->symbol->iconv_name: 'ANSI_X3.4-1968'
scan_charset(): alias->name: 'dump-with-names'
scan_charset(): alias->symbol->name: 'dump-with-names'
scan_charset(): alias->symbol->iconv_name: '(null)'
Request: ANSI_X3.4-1968..ISO-10646-UCS-2..dump-with-names
produce_full_dump
    get_ucs2 0xfe 0xff
        src/ucs.c RECODE_SWAP_UNDECIDED
    get_ucs2 0 0x41
        src/ucs.c RECODE_SWAP_NO
produce_full_dump 0x41
UCS2   Mne   Description

0041   A     latin capital letter a
$

There's no manufactured BOM from `utf-8' and so get_ucs2() sticks with RECODE_SWAP_UNDECIDED and returns the faulty 0x4100 to produce_full_dump().

rrthomas commented 6 years ago

Sorry, you already mentioned you couldn't build 3.6.