Restore transliteration by iconv

rrthomas / recode

Charset converter tool and library

GNU General Public License v3.0

Restore transliteration by iconv #24

Closed jose1711 closed 2 years ago

jose1711 commented 4 years ago

This is a Debian 10 (Buster):

$ recode --version
Free recode 3.6
Written by Franc,ois Pinard <pinard@iro.umontreal.ca>.

Copyright (C) 1990, 92, 93, 94, 96, 97, 99 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ echo 'ľščť' | recode -v utf8..iso-8859-1
Request: UTF-8..:libiconv:..ISO-8859-1
Shrunk to: UTF-8..ISO-8859-1
lsct

and this is Arch Linux:

$ recode --version

recode 3.7.6
Written by François Pinard <pinard@iro.umontreal.ca>.

Copyright (C) 1990-2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ echo 'ľščť' | recode -v utf8..iso-8859-1
Request: UTF-8..:iconv:..ISO-8859-1
Shrunk to: UTF-8..ISO-8859-1
recode: Untranslatable input in step `UTF-8..ISO-8859-1'

I understand that the Debian maintainers patched version 3.6 heavily, but I would still like to learn where the inconsistency comes from.

rrthomas commented 4 years ago

Thanks for the report. Recode 3.7 has most, perhaps all, of the patches from Debian; at least, I applied all those that seemed to be needed on top of the other changes I made.

As far as I can tell from reading the documentation, this is a bug in recode 3.6, or at least in Debian's version. The characters given do not exist in ISO-8859-1, and recode should not silently mistranslate them.
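For readers who want to check the claim that these characters do not exist in ISO-8859-1, here is a minimal Python sketch. It has nothing to do with recode's internals; `latin1_gaps` is a hypothetical helper name, and it only relies on Python's built-in `iso-8859-1` codec.

```python
def latin1_gaps(text: str) -> list[str]:
    """Return the characters of `text` that ISO-8859-1 cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


slovak = "\u013e\u0161\u010d\u0165"  # ľščť, the sample from the report
french = "\u00e9\u00e1\u00e0"        # éáà, all within U+0000..U+00FF

print(latin1_gaps(slovak))  # all four characters are reported as missing
print(latin1_gaps(french))  # [] (every character is representable)
```

All four code points of the first sample lie above U+00FF, which is why a strict converter has to report them as untranslatable.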

jose1711 commented 4 years ago

There is a side effect. On Debian I am able to run echo 'ľščť' | recode -v utf8..flat, which yields lsct. The same effect can be had using iconv -f utf8 -t ascii//translit. On Arch, though, I can't find a way to do this using recode.
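As an illustration of the kind of flattening being asked for, here is a rough Python approximation. This is not how iconv's //TRANSLIT or recode's flat charset work; it swaps in a different technique (Unicode NFKD decomposition followed by dropping combining marks), and `strip_diacritics` and `to_ascii` are hypothetical names.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_ascii(text: str, replacement: str = "?") -> str:
    """Flatten to ASCII; anything still non-ASCII becomes `replacement`."""
    stripped = strip_diacritics(text)
    return "".join(ch if ord(ch) < 128 else replacement for ch in stripped)


print(to_ascii("\u013e\u0161\u010d\u0165"))  # ľščť -> lsct
print(to_ascii("\u00e9\u00e1\u00e0"))        # éáà  -> eaa
print(to_ascii("\u00df"))                    # ß has no decomposition -> ?
```

The last line shows where this approximation falls short of a real transliteration table: characters without a canonical or compatibility decomposition are simply replaced.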

rrthomas commented 4 years ago

As far as I can see, the problem arises in this case because there is no direct conversion in recode from UTF-8 to flat; it must go via ISO-8859-1. So for example, the following works:

$ echo 'éáà' | recode -v utf8..flat
Request: UTF-8..:iconv:..ISO-8859-1..ASCII-BS..flat
Shrunk to: UTF-8..ISO-8859-1..ASCII-BS..flat
eaa

while your example yields empty output.

Enlarging the ASCII-BS encoding to handle more of UTF-8 (and hence presumably converting lat1asci.c to utf8asci.c and ascilat1.c to asciut8.c) would fix this.

Since this problem can already be solved with iconv, it does not seem urgent to make it possible in recode.

rrthomas commented 4 years ago

I've had a deeper look into the code. The change in behaviour with iconv seems to have come in commit 1cdee3a in 2008 (but after version 3.6), when François Pinard rewrote iconv support (removing the in-tree libiconv and instead using external iconv), and removed the use of transliteration.

rrthomas commented 4 years ago

Have a look at branch iconv-translit if you like. You can say, for example:

$ echo 'ľščť' | src/recode -v utf8..iso-8859-1-translit
Request: UTF-8..:iconv:..ISO-8859-1-TRANSLIT
Shrunk to: UTF-8..ISO-8859-1-TRANSLIT
lsct

The implementation is rather simple-minded: it just adds a "//TRANSLIT" encoding for each normal encoding. This doesn't work with aliases, and in any case it feels like it should be more like a surface (although it's not really a surface), and behave like in 3.6, so for example respect --strict. Still, it's a proof of concept.
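To make the intended behaviour concrete (characters the target charset can represent pass through unchanged, and only the rest are transliterated), here is a small Python sketch. It is an analogy, not the code on the iconv-translit branch: it uses Python's `codecs.register_error` hook, the handler name `translit_sketch` is made up for this example, and the fallback is a simple NFKD-based approximation rather than iconv's transliteration tables.

```python
import codecs
import unicodedata


def _translit_handler(exc):
    """Encode-error handler: replace unencodable characters by ASCII lookalikes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    decomposed = unicodedata.normalize("NFKD", bad)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep the replacement pure ASCII so any ASCII-compatible target accepts it.
    replacement = "".join(ch if ord(ch) < 128 else "?" for ch in base)
    return replacement, exc.end


# "translit_sketch" is a hypothetical handler name, not a standard one.
codecs.register_error("translit_sketch", _translit_handler)

text = "\u013e\u0161\u010d\u0165 \u00e9"  # ľščť é
print(text.encode("iso-8859-1", errors="translit_sketch"))  # b'lsct \xe9'
```

Note that é survives as the ISO-8859-1 byte 0xE9 while ľščť falls back to lsct. A strict mode would correspond to using the default `errors="strict"`, which raises instead of substituting.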

rrthomas commented 4 years ago

I can find no evidence that François Pinard deliberately removed transliteration ability, and I think it's possible to restore it.

jose1711 commented 4 years ago

Thank you, this works for me. Although the ability to use ascii-translit would be nice as well (there is csascii-translit but I don't really know what csascii is :-))

rrthomas commented 4 years ago

Thanks for the confirmation. I would like to think more about the correct design and user interface for this before adding it to a release.

Earnestly commented 3 years ago

Is this issue possibly responsible for this strange locale-dependent mistranslation?

$ echo 'foo & bar' | recode -v ..html
Request: UTF-8..:iconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
&#26112;&#28416;&#28416;&#8192;&#9728;&#8192;&#25088;&#24832;&#29184;&#2560

$ echo 'foo & bar' | LC_ALL=C recode -v ..html
Request: ANSI_X3.4-1968..ISO-10646-UCS-2..HTML_4.0
foo &amp; bar

rrthomas commented 3 years ago

@Earnestly quite possibly. See #8:

$ echo 'foo & bar' | recode -x: -v ..html
Request: UTF-8..ISO-10646-UCS-4..UTF-16..ISO-10646-UCS-2..HTML_4.0
foo &amp; bar

Earnestly commented 3 years ago

Ah, thank you very much for the workaround and quick reply; that does work here as well. When I noticed it using UCS2 I did try offering ISO-10646-UCS-4 but was told recode: Ambiguous output in step 'ISO-10646-UCS-2..HTML_4.0' (Request: ISO-10646-UCS-4..UTF-16..ISO-10646-UCS-2..HTML_4.0).
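A note on the garbled entities in the ..html output reported above: the numbers are consistent with each UCS-2 code unit having its two bytes swapped (for example, 26112 is 0x6600, and 'f' is 0x0066). The Python sketch below reproduces the same sequence under that assumption. Whether a byte-order mix-up is the actual cause inside recode is not confirmed by this thread, and the function names are hypothetical.

```python
def numeric_entities(text: str) -> str:
    """Render every character as a decimal HTML numeric entity."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def swap_ucs2_byte_order(text: str) -> str:
    """Encode as little-endian 16-bit units, then misread them as big-endian."""
    return text.encode("utf-16-le").decode("utf-16-be")


garbled = numeric_entities(swap_ucs2_byte_order("foo & bar\n"))
print(garbled)
# &#26112;&#28416;&#28416;&#8192;&#9728;&#8192;&#25088;&#24832;&#29184;&#2560;
```

The trailing &#2560; is the byte-swapped newline that echo appends (0x000A read as 0x0A00).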