jose1711 closed this issue 2 years ago
Thanks for the report. Recode 3.7 has most, perhaps all, of the patches from Debian; at least, I applied all those that seemed to be needed on top of the other changes I made.
As far as I can tell from reading the documentation, this is a bug in recode 3.6, or at least in Debian's version. The characters given do not exist in ISO-8859-1, and recode should not silently mistranslate them.
There is a side-effect. In Debian I am able to run echo 'ľščť' | recode -v utf8..flat, yielding lsct. The same effect can be had using iconv -f utf8 -t ascii//translit. On Arch, though, I can't find a way to do this using recode.
As far as I can see, the problem arises in this case because there is no direct conversion in recode from UTF-8 to flat; it must go via ISO-8859-1. So for example, the following works:
$ echo 'éáà' | recode -v utf8..flat
Request: UTF-8..:iconv:..ISO-8859-1..ASCII-BS..flat
Shrunk to: UTF-8..ISO-8859-1..ASCII-BS..flat
eaa
while your example yields empty output.
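The difference between the two inputs can be checked outside recode. The following is a minimal sketch that uses Python's own iso-8859-1 codec as a stand-in for the intermediate ISO-8859-1 step; that substitution is an assumption, and the helper name fits_latin1 is made up for illustration. It models nothing of recode's internals.

```python
import unicodedata


def fits_latin1(text: str) -> bool:
    """Return True if every character of text exists in ISO-8859-1."""
    # Compose first so that 'e' + combining accent counts as a single 'é'.
    composed = unicodedata.normalize("NFC", text)
    try:
        composed.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1("éáà"))   # True: all three are Latin-1 characters
print(fits_latin1("ľščť"))  # False: none of these exist in Latin-1
```

Any route that must pass through ISO-8859-1 therefore has nothing to hand on to the later steps for the second input.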
Enlarging the ASCII-BS encoding to handle more of UTF-8 (and hence presumably converting lat1asci.c to utf8asci.c and ascilat1.c to asciut8.c) would fix this.
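To illustrate what a direct Unicode-to-ASCII flattening could look like, here is a rough sketch based on Unicode decomposition. This is a swapped-in technique chosen for illustration, not what recode's lat1asci.c or iconv's //TRANSLIT actually do, and flatten_to_ascii is a hypothetical name.

```python
import unicodedata


def flatten_to_ascii(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping combining marks."""
    # NFKD splits e.g. 'š' into 's' followed by a combining caron.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII (e.g. 'ł', which has no decomposition) becomes '?'.
    return stripped.encode("ascii", errors="replace").decode("ascii")


print(flatten_to_ascii("ľščť"))  # lsct
print(flatten_to_ascii("éáà"))   # eaa
```

Decomposition only covers letters that Unicode defines as base plus accent; a real transliteration table would also need explicit entries for characters such as ł or ø.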
Since this problem can be solved with iconv, it does not seem so urgent to make it possible in recode.
I've had a deeper look into the code. The change in behaviour with iconv seems to have come in commit 1cdee3a in 2008 (but after version 3.6), when François Pinard rewrote iconv support (removing the in-tree libiconv and instead using external iconv), and removed the use of transliteration.
Have a look at branch iconv-translit if you like. You can say, for example:
$ echo 'ľščť' | src/recode -v utf8..iso-8859-1-translit
Request: UTF-8..:iconv:..ISO-8859-1-TRANSLIT
Shrunk to: UTF-8..ISO-8859-1-TRANSLIT
lsct
The implementation is rather simple-minded: it just adds a "//TRANSLIT" encoding for each normal encoding. This doesn't work with aliases, and in any case it feels like it should be more like a surface (although it's not really a surface), and behave like in 3.6, so for example respect --strict. Still, it's a proof of concept.
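The alias limitation can be pictured with a toy model. Everything below is hypothetical: the tables, the function names and the lookup order are invented for illustration and are not taken from recode's source. The sketch contrasts registering a "-TRANSLIT" twin for each name with treating the suffix as a modifier that is stripped before alias resolution, which is closer to the surface-like behaviour suggested above.

```python
# Toy model only: these tables are hypothetical and far smaller than
# anything in recode; they exist to illustrate the naming issue.
CANONICAL = ["ISO-8859-1", "ANSI_X3.4-1968"]
ALIASES = {"LATIN1": "ISO-8859-1", "ASCII": "ANSI_X3.4-1968"}
SUFFIX = "-TRANSLIT"


def build_registry(canonical):
    """Register each canonical name plus a '-TRANSLIT' twin."""
    registry = {}
    for name in canonical:
        registry[name.upper()] = name
        registry[name.upper() + SUFFIX] = name + "//TRANSLIT"
    return registry


def resolve_twin(name, registry):
    """Look up the whole name; aliases are only tried on the full string."""
    key = name.upper()
    key = ALIASES.get(key, key)
    return registry.get(key)


def resolve_modifier(name, registry):
    """Treat '-TRANSLIT' as a modifier: strip it, resolve aliases, re-attach."""
    key = name.upper()
    translit = key.endswith(SUFFIX)
    if translit:
        key = key[: -len(SUFFIX)]
    key = ALIASES.get(key, key)
    base = registry.get(key)
    if base is None:
        return None
    return base + "//TRANSLIT" if translit else base


registry = build_registry(CANONICAL)
print(resolve_twin("iso-8859-1-translit", registry))  # ISO-8859-1//TRANSLIT
print(resolve_twin("latin1-translit", registry))      # None
print(resolve_modifier("latin1-translit", registry))  # ISO-8859-1//TRANSLIT
```

In the twin scheme an aliased name with the suffix is simply unknown, whereas the modifier scheme resolves it; whether recode should adopt either is a design question the sketch does not settle.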
I can find no evidence that François Pinard deliberately removed transliteration ability, and I think it's possible to restore it.
Thank you, this works for me. Although the ability to use ascii-translit would be nice as well (there is csascii-translit but I don't really know what csascii is :-))
Thanks for the confirmation. I would like to think more about the correct design and user interface for this before adding it to a release.
Is this issue possibly responsible for the strange locale mistranslation?
$ echo 'foo & bar' | recode -v ..html
Request: UTF-8..:iconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
昀漀漀 ☀ 戀愀爀਀
$ echo 'foo & bar' | LC_ALL=C recode -v ..html
Request: ANSI_X3.4-1968..ISO-10646-UCS-2..HTML_4.0
foo &amp; bar
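One observation about the garbled output in the first transcript above, offered as a possible reading rather than a diagnosis: it matches what appears when 16-bit code units are written in one byte order and read back in the other. The sketch below reproduces it with Python's codecs; the function name misread_swapped is made up, and nothing here shows where in recode or iconv such a swap would occur.

```python
def misread_swapped(text: str) -> str:
    """Encode as big-endian 16-bit units, then decode them as little-endian."""
    return text.encode("utf-16-be").decode("utf-16-le")


garbled = misread_swapped("foo & bar\n")
print(garbled)
# Each code point is the original with its two bytes exchanged,
# e.g. 'f' (U+0066) becomes U+6600 and '&' (U+0026) becomes U+2600.
print([f"U+{ord(ch):04X}" for ch in garbled])
```

For characters in the Basic Multilingual Plane, UCS-2 and UTF-16 code units are identical, so using the UTF-16 codecs here is a fair stand-in for ISO-10646-UCS-2.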
@Earnestly quite possibly. See #8:
$ echo 'foo & bar' | recode -x: -v ..html
Request: UTF-8..ISO-10646-UCS-4..UTF-16..ISO-10646-UCS-2..HTML_4.0
foo &amp; bar
Ah, thank you very much for the workaround and quick reply; that does work here as well. When I noticed it using UCS2 I did try offering ISO-10646-UCS-4 but was told recode: Ambiguous output in step 'ISO-10646-UCS-2..HTML_4.0' (Request: ISO-10646-UCS-4..UTF-16..ISO-10646-UCS-2..HTML_4.0).
This is a Debian 10 (Buster):
and this is Arch Linux:
Now I understand that the Debian maintainers patched version 3.6 heavily, but I would still like to learn where the inconsistency is coming from.