rrthomas / recode

Charset converter tool and library
GNU General Public License v3.0

"flat" charset (ASCII without diacritics) isn't working #54

Open fhanzlik opened 4 months ago

fhanzlik commented 4 months ago

What was working in older Pinard recode versions is not working now:

echo "růžička"|recode -f u8..flat
rika

(instead of the correct result "ruzicka")

rrthomas commented 4 months ago

Sorry, indeed flat does not work as before. I'm not sure quite what the answer is; it seems to be complicated.

However, I can offer a workaround in the meantime:

echo "růžička"|recode -f u8..iso-8859-1-translit,iso-8859-1..flat
ruzicka

The second step, iso-8859-1..flat, is needed because accented characters that ISO-8859-1 can represent, such as é, are still present after the first step. So:

echo "érůžička"|recode -f u8..iso-8859-1-translit
�rruzicka

whereas

echo "érůžička"|recode -f u8..iso-8859-1-translit,iso-8859-1..flat
eruzicka
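
To make the reason for the two steps concrete, here is a rough Python sketch of the idea. It is an illustration under assumptions, not Recode's code: the helper names `strip_marks`, `to_latin1_translit` and `latin1_to_flat` are hypothetical, and Unicode decomposition stands in for Recode's real transliteration tables.

```python
import unicodedata


def strip_marks(ch: str) -> str:
    """Decompose one character and drop its combining marks (e.g. 'ů' -> 'u').

    Assumption: NFKD decomposition is only a stand-in for a real
    transliteration table; characters without a decomposition pass through.
    """
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def to_latin1_translit(text: str) -> str:
    """Hypothetical stand-in for step 1: keep whatever Latin-1 can
    represent (code points below 256) and approximate everything else."""
    return "".join(ch if ord(ch) < 256 else strip_marks(ch) for ch in text)


def latin1_to_flat(text: str) -> str:
    """Hypothetical stand-in for step 2: remove the accents that
    Latin-1 was still able to carry."""
    return "".join(strip_marks(ch) for ch in text)


word = "érůžička"
after_step1 = to_latin1_translit(word)
after_step2 = latin1_to_flat(after_step1)
print(after_step1)  # éruzicka: the "é" survives because Latin-1 has it
print(after_step2)  # eruzicka
```

The first function never touches "é" because nothing forces it to: the character already fits the target charset. Only a second pass that targets plain ASCII removes it.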

I think the solution to this bug is to make a converter from UTF-8 to ASCII-BS (rather than from Latin-1 to ASCII-BS as at present). This would avoid the need for the -translit step, without adding extra magic. (In Recode 3.6, transliteration is always tried if non-transliterated conversion fails. This means that Recode's behaviour can change according to its input.)
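
A direct converter of that kind could look roughly like the Python sketch below. This is only a sketch of the idea, not a proposal for Recode's implementation: the function name `utf8_to_flat_ascii` is hypothetical, and decomposition plus mark stripping is an assumed substitute for whatever tables Recode would actually use.

```python
import unicodedata


def utf8_to_flat_ascii(data: bytes) -> bytes:
    """Hypothetical one-pass UTF-8 to accent-free ASCII conversion."""
    text = data.decode("utf-8")
    # Split each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base letters.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still outside ASCII (no decomposition exists) becomes "?".
    return base.encode("ascii", errors="replace")


print(utf8_to_flat_ascii("érůžička".encode("utf-8")).decode("ascii"))  # eruzicka
```

Because the input is handled as Unicode from the start, "é" and "ů" take the same path, and the result does not depend on which characters happen to fit an intermediate charset.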

fhanzlik commented 4 months ago

Hi Thomas, thanks for your interest in this issue, and yes - your solution works well!

rrthomas commented 4 months ago

There is a much easier workaround: use ASCII-translit instead of flat:

echo "érůžička"|recode -f u8..ascii-translit
eruzicka

fhanzlik commented 2 months ago

Thomas thanks - I'm now using this conversion format.

rrthomas commented 2 months ago

I'll keep this open as a placeholder, because something needs to happen with flat; I'm just not sure what yet.