logological / eoconv

Converts text between Esperanto encodings
https://logological.org/eoconv
GNU General Public License v3.0

UTF-8-encoded text is corrupted #1

Open logological opened 7 years ago

logological commented 7 years ago

Dmitry Bogatov reports that eoconv corrupts UTF-8-encoded text:

$ cat input.txt
Hello!
Привет!
Saluxton!
$ eoconv --from post-x --to utf-8 input.txt
Hello!
�Ñ�¸�²�µÑ!
Salŭton!

logological commented 7 years ago

On reviewing the code and documentation, the code seems to be working (or not working, as the case may be) as intended. The post-x "encoding" assumes ASCII input, not UTF-8, and is clearly documented as such. So what we are seeing here is a case of GIGO (garbage in, garbage out).
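To illustrate the kind of corruption in the report, here is a minimal Python sketch of one plausible mechanism: each byte of the UTF-8 input is read as a separate single-byte character and then written back out as UTF-8. This is an assumption made for illustration, not a trace of eoconv's actual code path, and the use of Latin-1 as the stand-in single-byte reading is likewise an assumption.

```python
# Assumed mechanism (not taken from eoconv's source): UTF-8 bytes are read
# one byte per character, then re-encoded as UTF-8 on output.
original = "Привет!"

raw = original.encode("utf-8")        # two bytes per Cyrillic letter, one for "!"
misread = raw.decode("latin-1")       # every byte becomes its own character
corrupted = misread.encode("utf-8")   # each byte >= 0x80 is re-encoded as two bytes

print(repr(misread))                  # mojibake, including invisible control characters
print(len(original), len(raw), len(corrupted))

# The damage is reversible as long as no bytes were dropped along the way.
recovered = corrupted.decode("utf-8").encode("latin-1").decode("utf-8")
```

The pure-ASCII lines (Hello! and Saluxton!) pass through such a round trip unchanged, which fits the report: only the Cyrillic line is damaged.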

That said, there is a potential use case for being able to convert from UTF-8-encoded text that uses {pre,post}-{x,h,caret} transliteration, or HTML entities. The problem is that eoconv conflates transliteration schemes with computer character encodings. The proper solution is to allow the user to separately specify the input and output transliteration schemes, and the input and output character encodings.
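A rough sketch of what that separation could look like, in Python rather than eoconv's own language. The function name `convert`, its parameters, and the scheme names "post-x" and "unicode" are hypothetical and do not reflect eoconv's actual interface; only the post-x scheme is handled here.

```python
import re

# Post-x digraphs and the Unicode letters they stand for.
_POST_X = {"c": "ĉ", "g": "ĝ", "h": "ĥ", "j": "ĵ", "s": "ŝ", "u": "ŭ"}
_POST_X.update({k.upper(): v.upper() for k, v in list(_POST_X.items())})
_POST_X_RE = re.compile(r"([CcGgHhJjSsUu])[Xx]")
_TO_POST_X = str.maketrans({v: k + "x" for k, v in _POST_X.items()})


def convert(data, from_scheme, to_scheme,
            from_encoding="utf-8", to_encoding="utf-8"):
    """Hypothetical converter: transliteration scheme and character
    encoding are specified independently of each other."""
    # Step 1: character encoding. Bytes become Unicode text.
    text = data.decode(from_encoding)

    # Step 2: input transliteration scheme. Text becomes plain Unicode.
    if from_scheme == "post-x":
        text = _POST_X_RE.sub(lambda m: _POST_X[m.group(1)], text)
    elif from_scheme != "unicode":
        raise ValueError(f"unknown scheme: {from_scheme}")

    # Step 3: output transliteration scheme.
    if to_scheme == "post-x":
        text = text.translate(_TO_POST_X)
    elif to_scheme != "unicode":
        raise ValueError(f"unknown scheme: {to_scheme}")

    # Step 4: character encoding again. Text becomes bytes.
    return text.encode(to_encoding)


sample = "Hello!\nПривет!\nSaluxton!\n".encode("utf-8")
result = convert(sample, "post-x", "unicode").decode("utf-8")
print(result)
```

Because the input is decoded as UTF-8 before any transliteration is applied, the Cyrillic line from the report survives intact in this sketch while Saluxton! still becomes Salŭton!. Like any x-system converter, it would also rewrite a genuine letter-plus-x sequence in non-Esperanto text.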