rrthomas / recode

Charset converter tool and library
GNU General Public License v3.0

Conversion from java to utf-8 fails for certain characters. #34

Closed by gitterrost4 2 years ago

gitterrost4 commented 2 years ago

When trying to recode a java-encoded file containing \u00dc (which corresponds to the character Ü) from java to utf-8, it fails at the step utf16..utf8.

Steps to reproduce:

  1. Create a file containing just "\u00dc": echo '\u00dc' > myfile.
  2. Issue recode -v java..utf8 myfile.
  3. See it fail with Recoding myfile... failed: Invalid input in step 'UTF-16..UTF-8'.
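The steps above can be sketched as a small script. It writes the file with printf rather than echo, because the built-in echo of some shells interprets backslash escapes and could write something other than the literal text. The function names and the temporary directory are illustrative, not part of recode, and the recode call is skipped when recode is not installed.

```shell
#!/bin/sh
# Sketch of the reproduction steps; function names are illustrative only.

# Write the literal text \u00dc plus a newline, with no escape processing.
make_input() {
    printf '%s\n' '\u00dc' > "$1"
}

# Run the conversion from the report; skip quietly if recode is missing.
try_recode() {
    if command -v recode >/dev/null 2>&1; then
        recode -v java..utf8 "$1"
    else
        echo "recode not installed, skipping" >&2
        return 0
    fi
}

tmpdir=$(mktemp -d)
make_input "$tmpdir/myfile"
try_recode "$tmpdir/myfile" || echo "recode reported a failure" >&2
rm -rf "$tmpdir"
```
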

The same thing happens if you create a file containing "Ü", recode it with UTF8..java (this works), and then recode it back again (this fails).

A workaround is to route the conversion through ISO-10646-UCS-2, which apparently was the default for UTF16..UTF8 in recode 3.6:

recode -v java..ISO-10646-UCS-2,ISO-10646-UCS-2..UTF8 myfile
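To confirm that the workaround really produces the expected bytes, the result can be inspected with od. This is a minimal sketch: the helper name is hypothetical, and it relies only on the fact that U+00DC is encoded in UTF-8 as the bytes c3 9c. The recode step is skipped when recode is not installed.

```shell
#!/bin/sh
# Sketch: check that a file holds U+00DC in UTF-8 followed by a newline.
# The helper name is hypothetical; c3 9c is the UTF-8 encoding of U+00DC.
holds_utf8_udiaeresis() {
    hex=$(od -An -tx1 < "$1" | tr -d ' \n')
    [ "$hex" = "c39c0a" ]
}

f=$(mktemp)
printf '%s\n' '\u00dc' > "$f"
if command -v recode >/dev/null 2>&1; then
    # Workaround from this report: go through ISO-10646-UCS-2 explicitly.
    recode -v java..ISO-10646-UCS-2,ISO-10646-UCS-2..UTF8 "$f" &&
        holds_utf8_udiaeresis "$f" &&
        echo "workaround produced c3 9c 0a"
fi
rm -f "$f"
```
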

Ü was the only character I could find that would fail here. ÄÖäöüß all work fine.

rrthomas commented 2 years ago

I can reproduce your problem, and running the command with the extra flag -x: fixes it. I will try to find time to fix #8 soonish.

rrthomas commented 2 years ago

Duplicate of #8