rrthomas / recode

Charset converter tool and library
GNU General Public License v3.0
UTF-32BE to UTF-8 conversion #35

Closed BeyondMagic closed 2 years ago

BeyondMagic commented 2 years ago

What's the expected input? Most sites that I can find use something like...

$: echo "U00003072" | recode UTF-32BE..UTF-8
recode: Invalid input in step 'UTF-32BE..UTF-8'
$: echo "U 00 00 30 72" | recode UTF-32BE..UTF-8
recode: Invalid input in step 'UTF-32BE..UTF-8'
$: echo "00 00 30 72" | recode UTF-32BE..UTF-8
recode: Invalid input in step 'UTF-32BE..UTF-8'
$: echo "00003072" | recode UTF-32BE..UTF-8
recode: Invalid input in step 'UTF-32BE..UTF-8'
$: echo "は" | recode UTF-32BE..UTF-8
recode: Invalid input in step 'UTF-32BE..UTF-8'

... but nothing works, as you can see.

BeyondMagic commented 2 years ago

It seems it works only with the binary.

rrthomas commented 2 years ago

I'm a bit confused by this issue. Recode changes character encodings. UTF-32BE is "big endian UTF-32". That means that when you specify that as the input encoding, recode will expect big-endian UTF-32 as input. Am I missing something?

By the way, you can easily find out what sort of input is expected by reversing the recoding:

$ echo "は" | recode UTF-8..UTF-32 | hd
00000000  ff fe 00 00 6f 30 00 00  0a 00 00 00              |....o0......|
0000000c
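
For reference, in the dump above `ff fe 00 00` is a byte order mark in little-endian order, `6f 30 00 00` is U+306F (は), and `0a 00 00 00` is the newline added by `echo`. In UTF-32BE the same character is the four raw bytes `00 00 30 6f`, which is different from the eight text characters `0000306f`. A minimal sketch of turning such hex text into raw bytes, assuming a POSIX shell; `hex_to_bytes` is a hypothetical helper name, not something recode provides:

```shell
# Hypothetical helper (not part of recode): write the raw bytes named by a
# string of hex digits, e.g. "00003072" or "00 00 30 72", to stdout.
hex_to_bytes() {
    # Drop spaces so both "00003072" and "00 00 30 72" are accepted.
    hex=$(printf '%s' "$1" | tr -d ' ')
    case $hex in
        ''|*[!0-9A-Fa-f]*)
            echo "hex_to_bytes: not a hex string: $1" >&2
            return 1 ;;
    esac
    # Each byte needs exactly two hex digits.
    if [ $(( ${#hex} % 2 )) -ne 0 ]; then
        echo "hex_to_bytes: odd number of hex digits: $1" >&2
        return 1
    fi
    while [ -n "$hex" ]; do
        rest=${hex#??}          # everything after the first two digits
        pair=${hex%"$rest"}     # the first two digits
        # Emit the byte through an octal escape, which POSIX printf supports.
        printf "\\$(printf '%03o' "0x$pair")"
        hex=$rest
    done
}
```

Assuming recode is installed, `hex_to_bytes 00003072 | recode UTF-32BE..UTF-8` should then receive the binary input the step expects and print ひ (U+3072).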

BeyondMagic commented 2 years ago

Yes, thanks. I realized that a few hours after I opened the issue. The input I was supplying was UTF-32BE written out as text (hexadecimal code point notation), not binary UTF-32BE.

Anyway, I didn't figure out how to convert this textual Unicode notation, so I am using a hacky workaround for now.
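
One possible workaround, sketched here under the assumption of a POSIX shell and without going through recode at all, is to compute the UTF-8 bytes directly from the code point. The names `put_byte` and `codepoint_to_utf8` are hypothetical, and the accepted spellings (`3072`, `U+3072`, `U00003072`) are an assumption based on the attempts earlier in this thread:

```shell
# Hypothetical helper: write one byte, given as a decimal number, to stdout.
put_byte() {
    printf "\\$(printf '%03o' "$1")"
}

# Hypothetical helper: print the UTF-8 encoding of one code point given as
# hex text, optionally prefixed with "U" or "U+".
codepoint_to_utf8() {
    cp_hex=${1#[Uu]}    # strip an optional leading U
    cp_hex=${cp_hex#+}  # strip an optional +
    case $cp_hex in
        ''|*[!0-9A-Fa-f]*)
            echo "codepoint_to_utf8: not a hex code point: $1" >&2
            return 1 ;;
    esac
    # At most 8 hex digits, so the arithmetic below cannot overflow.
    [ ${#cp_hex} -le 8 ] || return 1
    cp=$((0x$cp_hex))
    # Reject values above U+10FFFF and the surrogate range U+D800..U+DFFF.
    if [ "$cp" -gt 1114111 ] || { [ "$cp" -ge 55296 ] && [ "$cp" -le 57343 ]; }; then
        echo "codepoint_to_utf8: not a Unicode scalar value: $1" >&2
        return 1
    fi
    if [ "$cp" -lt 128 ]; then            # 1 byte:  0xxxxxxx
        put_byte "$cp"
    elif [ "$cp" -lt 2048 ]; then         # 2 bytes: 110xxxxx 10xxxxxx
        put_byte $(( 192 | (cp >> 6) ))
        put_byte $(( 128 | (cp & 63) ))
    elif [ "$cp" -lt 65536 ]; then        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        put_byte $(( 224 | (cp >> 12) ))
        put_byte $(( 128 | ((cp >> 6) & 63) ))
        put_byte $(( 128 | (cp & 63) ))
    else                                  # 4 bytes: 11110xxx then three 10xxxxxx
        put_byte $(( 240 | (cp >> 18) ))
        put_byte $(( 128 | ((cp >> 12) & 63) ))
        put_byte $(( 128 | ((cp >> 6) & 63) ))
        put_byte $(( 128 | (cp & 63) ))
    fi
}
```

For example, `codepoint_to_utf8 U00003072` writes the bytes `e3 81 b2`, which a UTF-8 terminal displays as ひ. The helper handles a single code point per call and adds no trailing newline.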

rrthomas commented 2 years ago

Indeed, the Java encoding allows Unicode escapes to be used, but only for 4 hex digits.