rrthomas / recode

Charset converter tool and library
GNU General Public License v3.0

(recode 3.6) windows-1252: U+017E LATIN SMALL LETTER Z WITH CARON is at byte 0x9e, not byte 0x8f #31

Closed sblondeel closed 3 years ago

sblondeel commented 3 years ago

Observed on recode 3.6 (Debian stable).

NB: I tried to reproduce this with bleeding-edge recode, but compilation fails at this step:

make[3]: Entering directory '/tmp/recode/po'
rm -f be.gmo && : -c --statistics --verbose -o be.gmo be.po
mv: cannot stat 't-be.gmo': No such file or directory
make[3]: *** [Makefile:164: be.gmo] Error 1
make[3]: Leaving directory '/tmp/recode/po'
make[2]: *** [Makefile:202: stamp-po] Error 2
make[2]: Leaving directory '/tmp/recode/po'
make[1]: *** [Makefile:1509: all-recursive] Error 1
make[1]: Leaving directory '/tmp/recode'

Wikipedia and various other online resources place U+017E at byte 0x9e:

https://en.wikipedia.org/wiki/Windows-1252#Character_set

However, recode 3.6 thinks this character is at byte 0x8f and byte 0x9e is invalid:

$ perl -e 'print "\x8f\n"' | recode windows-1252..html
ž
$ perl -e 'print "\x9e\n"' | recode windows-1252..html
recode: Untranslatable input in step `CP1252..ISO-10646-UCS-2'

This document

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

found on the IANA website:

https://www.iana.org/assignments/charset-reg/windows-1252

has the following regarding those two bytes:

0x8F            #UNDEFINED
0x9E    0x017E  #LATIN SMALL LETTER Z WITH CARON

so I would be tempted to believe recode 3.6 is wrong on this.
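As an independent cross-check, the same two bytes can be decoded with a codec that does not share recode's tables. The sketch below uses Python's bundled "cp1252" codec; the helper name decode_cp1252_byte is made up for illustration and is not part of recode or of any library.

```python
# Cross-check with Python's bundled "cp1252" codec (not recode's tables).
# decode_cp1252_byte is a hypothetical helper written for this example.
def decode_cp1252_byte(value):
    """Return the character for a single CP1252 byte, or None if undefined."""
    try:
        return bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        return None


for byte in (0x8F, 0x9E):
    char = decode_cp1252_byte(byte)
    if char is None:
        print(f"0x{byte:02X}: undefined")
    else:
        print(f"0x{byte:02X}: U+{ord(char):04X}")
```

With CPython's codec this reports 0x8F as undefined and 0x9E as U+017E, which agrees with the CP1252.TXT lines quoted above.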

Regards,

rrthomas commented 3 years ago

Sorry, I can't help with 3.6. I hope Debian will upgrade soon!

To build the latest sources, try turning off parallel make flags. (At least, I have MAKEFLAGS set to something like -j 4 by default, and recode fails to build; unsetting it fixes that.)

With recode 3.6 I get the same results as you.

With recode 3.7.8:

$ perl -e 'print "\x8f\n"' | recode windows-1252..html
ž
$ perl -e 'print "\x9e\n"' | recode windows-1252..html
recode: Ambiguous output in step `ISO-10646-UCS-2..HTML_4.0'
$ perl -e 'print "\x8f\n"' | recode windows-1252..utf-8
recode: Invalid input in step `CP1252..UTF-8'
$ perl -e 'print "\x9e\n"' | recode windows-1252..utf-8
ž

This is a bit odd: recode 3.7.8 does what you say is correct when the output is UTF-8, but what you say is wrong when the output is HTML. I shall investigate.

rrthomas commented 3 years ago

I believe that the correct translation in some cases in recode 3.7.8 is due to its default use of iconv (which has the correct encoding for CP1252, presumably). The built-in encoding seems to be wrong. I am loth to change anything in the tables without careful checks, so I'll double-check first!

sblondeel commented 3 years ago

Thanks for the quick response. I understand your reluctance to tweak the tables. I am a bit surprised at the inconsistency depending on the backend; maybe there are other such discrepancies to detect?
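One way to look for further discrepancies is to parse the reference mapping file and compare it with the table under test. The following is only a sketch: REFERENCE_LINES holds just the two lines quoted earlier in this thread, and CANDIDATE_TABLE is a hypothetical stand-in that mirrors the recode 3.6 behaviour reported above, not data extracted from recode's sources. The function names are likewise invented for illustration.

```python
# Two lines quoted from CP1252.TXT earlier in this thread.
REFERENCE_LINES = """\
0x8F            #UNDEFINED
0x9E    0x017E  #LATIN SMALL LETTER Z WITH CARON
"""

# Hypothetical stand-in for a built-in table, mirroring the recode 3.6
# behaviour reported above (0x8F gives U+017E, 0x9E is untranslatable).
CANDIDATE_TABLE = {0x8F: 0x017E}


def parse_mapping(text):
    """Parse 'byte [codepoint] #comment' lines into {byte: codepoint or None}."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if not fields:
            continue  # blank or comment-only line
        byte = int(fields[0], 16)
        mapping[byte] = int(fields[1], 16) if len(fields) > 1 else None
    return mapping


def find_discrepancies(reference, candidate):
    """Return (byte, expected, actual) for each byte where the tables differ."""
    return [
        (byte, expected, candidate.get(byte))
        for byte, expected in sorted(reference.items())
        if candidate.get(byte) != expected
    ]


def describe(codepoint):
    """Format a code point for display, treating None as undefined."""
    return "undefined" if codepoint is None else f"U+{codepoint:04X}"


reference = parse_mapping(REFERENCE_LINES)
for byte, expected, actual in find_discrepancies(reference, CANDIDATE_TABLE):
    print(f"0x{byte:02X}: expected {describe(expected)}, got {describe(actual)}")
```

Feeding the full CP1252.TXT file and a real dump of the table under test into the same two functions would list every differing byte, not only these two.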

Meanwhile, I tried to build bleeding-edge recode using your tip, without success.

sblondeel@debian10:/tmp/recode$ MAKEFLAGS='' make
sblondeel@debian10:/tmp/recode$ make -j 1

Both fail at the same step as before. I don't see MAKEFLAGS set in the Makefile, or any -j hanging around. As the prompt hints, this is on Debian 10 (10.9). The fact that it is running under Oracle VirtualBox should not matter, should it?

Package versions on Debian seem to be held up by the pending release of Debian 11. recode 3.7 is not in the pipeline yet:

https://packages.debian.org/search?keywords=recode&searchon=names&suite=all&section=all

Regards,

rrthomas commented 3 years ago

Having checked the sources you mention, I agree there's an error in recode's CP1252 table, and I'll fix it.

As for building, I suspect you're missing msgfmt. Look at this line from your log:

rm -f be.gmo && : -c --statistics --verbose -o be.gmo be.po

The : command should be msgfmt.
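A quick way to confirm this diagnosis is to check whether msgfmt is on PATH before running configure. This is a minimal sketch using Python's standard shutil.which; find_tool is a name made up for this example. It rests on two assumptions: that configure substitutes the no-op `:` when it cannot find msgfmt, which would explain why t-be.gmo is never created, and that msgfmt comes from the distribution's gettext package.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is absent."""
    return shutil.which(name)


path = find_tool("msgfmt")
if path is None:
    # Assumption: msgfmt is shipped by the distribution's gettext package.
    print("msgfmt not found on PATH; the gettext tools are probably not installed")
else:
    print(f"msgfmt found at {path}")
```

If msgfmt turns out to be missing, installing it and re-running configure before make should let the po directory build.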

rrthomas commented 3 years ago

3.7.9 released with fix.