john-parton / chardetng-py

Simple Python binding for the Rust chardetng library
MIT License

Encodings reported by chardetng-py don't always match up to python's decoding #11

Open john-parton opened 1 year ago

john-parton commented 1 year ago
    LookupError: unknown encoding: windows-874

In a previous version, I just passed the entire buffer to encoding_rs and had it handle decoding entirely in Rust, but that was a little confusing.

Need more robust aliases
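For reference, a minimal sketch of one way to handle that specific error on the Python side, using a codecs search function (the function name here is made up, and this isn't necessarily how chardetng-py should solve it):

    import codecs

    # Hypothetical: route the WHATWG name "windows-874" to Python's built-in
    # cp874 codec. Search functions receive a normalized name, so check both
    # hyphenated and underscored spellings.
    def _whatwg_874_lookup(name):
        if name in ("windows-874", "windows_874"):
            return codecs.lookup("cp874")
        return None  # defer to the other registered search functions

    codecs.register(_whatwg_874_lookup)

    b"\xa1\xb2".decode("windows-874")  # now resolves via cp874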

john-parton commented 1 year ago

Actually it might be even more complicated than that. https://bugs.python.org/issue25416

I think it's possible that encoding_rs and the Python codec will have different output for legacy single-byte encodings.

Mr0grog commented 11 months ago

For what it’s worth, I went through and:

  1. Compared the names/aliases. windows-874 is the only one chardetng can produce that does not map correctly in Python, so that’s already fully addressed.

  2. Wrote a quick script to compare the single-byte WHATWG encoding definitions (found at https://encoding.spec.whatwg.org/#legacy-single-byte-encodings or, more conveniently, in git at https://github.com/whatwg/encoding) with the Unicode Consortium definitions (found at https://unicode.org/Public/MAPPINGS/); a sketch of the comparison appears after this list. They are basically identical, apart from a few exceptions in KOI8-U and windows-1255:

    MATCH: IBM866
    MATCH: ISO-8859-2
    MATCH: ISO-8859-4
    MATCH: ISO-8859-5
    MATCH: ISO-8859-6
    MATCH: ISO-8859-7
    MATCH: ISO-8859-8
    MATCH: ISO-8859-13
    Definitions for KOI8-U do not match!
      Mismatch at byte 174 (0xae)! WHATWG = point 1118 (0x045e) | Unicode = point 9565 (0x255d)
      Mismatch at byte 190 (0xbe)! WHATWG = point 1038 (0x040e) | Unicode = point 9580 (0x256c)
    MATCH: windows-874
    MATCH: windows-1250
    MATCH: windows-1251
    MATCH: windows-1252
    MATCH: windows-1253
    MATCH: windows-1254
    Definitions for windows-1255 do not match!
      Mismatch at byte 202 (0xca)! WHATWG = point 1466 (0x05ba) | Unicode = point <UNDEFINED>
    MATCH: windows-1256
    MATCH: windows-1257
    MATCH: windows-1258

    (The format here: byte <decimal> (<hex>) is the actual byte being decoded, and point <decimal> (<hex>) is the Unicode code point it should decode as.)

    Caveat: I allowed points that are control characters to be treated as undefined for this comparison, which I think is probably reasonable. There are a bunch more encodings that don’t match up otherwise (i.e. a byte is defined as some control character in WHATWG’s definition and is entirely undefined/unmapped in the Unicode Consortium’s definition, or vice versa).
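In case it helps anyone reproduce this, the comparison looks roughly like the sketch below (not the exact script; it assumes local copies of a WHATWG index file and the corresponding Unicode mapping file in the formats published at the links above, and it omits the control-character allowance from the caveat):

    def parse_whatwg_index(path):
        # WHATWG index files: "<pointer>\t0x<code point>\t# <name>" per line.
        # For single-byte encodings, pointer N corresponds to byte 0x80 + N.
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.split("#")[0].split()
                if len(fields) >= 2:
                    table[0x80 + int(fields[0])] = int(fields[1], 16)
        return table

    def parse_unicode_mapping(path):
        # Unicode MAPPINGS files: "0x<byte>\t0x<code point>\t# <name>" per line.
        # Lines with no code point column are undefined bytes.
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.split("#")[0].split()
                if len(fields) >= 2:
                    table[int(fields[0], 16)] = int(fields[1], 16)
        return table

    def compare(whatwg, unicode_map):
        for byte in range(0x80, 0x100):
            w, u = whatwg.get(byte), unicode_map.get(byte)
            if w != u:
                print(f"Mismatch at byte {byte} ({byte:#04x})! "
                      f"WHATWG = {w} | Unicode = {u}")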

These mismatches are definitely not ideal, but I think they are also small enough not to be a problem (at least for the use case here: detecting the encoding of some bytes, not actually decoding/encoding them). None of these bytes show up in chardetng’s frequency data in src/data.rs (assuming I’m reading it correctly), which I think means they aren’t really being considered anyway, so their mapping shouldn’t have a notable impact.

On the other hand, the windows-1255 case does feature a byte that could cause errors when [strictly] decoding in Python, and based on chardetng’s docs, I think it wouldn’t otherwise give you an answer that does that, since it says it discards options with decoding failures/unmapped bytes. That said, I think it’s good advice to avoid ever decoding a sniffed encoding strictly in the first place; IIRC chardet and cChardet both occasionally give answers that are wrong enough that they don’t successfully decode the bytes, and code I had using them seems to always use errors="ignore" or errors="replace" when decoding with their guesses. (This might all be worth mentioning in some addendum to the docs, though… idunno.)
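To make the lenient-decoding point concrete (this assumes Python’s cp1255 leaves 0xCA unmapped, per the comparison above):

    b"\xca".decode("cp1255")                    # raises UnicodeDecodeError
    b"\xca".decode("cp1255", errors="replace")  # "\ufffd" -- the safer habit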

Mr0grog commented 11 months ago

Ah, some historical details on the ones that differ between WHATWG and Unicode, if you’re curious:

john-parton commented 11 months ago

Thanks for looking into that. So it looks like if you use Python to decode, the result might not be correct. I think it's probably best to just document this and move on.

john-parton commented 10 months ago

I'm going to keep this issue open because it's lacking documentation, but it looks like the functional problems are well-addressed.

Mr0grog commented 10 months ago

Yes, good point! I should not have marked that PR as solving this.

I started looking into the multibyte encodings (I think I covered the single-byte ones well), and it seems like the situation is a bit more messy. I’m leaving some notes here so I or someone can document based on them.

Multibyte codecs that Chardetng detects:

GBK

See #149 for details. Basically, gb18030 is a superset of gbk, which is a superset of gb2312. Chardetng expects decoding with encoding_rs, which treats all of them the same (as gb18030). They are all different decoders in Python, though, so we should (and now do) transform GBK → gb18030, which will work as a decoder in Python for all [correctly detected] cases.
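A quick illustration of why that rename is safe for decoding (gb18030’s four-byte sequences are not valid in Python’s gbk decoder):

    data = "😀".encode("gb18030")  # a four-byte gb18030 sequence
    data.decode("gb18030")         # OK: "😀"
    data.decode("gbk")             # raises UnicodeDecodeError in Python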

Shift_JIS

Oh dear, this is messy and complicated. Shift_JIS turns out to be a really old and limited spec that has been extended in a multitude of ways that are incompatible with each other.

⚠️ TL;DR: If Chardetng returns Shift_JIS, the right way to decode in Python is probably:

  1. Try decoding as ms932.
  2. If that fails, try decoding as shift_jis_2004 (which is a superset of all the others in this family).
  3. If that fails, do whatever your fallback action is for bad detection.

(Caveat: if you want/need to behave like a web browser, skip step 2).

I think it would be lovely if that behavior was bundled up into the detect() function here (but in a way where you could call it separately if you are using EncodingDetector directly), and maybe also use it to inform the confidence of compat.detect(), but I think you could also make the case that it’s out of scope. It also adds in a fair amount of overhead if you aren’t ultimately planning to decode the bytes you’re dealing with (a good reason to put it in detect() and not EncodingDetector).

⚠️ End TL;DR

The most popular versions of Shift_JIS are:

WHATWG (and Chardetng) treat Windows-932 as Shift_JIS. So technically, if Chardetng detects Shift_JIS, it really means windows-932 (or ms932 in Python terms). BUT the detection is loose enough, and the overlap between these encodings is big enough, that it will typically detect Shift_JISx0213 or Shift_JIS_2004 or Shift_JIS content as Windows-932 and give Shift_JIS as a result.

So if Chardetng returns Shift_JIS, the right way to decode in Python is probably:

  1. Try decoding as ms932.
  2. If that fails, try decoding as shift_jis_2004 (which is a superset of all the others in this family).
  3. If that fails, do whatever your fallback action is for bad detection.

(Caveat: if you want/need to behave like a web browser, skip step 2).

In an ideal world, I think that behavior would be bundled up into the detect() function here (but in a way where you could call it separately if you are using EncodingDetector directly), and maybe also use it to inform the confidence of compat.detect(), but I think you could also make the case that it’s out of scope. It also adds in a fair amount of overhead if you aren’t ultimately planning to decode the bytes you’re dealing with (a good reason to put it in detect() and not EncodingDetector).
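A minimal sketch of that decode chain (the function name is made up, and step 3’s lossy decode is just one possible fallback):

    def decode_sniffed_shift_jis(data: bytes, browser_like: bool = False) -> str:
        # Step 1: what chardetng actually means by "Shift_JIS" is windows-932.
        try:
            return data.decode("ms932")
        except UnicodeDecodeError:
            pass
        # Step 2: shift_jis_2004 is a superset of the rest of the family.
        # Skip this if you want/need to behave like a web browser.
        if not browser_like:
            try:
                return data.decode("shift_jis_2004")
            except UnicodeDecodeError:
                pass
        # Step 3: whatever your fallback for bad detection is.
        return data.decode("ms932", errors="replace")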


I didn’t have a chance to dive into the other multibyte ones yet. Quick notes on EUC-KR and Big5, but these need more investigation:

EUC-KR

WHATWG’s and encoding_rs’s definition of EUC-KR is actually UHC/Windows-949, which is an extension of the original EUC-KR. I’m not sure whether that’s a strict superset (in which case this is straightforward and we should swap the name like we did for GBK/gb18030) or not, or what idiosyncrasies there are in how Python treats the two (they are at least separate codecs in Python).

Big5

WHATWG’s and encoding_rs’s definition of Big5 is actually Big5-HKSCS. I’m not sure how much of a strict superset this is, or anything else here. Needs more research.

ISO-2022-JP

Haven’t looked into at all yet.

EUC-JP

Haven’t looked into at all yet.

john-parton commented 10 months ago

Hm, I think writing up a condensed version of this -- something like "these encodings tend to be difficult in general: ..." -- and putting it in the docs might be sufficient.

It's obviously pretty tricky and I appreciate you looking into it.

If someone really wants to do what a browser might do, they should probably take the binary stream and pass it directly to encoding_rs. It's not too bad to add bindings. We could create an encoding_rs_py (lol) binding.

Mr0grog commented 10 months ago

That’s fair. Given that, I think we should probably also (in addition to the docs) rename Shift_JIS to ms932 (as was done with gbk → gb18030), since that is really the closest thing in Python’s repertoire to what chardetng means.

I’m still hoping to write some docs for this, but want to do the research on the other remaining multibyte codecs here first.

Mr0grog commented 10 months ago

More research:

ISO-2022-JP

ℹ️ TL;DR: None of Python’s built-in encodings is quite equivalent to the WHATWG version of this, but iso2022_jp_ext is the closest (and fairly narrow) superset. For safe usage, it probably makes sense to map ISO-2022-JP to iso2022_jp_ext. ℹ️

ISO-2022-JP has a similarly complicated backstory to GBK with incompatible families of extensions. ISO-2022 is a structure for making a codec that is stateful and switches between different sub-encodings when it encounters certain escape sequences (e.g. when it encounters the bytes ESC ( J it switches to interpreting everything after as if it were the roman component of JIS X 0201-1976, and when it encounters ESC $ B it switches to interpreting everything after as if it were JIS X 0208-1983).

Note: JIS X NNNN is an encoding standard, and JIS X NNNN-YYYY is the version of that standard published in year YYYY.

So in Python, nothing is quite equivalent to the WHATWG version (which is what Chardetng is guessing when it says ISO-2022-JP). Straight-up iso2022_jp will fail to decode some things Chardetng thinks are OK, but iso2022_jp_ext is a strict superset of what Chardetng is looking for (the only superset!), so it should always succeed. A bit of fiddling around with various characters that are only supported in some of the sub-encodings listed here seems to confirm this in practice (using CPython 3.12.0).
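One example of that fiddling: half-width katakana mode (entered via ESC ( I) is in the WHATWG decoder and in iso2022_jp_ext, but not in plain iso2022_jp, so something like this should show the difference:

    data = b"\x1b(I1\x1b(B"        # ESC ( I, one katakana byte, back to ASCII
    data.decode("iso2022_jp_ext")  # "ｱ" (HALFWIDTH KATAKANA LETTER A)
    data.decode("iso2022_jp")      # raises UnicodeDecodeError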

If we are going ahead with remapping names for compatibility, it probably makes sense to map ISO-2022-JP → iso2022_jp_ext.

EUC-JP

ℹ️ TL;DR: As far as decoding goes WHATWG/chardet/encoding_rs’s concept of EUC-JP is the same as Python’s. We shouldn’t need to treat this one specially. ℹ️

EUC encodings are like ISO-2022 and Shift-JIS in that they contain several sub-encodings. Like ISO-2022, they support more/are a bit more flexible than Shift-JIS, but they use a switching mechanism based on what bytes are allowed where, like Shift-JIS, rather than statefully switching modes like ISO-2022. (If you’re familiar with UTF-8, it’s similar.)

For Japanese, there are three popular versions:

Neither the WHATWG standard nor encoding_rs/chardetng use anything from JIS X 0213, and they pretty much exactly match Python’s behavior. We don’t need to do anything special for this family of encodings.


Still remaining:

john-parton commented 10 months ago

That all makes sense to me. I wonder if documenting it is sufficient, or if we should emit a warning of some kind.

By putting the mapping logic directly into the rust code, we have created a minor problem if the user wants to know what chardetng actually outputs. For instance, if they want to pass the output to a binding of encoding_rs.

Additionally, if we want to emit a warning using the python warnings module, I'm not sure we can really do that because the mapping is already done in rust.

I think perhaps deviating very slightly from the Rust struct and adding a compat flag to the guess method might make sense.
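Something like this, maybe (purely a hypothetical shape for the API):

    detector = EncodingDetector()
    detector.feed(some_ms932_bytes)
    detector.guess(tld=None, allow_utf8=False, compat=True)   # "ms932"
    detector.guess(tld=None, allow_utf8=False, compat=False)  # "Shift_JIS"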

Mr0grog commented 10 months ago

> By putting the mapping logic directly into the rust code, we have created a minor problem if the user wants to know what chardetng actually outputs. For instance, if they want to pass the output to a binding of encoding_rs.

So my feel on this is:

(Side note: FWIW, after having read up and thought about it more, I think Shift_JIS → ms932 belongs more to the “exact match” category than the “safe superset” category. Also, Python’s canonical name for that is cp932, but both it and encoding_rs recognize ms932, so that one is probably better to use if we did a mapping.)

Some ideas for splitting the difference here…

  1. Add an argument to the EncodingDetector constructor that tells it whether to return Python or WHATWG names. For example:

    detector = EncodingDetector(use_python_names=True)
    detector.feed(some_ms932_bytes)
    detector.guess(tld=None, allow_utf8=False)
    # "ms932"
    
    detector = EncodingDetector(use_python_names=False)
    detector.feed(some_ms932_bytes)
    detector.guess(tld=None, allow_utf8=False)
    # "whatwg-shift_jis" or "shift_jis" or "Shift_JIS"

    I think .detect() and compat.detect(), which go for simplicity, would set this to use Python names, but the default could be to use WHATWG/encoding_rs names.

    (Edit: I think your suggestion to add this as an argument to .guess()/.guess_assess() is probably better than the constructor idea I proposed here. The only bonus with the constructor is that it leaves the method signatures cleanly matching chardetng’s.)

  2. Always return a WHATWG name, but provide a way to register the equivalent Python decoders for it:

    chardetng_py.detect(some_ms932_bytes)
    # "whatwg-shift_jis"
    
    chardetng_py.register_loose_decoders()
    some_ms932_bytes.decode("whatwg-shift_jis")
    # Works by using Python’s `ms932` codec.
  3. Make detect() and compat.detect() use Python-compatible names and EncodingDetector use WHATWG names. Provide the transform as a Python function someone can use with the result from EncodingDetector if they want:

    chardetng_py.detect(some_ms932_bytes)
    # "ms932"
    
    detector = chardetng_py.EncodingDetector()
    detector.feed(some_ms932_bytes)
    result = detector.guess(tld=None, allow_utf8=False)
    # "whatwg-Shift_JIS"
    
    chardetng_py.compatible_codec_name(result)
    # "ms932"

(Edit: updated with a note on idea 2 and added idea 3, which I originally added as a separate comment: https://github.com/john-parton/chardetng-py/issues/11#issuecomment-1802660449. Also fixed some typos where I wrote ms939 instead of ms932)

john-parton commented 10 months ago

Interesting, either way. I'll need to think about it for a while.

Mr0grog commented 10 months ago

Oops, forgot a third: have detect() and compat.detect() return Python-compatible names, and EncodingDetector return WHATWG names. Provide the transform as a Python function someone can call with the result from EncodingDetector.guess() if they want.

Mr0grog commented 10 months ago

Alright, wrapping up with the final two:

EUC-KR

ℹ️ TL;DR: As far as decoding goes, WHATWG/chardet/encoding_rs’s concept of EUC-KR is equivalent to Python’s uhc/ms949/cp949. Python’s euc-kr is a subset (sort of) and not safe to consider equivalent. ℹ️

EUC-KR is structured like EUC-JP, but just uses different sub-encodings. The original version had some major drawbacks, so both Microsoft and Apple developed their own extended versions:

This one is ultimately pretty simple. WHATWG’s EUC-KR is the same as Python’s uhc or ms949. Unfortunately there is no alias that is commonly supported in both Python’s built-ins and encoding_rs (similar to the situation with cp874).
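A quick demonstration of why Python’s euc-kr isn’t safe to treat as equivalent (“똠” is a well-known example of a syllable outside the original KS X 1001 repertoire but inside UHC):

    data = "똠".encode("uhc")  # uses UHC's extended byte ranges
    data.decode("uhc")         # OK ("uhc" is an alias of Python's cp949)
    data.decode("euc-kr")      # raises UnicodeDecodeError in Python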

Big5

ℹ️ TL;DR: As far as decoding goes, WHATWG/chardet/encoding_rs’s concept of Big5 is a mix of the Big5-HKSCS extension (Python: big5-hkscs) and Windows-950 (Python: cp950). There is no equivalent or superset in Python, although big5-hkscs is closest (note this label, with the hyphen, works in both encoding_rs and Python), even though it is an older version of HKSCS and is missing a bunch of characters supported in the WHATWG version. ℹ️

The backstory on Big5 is pretty complex! Big5 was created in a somewhat ad-hoc way by various computer manufacturers in Taiwan, and was eventually standardized for interoperability. It was still pretty limited, though, so there is a whole mess of more specialized encodings that extend it. Some popular branches of the tree of extensions:

Since both Windows-950 and HKSCS were quite popular, WHATWG wound up standardizing on a combination of the two. It seems to be basically HKSCS-2016 plus any additional byte sequences that don’t work in it but do in Windows-950, which works out to HKSCS + 12 characters:

Bytes "0xa1 0xc2" = U+00AF ("¯" MACRON)
Bytes "0xa1 0x45" = U+2027 ("‧" HYPHENATION POINT)
Bytes "0xa3 0xe1" = U+20AC ("€" EURO SIGN)
Bytes "0xa2 0x41" = U+2215 ("∕" DIVISION SLASH)
Bytes "0xa1 0xf2" = U+2295 ("⊕" CIRCLED PLUS)
Bytes "0xa1 0xf3" = U+2299 ("⊙" CIRCLED DOT OPERATOR)
Bytes "0xa1 0x4e" = U+FE51 ("﹑" SMALL IDEOGRAPHIC COMMA)
Bytes "0xa2 0x42" = U+FE68 ("﹨" SMALL REVERSE SOLIDUS)
Bytes "0xa1 0xe3" = U+FF5E ("~" FULLWIDTH TILDE)
Bytes "0xa2 0x46" = U+FFE0 ("¢" FULLWIDTH CENT SIGN)
Bytes "0xa2 0x47" = U+FFE1 ("£" FULLWIDTH POUND SIGN)
Bytes "0xa2 0x44" = U+FFE5 ("¥" FULLWIDTH YEN SIGN)

Unfortunately, there is nothing like this built into Python. First off, the big5-hkscs codec is based on the 2004 standard, while WHATWG is based on the 2016 version (92 more characters). You could probably handle this in Python by decoding as big5hkscs and using a custom error handler that handles the above sequences and the missing characters, but that’s not great; a sketch of that idea follows. The right way to think about this is probably that it means “could be big5hkscs or cp950,” since I think what WHATWG was trying to do here is make a decoder that kinda sorta works for both (even though you get somewhat messy results for a lot of Windows-950 text, it works for most characters).
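For the record, a sketch of that error-handler idea (it assumes Python’s big5hkscs fails on these byte pairs, and it only covers the 12 extra characters above, not the HKSCS-2016 additions):

    import codecs

    # The 12 Windows-950 byte pairs WHATWG layered on top of HKSCS.
    WHATWG_BIG5_EXTRAS = {
        b"\xa1\xc2": "\u00af", b"\xa1\x45": "\u2027", b"\xa3\xe1": "\u20ac",
        b"\xa2\x41": "\u2215", b"\xa1\xf2": "\u2295", b"\xa1\xf3": "\u2299",
        b"\xa1\x4e": "\ufe51", b"\xa2\x42": "\ufe68", b"\xa1\xe3": "\uff5e",
        b"\xa2\x46": "\uffe0", b"\xa2\x47": "\uffe1", b"\xa2\x44": "\uffe5",
    }

    def whatwg_big5_extras(error):
        # Try to salvage the two-byte sequence the decoder choked on.
        pair = bytes(error.object[error.start:error.start + 2])
        if pair in WHATWG_BIG5_EXTRAS:
            return WHATWG_BIG5_EXTRAS[pair], error.start + 2
        raise error

    codecs.register_error("whatwg-big5-extras", whatwg_big5_extras)

    # text = some_big5_bytes.decode("big5hkscs", errors="whatwg-big5-extras")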

Anyway!

(Edit on 2023-11-13: Rewrote the section on Big5 when I ran into some edge cases today. It’s now much more accurate and detailed.)

Mr0grog commented 10 months ago

Overall summary of encodings and their support/similarities/differences between WHATWG/encoding_rs and Python’s built-in codecs:

| Single-byte? | WHATWG Name | Python Builtin | Equivalent | Notes |
| --- | --- | --- | --- | --- |
| Yes | IBM866 | " | ✅ Yes | |
| Yes | ISO-8859-2 | " | ✅ Yes | |
| Yes | ISO-8859-4 | " | ✅ Yes | |
| Yes | ISO-8859-5 | " | ✅ Yes | |
| Yes | ISO-8859-6 | " | ✅ Yes | |
| Yes | ISO-8859-7 | " | ✅ Yes | |
| Yes | ISO-8859-8 | " | ✅ Yes | |
| Yes | ISO-8859-13 | " | ✅ Yes | |
| Yes | KOI8-U | " | ❌ No | WHATWG’s version is actually KOI8-RU, which Python does not have built-in support for. 2 mismatched bytes: 0xAE is “ў” in WHATWG and “╝” in Python; 0xBE is “Ў” in WHATWG and “╬” in Python. There is no better-matching codec in Python. |
| Yes | windows-874 | cp874 | ✅ Yes | Different name, but exact same results. |
| Yes | windows-1250 | " | ✅ Yes | |
| Yes | windows-1251 | " | ✅ Yes | |
| Yes | windows-1252 | " | ✅ Yes | |
| Yes | windows-1253 | " | ✅ Yes | |
| Yes | windows-1254 | " | ✅ Yes | |
| Yes | windows-1255 | " | ❌ No | 1 mismatched byte: 0xCA is U+05BA [HEBREW POINT HOLAM HASER FOR VAV] in WHATWG and undefined in Python. |
| Yes | windows-1256 | " | ✅ Yes | |
| Yes | windows-1257 | " | ✅ Yes | |
| Yes | windows-1258 | " | ✅ Yes | |
| No | GBK | gb18030 | ✅ Yes | This Python name is an alias that works in both Python and encoding_rs. |
| No | Big5 | big5-hkscs | ❌ No | This is the closest equivalent in Python, but it’s not a safe superset and is missing 104 characters that work in the ultra-weird WHATWG version. |
| No | Shift_JIS | ms932 | ✅ Yes | This Python name is an alias that works in both Python and encoding_rs. These are basically the same, although there is a little complexity on the Python side, where a couple of characters decode differently depending on the STRICT_BUILD flag that is set when Python is built. |
| No | ISO-2022-JP | iso2022_jp_ext | ❌ No | The Python codec handles a superset of the WHATWG encoding here. Some bytes that would fail in chardetng/encoding_rs will decode OK with it in Python. |
| No | EUC-JP | " | ✅ Yes | |
| No | EUC-KR | uhc | ✅ Yes | Different name, but exact same results. |

Notes:

  1. This treats any control characters as equivalent even if one of the codecs doesn't support them. You may have to decode and ignore/skip errors to get exactly matching output if you are handling bytes that include control characters.
  2. " denotes names that are the same as the matching built-in codec in Python. The canonical names of them in Python are a little different, though, e.g. ISO-8859-2iso8859-2.

(Edit on 2023-11-13: Updated the entry for Big5. I wound up tripping over the edge cases on it today and discovered it’s the weirdest one here; WHATWG just sort of slammed two similar encodings together for it. I’ve updated the earlier, detailed comment for Big5 with more info.)

john-parton commented 10 months ago

This is overkill in the best possible sense. I definitely think it makes sense to document this behavior. I would almost immediately accept a PR that updates the docs with some information on the more difficult charsets.

Mr0grog commented 10 months ago

I’m happy to try and wrap this up into something more condensed and readable, but I think it would be good to figure out the actual strategy for what (if any) names are getting remapped under what circumstances first (don’t need to have implemented it yet).

Mr0grog commented 10 months ago

Quick update: I ran into some weird behavior with Big5 today and had more energy to dive into the details. It turns out to be kinda weird! I updated my detailed comments above if you want to know more, but the short version is: it’s the only one where there is not a clear equivalent/safe superset in Python’s built-in codecs. big5-hkscs, which I’d recommended as equivalent before, is still close, but less close than I’d thought.