Codecs revisited

harjitmoe commented 2 years ago

Since I haven't done anything on this for several days, I should probably PR what I have. (x-mac-japanese would be interesting, since it has some of the mapping issues of x-mac-korean plus the fact that it is, on the conservative side, two different encodings in a trenchcoat, so I should probably keep that one on the backburner for now.) This is basically several things I've been wanting to address for some time now. The main ones are:

Improvements to xraydict to accept a filter function as well as the exclusion list. This means that long hardcoded lists of exclusions (for encodings defined by their difference from other similar encodings) are less necessary.
Seven 1978 JIS (JIS C 6226-1978) mappings have been revised. Five of these (蝉蟬, 騨驒, 箪簞, 剥剝, 屏屛) now take into account disunifications made in 2000 JIS (JIS X 0213-2000) and 2004 JIS, i.e. cases where the 1978 character actually corresponds to a different (usually less simplified) character in the 2004 standard and should be mapped to Unicode as such. Previously these mappings only followed the disunifications made in 1990 JIS (JIS X 0208-1990 with JIS X 0212-1990). The other two (昻 vs 昂) are swapped between a standard position and a position in the NEC Selection of IBM Extensions, since this is apparently closer to the 1978 revision, and is indeed one of the swaps between the "old" (partly 1978 based) and "new" (fully 1983+ based) JIS sequences as implemented by IBM. These changes have minor effects on the jis_encoding codec, and therefore also on the decoding behaviour of the ISO-2022-JP family except for iso-2022-jp itself, but only when the older ESC $ @ rather than ESC $ B appears in input.
Speaking of ISO-2022-JP, I have added a documentation section explaining how the two decoders' response to sequences unlikely to be generated by a single encode operation differs from the UTR#36/WHATWG approach, the Python approach, and the two "end states" of UTC L2/20-202. I have not changed this part of their behaviour, only documented it.
An x-mac-korean codec. This brings the number of Python's “temporary mac CJK aliases, will be replaced by proper codecs in 3.1” (which never were, and still bear that notice, lol) that have (by contrast) proper Kuroko support up to three out of four. Of all legacy Macintosh encodings, MacKorean is easily the one with the largest number of characters that don't exist in Unicode (all of them exist in Adobe-Korea, though not in Adobe-KR). I have deliberately deviated from the three Apple mappings and one Adobe mapping I have for them (some partial, some with kludge mappings) in order to ① take advantage of closer (usually newer) Unicode representations, ② avoid decoding non-ASCII to sequences with non‑alphanumeric ASCII substrings, since they could be syntactically significant, and ③ generally avoid using Apple's Corporate Private Use Area, at the expense of roundtripping.
The johab-ebcdic decoder is likewise changed to avoid using IBM's Corporate Private Use Area, at the expense of roundtripping.
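
To illustrate the first item, here is a rough Python sketch of a layered mapping that accepts a filter function alongside an exclusion list. This is not the actual xraydict code: the class name `LayeredMapping`, its parameter names, the sample code points, and the convention that the filter returns true for base entries to keep are all assumptions made for illustration.

```python
from collections.abc import Mapping


class LayeredMapping(Mapping):
    """Hypothetical stand-in for an xraydict-like mapping (not Kuroko's class).

    Lookups try the overlay first, then fall through to the base mapping,
    unless the key is explicitly excluded or rejected by the filter.
    """

    def __init__(self, base, overlay=None, exclusions=(), filter_function=None):
        self.base = base
        self.overlay = dict(overlay or {})
        self.exclusions = frozenset(exclusions)
        # Assumed convention: filter_function(key, value) -> True to keep.
        self.filter_function = filter_function

    def _visible_in_base(self, key):
        # A base entry is hidden if it is excluded, absent, or filtered out.
        if key in self.exclusions or key not in self.base:
            return False
        if self.filter_function is not None:
            return bool(self.filter_function(key, self.base[key]))
        return True

    def __getitem__(self, key):
        if key in self.overlay:
            return self.overlay[key]
        if self._visible_in_base(key):
            return self.base[key]
        raise KeyError(key)

    def __iter__(self):
        yield from self.overlay
        for key in self.base:
            if key not in self.overlay and self._visible_in_base(key):
                yield key

    def __len__(self):
        return sum(1 for _ in self)


# Made-up base table: two-byte code -> Unicode string.
base = {
    0x2121: "\u3000",
    0x2122: "\u3001",
    0x2123: "\u3002",
    0x2D21: "\u2460",
    0x2D22: "\u2461",
}

# A derived table that remaps one code, drops one code by explicit exclusion,
# and drops the whole of lead byte 0x2D with a single predicate.
derived = LayeredMapping(
    base,
    overlay={0x2122: "\uFF0C"},
    exclusions={0x2123},
    filter_function=lambda key, value: (key >> 8) != 0x2D,
)

print(sorted(hex(key) for key in derived))
```

The predicate covers an entire row in one line, which is the kind of case where an exclusion list would otherwise have to name every code individually.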

harjitmoe commented 2 years ago

Withdrawing review request while I fix something I just noticed.

harjitmoe commented 2 years ago

Okay, done.

kuroko-lang / kuroko

Codecs revisited #28