Codecs package - Githubissues

harjitmoe commented 3 years ago

I think I've gotten this to the level that I am willing to PR it, at any rate.

For text encodings, I have tried to implement all WHATWG encodings, plus some more, partly though not entirely to attain near‑parity with the labels for text encodings supported by Python (though ISO-2022-CN, ISO-2022-CN-EXT and IBM-1364 are additional to that). The text encodings not supported are unicode-escape, raw-unicode-escape, idna and punycode (while Punycode may be important for URLs, it is very confusing and I have no real motivation to try to understand it). Additionally, x-mac-korean and x-mac-japanese labels are not supported (Python recognises them, but only as aliases to euc-kr and shift_jis respectively).
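For reference, the Python side of the x-mac-korean / x-mac-japanese point can be checked with a short sketch against CPython's standard `codecs` module (not the Kuroko package). The helper name `canonical_name` is made up for this illustration.

```python
import codecs


def canonical_name(label):
    """Return the codec name Python resolves a label to, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


# In CPython these labels resolve only as aliases of other codecs.
for label in ("x-mac-korean", "x-mac-japanese"):
    print(label, "->", canonical_name(label))
```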

I have generally tried to follow WHATWG where applicable, deviating in places where strictly following WHATWG seemed nonsensical for a non-browser (e.g. I'm allowing encoding to HKSCS when the HKSCS label is used) or otherwise inconsistent (e.g. I'm pedantising nil-effect escape sequences (not only adjacent ones) in ISO-2022-JP, and excluding the HKSCS additions following, not only preceding, the Big5-ETEN range when encoding to Big5). This does mean the behaviour for certain labels is not exactly the same as Python's; for instance, shift_jis, ms-kanji and windows-31j all refer to the Microsoft version per WHATWG, as opposed to Python associating them with the UTC mapping, the Microsoft version and nothing, respectively. For the most part, the WHATWG behaviour is more sensible in practice, given the behaviour of other, non-Python, implementations.
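To make the shift_jis difference concrete, here is a minimal sketch using CPython's own codecs (again, not the Kuroko package). It uses the well-known 0x81 0x60 case, where the UTC mapping and the Microsoft mapping decode to different code points. The `python_codec_for` helper is a name invented for this example.

```python
import codecs


def python_codec_for(label):
    """Return the codec name CPython picks for a label, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


wave = b"\x81\x60"

# UTC mapping (Python's shift_jis): WAVE DASH, U+301C.
utc_result = wave.decode("shift_jis")
# Microsoft mapping (Python's cp932): FULLWIDTH TILDE, U+FF5E.
ms_result = wave.decode("cp932")

print(python_codec_for("shift_jis"), hex(ord(utc_result)))
print(python_codec_for("ms-kanji"), hex(ord(ms_result)))
```

Under the WHATWG-style behaviour described above, all three labels would be expected to give the Microsoft result.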

This does not exactly match Python's API either: for example, the registration of codec lookups is very different owing to a different design decision, there is no support for stream readers / stream writers (yet), and the .encode and .decode methods on strings are not changed (the existing ones are actually used, on valid substrings, by the UTF-8 codec). But when just using codecs.encode and codecs.decode, or even codecs.lookup to get incremental classes, and not making any assumptions about the object returned/accepted by getstate/setstate, it should match Python's behaviour in API terms.
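As a sketch of the portable subset described above, the following uses only `codecs.encode`, `codecs.decode`, and `codecs.lookup` to obtain an incremental decoder, and treats the `getstate` result as opaque. It is written against CPython's `codecs` module; whether it runs unmodified in Kuroko is an assumption, not something verified here. The function name `decode_in_chunks` is illustrative.

```python
import codecs


def decode_in_chunks(label, chunks, errors="strict"):
    """Decode an iterable of bytes chunks with an incremental decoder."""
    decoder = codecs.lookup(label).incrementaldecoder(errors)
    pieces = []
    for chunk in chunks:
        pieces.append(decoder.decode(chunk))
        # Save and restore the state without inspecting its contents.
        state = decoder.getstate()
        decoder.setstate(state)
    # Flush; in strict mode this raises if a sequence was left incomplete.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


text = "héllo €"
data = codecs.encode(text, "utf-8")

# Split so that both multi-byte characters straddle a chunk boundary.
chunks = [data[:2], data[2:8], data[8:]]
result = decode_in_chunks("utf-8", chunks)
print(result == codecs.decode(data, "utf-8"))
```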

Binary-to-text encodings (namely Base64 and Quoted-Printable, and a "Base64UU" which is used in chunks by uuencode, but is not itself uuencode) I have actually implemented with backward semantics (and prefixed their labels with inverse-): they are created with decode and parsed with encode, so as to allow decode to be consistently bytes→str and encode to be consistently str→bytes. Python's less type-consistent approach actually caused problems in contexts where externally supplied encoding labels may occur, and it eventually had to explicitly mark codecs that were not text encodings (e.g. binary-to-text or compression codecs) to exclude them from the string-method/bytes-method versions of encode and decode, so I think this is somewhat justified, despite being in some ways more confusing. That being said, the jury is still out to a certain extent, since these codecs tend to ignore the error mode in favour of just raising the exception, as a binary-to-text encoding cannot necessarily recover from an unrecognised or invalid sequence in the same way that a text encoding can.
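A rough Python sketch of the two approaches follows. The first part shows CPython's behaviour, where `base64` is bytes→bytes and is rejected by the string method. The second part imitates the "inverse" direction using the standard `base64` module. The functions `inverse_base64_decode` and `inverse_base64_encode` are hypothetical names for illustration only, not the package's actual implementation or API.

```python
import base64
import codecs


def str_encode_is_rejected(label):
    """True if str.encode refuses the label as not being a text encoding."""
    try:
        "hi".encode(label)
    except LookupError:
        return True
    return False


# CPython: base64 maps bytes to bytes and only works via codecs.encode.
python_style = codecs.encode(b"hi", "base64")


def inverse_base64_decode(data: bytes) -> str:
    """bytes -> str: 'decoding' creates the Base64 text (hypothetical)."""
    return base64.b64encode(data).decode("ascii")


def inverse_base64_encode(text: str) -> bytes:
    """str -> bytes: 'encoding' parses the Base64 text (hypothetical).

    Invalid input simply raises; no error mode is consulted.
    """
    return base64.b64decode(text.encode("ascii"), validate=True)


as_text = inverse_base64_decode(b"hi")
round_trip = inverse_base64_encode(as_text)
print(python_style, as_text, round_trip)
```

With this arrangement every decode is bytes→str and every encode is str→bytes, at the cost of the direction being the reverse of what the codec's name suggests.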

klange commented 3 years ago

I think it might be best to not commit the generated files, as they amount to a few megabytes of space and the generator isn't as slow as it was earlier in the project - dbdata takes about 7 seconds (we have a test suite that takes longer), sbencs is under 1s.

I'd also like to squash the branch before merging.

harjitmoe commented 3 years ago

I believe GitHub has a feature that can be enabled to squash-merge PRs, which in my experience tends to be cleaner than trying to squash them before merging.

klange commented 3 years ago

Yes, it's an option I have in the merge menu.

harjitmoe commented 3 years ago

Right, so with the generated files excluded, that saves 3.0 MiB from dbdata.krk and a much smaller 85.9 KiB from sbencs.krk.

It is worth noting that if I drop ISO-2022-CN-EXT, I could save a further 1.3 MiB (that would involve deleting dbextra_data_7bit_cnext.krk, and line 8 and lines 1263–1280 from dbextra.krk). I had included it for completeness; however, ISO-2022-CN-EXT is often left unsupported due to being rarely used and requiring a quite large number of mapping tables compared to the other 7-bit ISO 2022 profiles, hence I've kept the tables which only ISO-2022-CN-EXT uses in a separate file.

kuroko-lang / kuroko

Codecs package #4