Don't Cursed Open Inside

ThePhD commented 3 years ago

This is a running list of all the (mildly to extremely) cursed encodings, and whether or not we should implement them. More can be suggested on Twitter. Here goes:

[ ] UNIVAC encoding (Section 1, part 2, PDF page 9 http://www.bitsavers.org/pdf/univac/418/UP-7599r1_rtosAssemb_Jul70.pdf)
[x] ISO 8859-1
[x] ISO 8859-15
[ ] ISO/IEC 2022 Encodings (https://en.wikipedia.org/wiki/ISO/IEC_2022)
[ ] ISO/IEC 646 Encodings (https://en.wikipedia.org/wiki/ISO/IEC_646)
[ ] DOS Codepages (https://www.aivosto.com/articles/charsets-codepages-dos.html#codepage861)
[x] ~~MULE_INTERNAL (Multilanguage Emacs internal encoding)~~ Garbage encoding for an even more garbage text editor.
[x] PETSCII (with state for lower/upper mapping based on literal "SHIFT" button state)
[x] ATASCII (with state for lower/upper mapping based on literal "SHIFT" button state)
[x] SHIFT-JIS (already implemented in example code)
[x] Tatar (#15)
[x] ~~UTF-EBCDIC~~ This may be patent-encumbered or license-checked, and therefore cannot be implemented.
[x] ~~UTF-7~~ This may be patent-encumbered or license-prohibited, and therefore cannot be implemented.
[x] ~~UTF-7-IMAP~~ This may be patent-encumbered or license-prohibited, and therefore cannot be implemented.
[x] ~~UTF-1~~ Not a good encoding.
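
To illustrate what "with state" means for the PETSCII entry above, here is a rough Python sketch, not the library's implementation. It assumes the upper/lower mapping is toggled by the in-band control bytes 0x0E (switch to the lowercase set) and 0x8E (switch back to the uppercase/graphics set), and it only handles the ASCII-compatible range 0x20-0x3F plus the letters 0x41-0x5A. The function name `decode_petscii_letters` is made up for this example.

```python
SWITCH_TO_LOWER = 0x0E  # control byte: select the lowercase/uppercase set
SWITCH_TO_UPPER = 0x8E  # control byte: select the uppercase/graphics set


def decode_petscii_letters(data: bytes) -> str:
    """Decode a small PETSCII subset while tracking the shift state."""
    shifted = False  # assumption: start in the uppercase/graphics set
    out = []
    for b in data:
        if b == SWITCH_TO_LOWER:
            shifted = True
        elif b == SWITCH_TO_UPPER:
            shifted = False
        elif 0x20 <= b <= 0x3F:
            # Space, digits, and basic punctuation match ASCII here.
            out.append(chr(b))
        elif 0x41 <= b <= 0x5A:
            # The same byte is a different letter case depending on state.
            ch = chr(b)
            out.append(ch.lower() if shifted else ch)
        else:
            # Everything else is outside this sketch's coverage.
            out.append("\ufffd")
    return "".join(out)
```

The point is that the same byte decodes differently depending on what came before it, so the decoder has to carry state between calls rather than map each byte in isolation.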

Some that might not be possible within the framework of this library:

Early Cangjie input method translation: this is more a system of input that is then converted to characters than a character set in itself. It also seems to have a (potentially?) unbounded set of inputs that can produce an equally wild number of outputs, making the encode_one/decode_one limitations potentially useless? Needs more research.

marzojr commented 1 year ago

For what it's worth, the Unicode Consortium has published conversion tables for many of those encodings; conversion to Unicode from these encodings ends up going through lookup tables, and conversion back is likely the same for "properly normalized" Unicode.

The data can be found here: https://github.com/unicode-org/icu-data.
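
As a rough illustration of the lookup-table approach (a sketch in Python, not code from ICU or from this library), here is ISO 8859-15 built by patching the eight positions where it differs from ISO 8859-1. The names `DECODE_TABLE`, `ENCODE_TABLE`, `decode`, and `encode` are made up for this example, and normalizing to NFC before the reverse lookup is my reading of "properly normalized".

```python
import unicodedata

# ISO 8859-1 maps each byte to the code point with the same value.
DECODE_TABLE = [chr(b) for b in range(256)]

# ISO 8859-15 redefines these eight positions.
for byte, code_point in {
    0xA4: 0x20AC,  # EURO SIGN
    0xA6: 0x0160,  # LATIN CAPITAL LETTER S WITH CARON
    0xA8: 0x0161,  # LATIN SMALL LETTER S WITH CARON
    0xB4: 0x017D,  # LATIN CAPITAL LETTER Z WITH CARON
    0xB8: 0x017E,  # LATIN SMALL LETTER Z WITH CARON
    0xBC: 0x0152,  # LATIN CAPITAL LIGATURE OE
    0xBD: 0x0153,  # LATIN SMALL LIGATURE OE
    0xBE: 0x0178,  # LATIN CAPITAL LETTER Y WITH DIAERESIS
}.items():
    DECODE_TABLE[byte] = chr(code_point)

# The reverse direction is the inverted table.
ENCODE_TABLE = {ch: b for b, ch in enumerate(DECODE_TABLE)}


def decode(data: bytes) -> str:
    """Decode by indexing the table with each byte."""
    return "".join(DECODE_TABLE[b] for b in data)


def encode(text: str) -> bytes:
    """Encode via reverse lookup; raises KeyError for unmappable characters."""
    # Compose first, so "e" + COMBINING ACUTE ACCENT finds the single
    # precomposed character that the table knows about.
    text = unicodedata.normalize("NFC", text)
    return bytes(ENCODE_TABLE[ch] for ch in text)
```

Without the normalization step, a decomposed sequence such as "e" followed by U+0301 would miss the table even though the encoding has a byte for the composed character.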

ThePhD commented 1 year ago

Yeah, I've seen that!

For what it's worth, I've already started working on lookup tables for most of the single- and double-byte encodings, albeit derived from other sources rather than the ICU data.

See here: https://github.com/soasis/encoding_tables

marzojr commented 1 year ago

Oh, nice! I was going by your encoding docs, which, I guess, are out of date then.

soasis / text

Don't Cursed Open Inside #21