qntm / base32768

Binary-to-text encoding highly optimised for UTF-16
MIT License

Add extra information in the ReadMe #5

Closed DonaldTsang closed 5 years ago

DonaldTsang commented 5 years ago

This repo is very similar to https://github.com/rinick/base2e15 and https://github.com/grandchild/base32k except that 2e15 uses Unified CJK characters instead of other non-CJK characters. It would be good to have a pros-cons section.

qntm commented 5 years ago

Both Base2e15 and base32k have identical compression characteristics to Base32768, and significantly smaller lookup tables than Base32768. However, both encodings make extensive use of characters which are subject to change when NFD or NFKD normalization is applied. If a Base2e15 or base32k text is passed through a mechanism which subjects its text content to NFD or NFKD normalization, then the text will change, which for our purposes constitutes data corruption.

In the case of Base2e15, 11,172 characters from the 32,768+128-character repertoire are unsafe in this way. In the case of base32k, it's only 8,192.
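As a rough illustration of the point above (not taken from either project), the sketch below uses Python's standard `unicodedata` module to count how many code points in the Hangul Syllables block, 0xAC00-0xD7A3, change under NFD. The 11,172 figure presumably corresponds to that block, which Base2e15 uses in full; the helper name `changes_under` is hypothetical.

```python
import unicodedata

# Hangul Syllables block; assumed here to be the source of the 11,172 figure.
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3


def changes_under(form: str, code_point: int) -> bool:
    """Return True if normalizing the single character alters it."""
    ch = chr(code_point)
    return unicodedata.normalize(form, ch) != ch


# Count the syllables that NFD rewrites into separate jamo.
unsafe_nfd = sum(
    changes_under("NFD", cp) for cp in range(HANGUL_FIRST, HANGUL_LAST + 1)
)
print(unsafe_nfd)  # 11172

# A single precomposed syllable becomes two code points after NFD.
decomposed = unicodedata.normalize("NFD", "\uAC00")
print(len(decomposed))  # 2
```

Any encoded text containing such a character would come out of an NFD pass with different code points, which is the corruption described above.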

I will consider adding this information to the README for this project.

DonaldTsang commented 5 years ago

@qntm Firstly, what are NFD and NFKD? Secondly, what about the goal of "using only CJKV and Hangul characters that render on Android"? Is that too much to ask for?

qntm commented 5 years ago

NFD and NFKD are Unicode normalization forms. The characters we use must be immune to normalization in order to be considered safe for use for the purposes of encoding binary.
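A minimal sketch of what "immune to normalization" could mean in practice, using Python's `unicodedata`. Checking all four forms (NFC, NFD, NFKC, NFKD) is an assumption made here for completeness; the comment above names only NFD and NFKD, and `is_normalization_stable` is a hypothetical helper, not part of Base32768 or safe-code-point.

```python
import unicodedata

# Assumption: treat a character as stable only if all four forms leave it alone.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_normalization_stable(ch: str) -> bool:
    """Return True if no normalization form changes the character."""
    return all(unicodedata.normalize(form, ch) == ch for form in FORMS)


print(is_normalization_stable("\u4E00"))  # CJK ideograph: True
print(is_normalization_stable("\uAC00"))  # Hangul syllable, decomposes under NFD: False
print(is_normalization_stable("\uFF21"))  # Fullwidth 'A', folds to 'A' under NFKC: False
```

Normalization stability is only one criterion; a real safety check may also exclude other characters, for example by general category.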

The number of available safe CJKV and Hangul characters in the Basic Multilingual Plane is unlikely to be 32,768, which is what we need. To get 32,768 safe characters for Base32768, I had to pull from a wide variety of character ranges, and they were not in large contiguous blocks, but relatively small blocks of 32.

Whether or not any particular character will render on Android is completely subjective. It depends what kind of font support the user has chosen to install. However, this was not one of the design goals of Base32768 (or any of my other encodings). It is not important that the Base32768 text render and be readable, as it is not intended to be read/understood/transcribed by humans. The intent is that it can be copied and pasted easily without errors.

DonaldTsang commented 5 years ago

@qntm The reason I suggested the Android rendering rule is to prevent copying errors (in case unsupported characters render as nothing). I do understand the variations in phone font rendering, though. Also, could you provide the list of safe characters in CJK Unified Ideographs, its Extension A, and Hangul Syllables? (See https://github.com/rinick/base2e15#mapping-table)

qntm commented 5 years ago

Non-existing characters typically render as an empty box, so I don't think that's too much of a problem.

You can determine a list of safe characters yourself using safe-code-point.

DonaldTsang commented 5 years ago

Given the ranges 0x3400-0x4DB5, 0x4E00-0x9FEA and 0xAC00-0xD7A3, how would you create a standard set for the base? Perhaps 0x3400-0x4BFF, 0x5000-0x9BFF and 0xAC00-0xD3FF? That would still leave a surplus of 3,072 code points.
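The arithmetic for the three proposed ranges can be checked with a short sketch, which also counts how many of those code points survive NFD and NFKD unchanged. This tests normalization stability only, using Python's `unicodedata`; it is an assumption that this approximates "safe", and it does not reproduce whatever additional criteria safe-code-point applies. The names `proposed` and `survives_nfd_nfkd` are hypothetical.

```python
import unicodedata

# The three ranges proposed above, as inclusive (first, last) pairs.
proposed = [(0x3400, 0x4BFF), (0x5000, 0x9BFF), (0xAC00, 0xD3FF)]

sizes = [last - first + 1 for first, last in proposed]
total = sum(sizes)
surplus = total - 32768
print(sizes, total, surplus)  # [6144, 19456, 10240] 35840 3072


def survives_nfd_nfkd(code_point: int) -> bool:
    """Return True if neither NFD nor NFKD changes the character."""
    ch = chr(code_point)
    return all(unicodedata.normalize(f, ch) == ch for f in ("NFD", "NFKD"))


stable_count = sum(
    survives_nfd_nfkd(cp)
    for first, last in proposed
    for cp in range(first, last + 1)
)
# Enough code points on paper, but the Hangul range drops out under NFD.
print(stable_count, stable_count >= 32768)
```

Because every precomposed Hangul syllable decomposes under NFD, the third range contributes nothing to the stable count, so the 3,072 surplus is not enough to cover the loss.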

qntm commented 5 years ago

@DonaldTsang Yeah, but not enough of them are safe, so it isn't possible.

This discussion no longer seems to be about the Base32768 encoding or about improving the base32768 module. So I'm closing this issue.

DonaldTsang commented 5 years ago

@qntm I still think it is about that; the question is just whether it is possible to build base32768 within the CJK and Hangul ranges for "square" consistency. It is hard to use safe-code-point as a Python developer (when there is no npm or even a plain <script> library for it).