unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Character names #1397

Open js-choi opened 2 years ago

js-choi commented 2 years ago

Hey, I just spoke with @sffc at an Ecma TC39 meeting, and he referred me to here.

ICU4X will eventually need to support lookup of Unicode character names (the Name property of code points, as well as character name aliases and named character sequences).

I’ve been working on trying to compress the collection of Unicode character names into a succinct data structure, occupying as little space as possible (my goal is 100–200 kB) while maintaining efficient bidirectional random access. To that end, in my spare time I’ve been working on an (unfinished) article describing the data structure, as well as an (unfinished) reference JavaScript implementation.

I don’t anticipate my work being done until several months from now at the earliest. But @sffc tells me that work on character names in ICU4X hasn’t started yet either. When ICU4X does reach the point when character names start to get implemented, I’d love to help out however I can. In the meantime, I will continue to work on the JavaScript library before porting it to Rust.

sffc commented 2 years ago

CC @Manishearth @iainireland

First we need to decide if we want unames in ICU4X, and if so, we should consider the work of @js-choi. Putting on the backlog for now.

js-choi commented 2 years ago

Note that the article and the proof of concept are still unfinished (even though they are public). I can try to prioritize finishing them, once this topic gets close to leaving the backlog. Just let me know here when that happens.

But, even though they are still unfinished, it is already clear that there is a lot of low-hanging fruit to compress all Unicode names, while maintaining fast random bidirectional lookup, using succinct or near-succinct data structures.

markusicu commented 2 years ago

a) Getting Unicode character names can be useful for outputting debug info or data files.

b) Unicode character names as input I find not very useful. For example, UTS 18 suggests \N{character name} for non-printable characters, but the name is generally so long that it would make the regex harder to read than using a simple code point escape, and a comment nearby, outside the regex pattern string. Same for use in other types of source code.
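For concreteness (an illustration, not part of the original comment): Python string literals support the same `\N{...}` named-escape idea that UTS 18 suggests for regexes, which shows the length tradeoff being described:

```python
# Python string literals support the \N{...} named-escape form.
# The named form is self-describing but long; the code-point escape
# is short but opaque without a nearby comment.
named = "\N{ZERO WIDTH NO-BREAK SPACE}"   # 29 characters of escape...
escaped = "\ufeff"                        # ...vs. 6, for the same character
assert named == escaped
```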

c) Character names data is annoyingly large. It sounds like you have better ideas for how to compress them than I had, but it's still a lot of data for marginal usefulness.

d) You can look at what I cooked up for ICU in 1999: https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops/namespropsbuilder.cpp contains the builder code, and the file starts with a brief description of the data format. The data for Unicode 15 is 289kB. (It used to look much better in Unicode 3.0 days...) Code point to name is pretty fast, but the other way is a slow search.

js-choi commented 2 years ago

@markusicu: Cheers and thanks for the feedback; I’m a big fan of your work in general.

I definitely agree that the data cost of names is annoyingly large. I, too, would not suggest seriously considering their inclusion in ICU before it is demonstrated that they may be compressed into a succinct data structure (efficient in both time and space). I’m still working on this.

With regards to the usefulness of character→name vs. name→character, my own experience comes from implementing many parsers whose grammars need to treat specific characters in special ways. (These parsers only sometimes use regular-expression syntax; oftentimes these are functional combinator parsers or the like, created from composed expression trees.) I have always found it irksome to embed code points directly in my code—I find them to essentially be magic numbers that hinder readability.

It’s certainly true that some names are long. (The shortest name, “Ox”, is two letters long, but the longest name, “Box Drawings Light Diagonal Upper Centre to Middle Right and Middle Left to Lower Centre”, is 88 characters long.) But the average length of names (when pooling the 35,205 explicitly defined names from the current UCD data files) is 25.78 characters (e.g., “Domino Tile Vertical-06-06”). This isn’t that bad. Good code often emphasizes self-descriptive variable/function names, and a 26-character name wouldn’t be too out of place in such code.
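(An aside for readers: figures like these can be spot-checked with Python’s stdlib `unicodedata` module, which exposes the Name property. Note that `unicodedata` also reports algorithmically derived names, such as Hangul syllables and CJK unified ideographs, so the computed average will not exactly reproduce the 25.78 figure for the 35,205 explicitly defined names.)

```python
import unicodedata

# Spot checks against Python's unicodedata module (results depend on
# the Unicode version shipped with the interpreter).
assert unicodedata.name("\U0001F402") == "OX"  # the two-letter name "Ox"

# Average length over all names unicodedata exposes. This pool includes
# algorithmically derived names (Hangul syllables, CJK ideographs), so
# it differs from the average over explicitly defined names only.
lengths = []
for cp in range(0x110000):
    n = unicodedata.name(chr(cp), "")
    if n:
        lengths.append(len(n))
avg = sum(lengths) / len(lengths)
```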

It’s also true that comments can alleviate code points’ magic-number inscrutability, but this is ironically at the cost of increased wordiness (now you have a code-point hex and a longish name) and locality (the explanation of the code point is removed from its original location in the source).

(My ulterior motive is eventually to make a proposal to TC39 to add named character literals to the JavaScript language. But I would not dare to do so until I can allay the engine vendors’ concerns of storage, memory, and retrieval time—hence this project.)

Funnily enough, I (and some other TC39 members I’ve spoken to) may be confident about the usefulness of name→character lookup…but we’ve been much less certain about the usefulness of character→name retrieval. The best use case that I could imagine for the latter was metaprogramming (e.g., JavaScript transpilers and code minifiers). But your point about debugging general text data for human display is well taken.

Your namespropsbuilder.cpp from 1999 is fascinating. The fact that its token-based compression was already able to get down to 289 kB for Unicode 15 is impressive. I think there’s a lot of other low-hanging fruit to squeeze it much more (while improving name→character time performance). I don’t expect my demo to be done soon, but I’ll follow up when I make some significant progress. Cheers!

chris-ha458 commented 11 months ago

Has any decision been made regarding whether to include this functionality in icu4x?

For context, it is currently provided by an old version of unic and by the third-party crate unicode_names2.

To illustrate why this is useful: this functionality is used in crates such as charset-normalizer-rs to query certain properties that are not exposed by the current character property database.

Manishearth commented 11 months ago

I think we want this, but someone has to have time to implement it

chris-ha458 commented 11 months ago

I think we want this, but someone has to have time to implement it

By someone, you mean from within the organisation, right? For instance, I would be able to implement something that provides the same benefit, but likely not following the same behaviors regarding data providers and no_std. Ongoing maintenance would be an issue as well.

On the other hand, if the relevant project requirements could be communicated, and if necessary relaxed at least initially, I might at least try to contribute an early version.

Manishearth commented 11 months ago

Yes, we would accept something externally contributed and can provide feedback on how to make it happen.

The tricky thing to figure out is what data structure to use here: @sffc do we have any thoughts on what we want to use? Are there ICU4C data structures we plan to wrap, or should we be using a varzerovec or something?

js-choi commented 11 months ago

A few years ago, I wrote an unfinished article exploring compressing the set of Unicode names with efficient bidirectional random access, using succinct data structures – such as IBiSRP+DAC-L-TT (see Brisaboa et al. (2019)). There’s also been a lot of other recent research on succinct data structures that should be of interest.

I sadly haven't had the bandwidth to finish the article and its JavaScript proof of concept, but I can definitely see a path to a compact data structure with efficient bidirectional random access. The progress I was making suggested that the data structure could approach the information-theoretic lower bound, somewhere around 110 kB. (I was already able to reduce the Unicode names, with efficient random access, down to 640 kB, and I had many more optimizations left when I ran out of time. I hope to revisit this question sometime, because it was a fun exercise.)

sffc commented 11 months ago

I'd be fine landing this as two standalone mappers, one for each direction. Name-to-character should be implementable using ZeroTrie (similar to BytesTrie in ICU4C but we benchmarked it to be a bit more efficient for ASCII-only data). Character-to-name is going to be a bit harder and larger but maybe we don't even need that for the first iteration.
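(For illustration only, and not ZeroTrie's actual API or serialized format: the name-to-character direction has the shape of a byte-keyed trie from ASCII names to code points. A dict-of-dicts stands in for the zero-copy structure here.)

```python
# A toy byte-keyed trie standing in for ZeroTrie in the name->character
# direction. ZeroTrie itself is a zero-copy serialized Rust structure;
# this only mirrors the lookup shape, not the storage format.
def trie_insert(root, name, code_point):
    node = root
    for byte in name.encode("ascii"):
        node = node.setdefault(byte, {})
    node[None] = code_point  # terminal marker carries the value

def trie_lookup(root, name):
    node = root
    for byte in name.encode("ascii"):
        node = node.get(byte)
        if node is None:
            return None
    return node.get(None)

root = {}
trie_insert(root, "LATIN SMALL LETTER A", 0x0061)
trie_insert(root, "OX", 0x1F402)
assert trie_lookup(root, "OX") == 0x1F402
```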

Regarding your exploration of bidirectional data structures for this type of thing: this sounds like something that would be useful for ICU4X, so long as the metrics are where we want them to be. A constraint I've hit up against while trying to implement things like this is that we want case-insensitive lookup in one direction but want to get case-normalized results in the other direction. Is this a constraint you've considered?

sffc commented 11 months ago

I would be able to implement something that provides the same benefit, but likely not following the same behaviors regarding data providers and no_std. Ongoing maintenance would be an issue as well.

On the other hand, if the relevant project requirements could be communicated, and if necessary relaxed at least initially, I might at least try to contribute an early version.

Wearing my hat as ICU4X tech lead, I would rather make sure we land a proper ICU4X-conventional solution than an interim solution that needs work. We have a lot of experimental code in ICU4X and it creates a maintenance burden and sits for a long time until someone invests the time to bring it up to spec. Since the Rust ecosystem seems to already have this functionality in another crate, I'd rather wait to land it in ICU4X until we can land it properly.

However, I want to emphasize that the ICU4X contribution process is about the cleanest it's ever been. I've seen contributors come up to speed in a matter of hours. We have tutorials and lots of guardrails in place. I think you may be overestimating the effort that would be required to implement this feature "the ICU4X way".

js-choi commented 11 months ago

[@sffc] Regarding your exploration of bidirectional data structures for this type of thing: this sounds like something that would be useful for ICU4X, so long as the metrics are where we want them to be. A constraint I've hit up against while trying to implement things like this is that we want case-insensitive lookup in one direction but want to get case-normalized results in the other direction. Is this a constraint you've considered?

Thanks for the reply. Answering this question: I didn’t run into any hard obstacle when it came to case sensitivity of names. This was my approach:

The UAX44-LM2 fuzzy folding was implemented in this JavaScript module and it is tested in this test suite.

I believe Python’s unicodedata standard module (and the unicode_names2 crate) have a similar approach to case insensitivity. (They do not use the full UAX44-LM2 fuzzy-matching rules.)
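(A minimal sketch of that kind of folding, for readers following along. It simplifies UAX44-LM2, whose full rules have more nuance; the one standard exception is the medial hyphen in U+1180 HANGUL JUNGSEONG O-E, which must be kept so the name does not collide with U+116C HANGUL JUNGSEONG OE. `ABC- DEF` below is a made-up name showing a non-medial hyphen surviving.)

```python
def fold_name(name):
    """Simplified UAX44-LM2-style loose-name folding: uppercase, drop
    spaces and underscores, and drop "medial" hyphens (a hyphen directly
    between two alphanumeric characters). The one standard exception,
    U+1180 HANGUL JUNGSEONG O-E, keeps its hyphen."""
    s = name.upper()
    if s == "HANGUL JUNGSEONG O-E":
        return "HANGULJUNGSEONGO-E"
    out = []
    for i, ch in enumerate(s):
        if ch in " _":
            continue
        if (ch == "-" and 0 < i < len(s) - 1
                and s[i - 1].isalnum() and s[i + 1].isalnum()):
            continue  # medial hyphen: ignored during loose matching
        out.append(ch)
    return "".join(out)

assert fold_name("zero width no-break space") == "ZEROWIDTHNOBREAKSPACE"
assert fold_name("ABC- DEF") == "ABC-DEF"  # hypothetical, non-medial hyphen
```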

sffc commented 11 months ago

Ok, if you store the strings together then it makes sense that you can do "on-the-fly" case folding of the names.

I don't know how to make it work, though, with a trie-like data structure with one node per character. For example, if "BOOKS" and "BOOK_STORE" are both in the trie, and we get the query "BOOKSTORE", then after we get to the K node we don't know whether to go to the S node or the _ node.

js-choi commented 11 months ago

Just to clarify: by “case-insensitive lookup” and “case folding”, are you referring strictly to ignoring the case of letters in names? Or are you more broadly referring to UAX44-LM2’s loose-matching rules, which also ignore whitespace, underscores, and medial hyphens? I ask because your BOOKS/BOOK_STORE example is about ignoring underscores during name→character lookup, which seems to indicate that you’re talking about difficulty with UAX44-LM2’s rules, rather than just case insensitivity. My apologies if I’m misunderstanding!

In any case, neither case insensitivity nor UAX44-LM2 loose matching should be a barrier to using ZeroTrie for fuzzy/loose name→character lookup: ZeroTrie only needs to store the “folded” versions of the names.

As you point out earlier, ZeroTrie alone is insufficient for character→name retrieval. To support character→name retrieval, we would have to either (1) combine ZeroTrie with another data structure that can efficiently reconstruct the original names (e.g., a mapping from characters to paths through the trie’s nodes, along with data about any underscores/spaces/hyphens discarded from the trie’s names) or (2) replace ZeroTrie with a bidirectional string dictionary like Brisaboa et al.’s IBiS.
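(A sketch of option (1), for concreteness. Plain dicts stand in for ZeroTrie and for a compressed name store, and the folding is simplified rather than full UAX44-LM2.)

```python
# Option (1) in miniature: a folded-name -> code point table for loose
# name->character lookup, plus a separate code point -> original-name
# table for character->name retrieval.
def fold(name):
    # Simplified folding: uppercase; drop spaces, underscores, hyphens.
    return "".join(c for c in name.upper() if c not in " _-")

names = {0x1F402: "OX", 0xFEFF: "ZERO WIDTH NO-BREAK SPACE"}
by_folded = {fold(n): cp for cp, n in names.items()}

def lookup(query):
    """Loose name -> code point (or None)."""
    return by_folded.get(fold(query))

def name_of(cp):
    """Code point -> canonical name (or None)."""
    return names.get(cp)

assert lookup("zero_width no-break space") == 0xFEFF
assert name_of(0xFEFF) == "ZERO WIDTH NO-BREAK SPACE"
```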

**An aside about trie-based compressed string dictionaries vs. IBiS**

Many interesting new succinct bidirectional string dictionaries use compressed tries: e.g., [Grossi and Ottaviano’s path-decomposed tries (PDTs)](https://dl.acm.org/doi/abs/10.1145/3357384.3357972) and [Boffa et al.’s CoCo-tries](https://link.springer.com/chapter/10.1007/978-3-031-20643-6_17). However, all of these trie-based approaches are limited by how tries must reconstruct strings from the same data that they use to look up strings’ values. As far as I know, they cannot support fuzzy/loose matching with UAX44-LM2.

Brisaboa et al.’s “IBiS” compressed string dictionaries do not use tries, and because of this they do support fuzzy/loose matching with UAX44-LM2: they can somewhat decouple how they look up strings from how they actually store them. An IBiS dictionary can store names’ substrings so that they are binary searchable by their folded versions. This is valid as long as each entire name’s folded version is identical to the concatenation of its folded substrings.

The original names’ substrings would be what the IBiS dictionary actually stores: e.g., `BOOKS-U` and `BOOK_STORE` would respectively be decomposed into (`BOOK`, `S-U`) and (`BOOK`, `_STORE`), and the IBiS dictionary would store a binary search tree with nodes for `BOOK`, `_STORE`, and `S-U`, in that order. That order is the lexicographic ordering of the substrings’ *folded* versions per UAX44-LM2 (e.g., `S-U` folds to `SU`, `_STORE` folds to `STORE`, and `SU` sorts *after* `STORE`). The substrings in folded-version order together form the binary search tree. In this tree, looking up `bOoKsU` is equivalent to `BOOKS-U`, and `BoOkS-TORE` is equivalent to `BOOK_STORE`; from the same tree, `BOOKS-U` and `BOOK_STORE` can be reconstructed in their original forms. (The tree’s substrings are also compressed using tree-based front compression and a Re-Pair grammar.)

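
(A tiny sketch of that folded-order idea. This is not Brisaboa et al.’s actual structure; a plain sorted list stands in for the compressed binary search tree, and the folding is simplified rather than full UAX44-LM2.)

```python
import bisect

# Store original substrings, but keep them sorted by their *folded*
# form, so that a binary search on a folded query still finds the
# original spelling. A sorted list stands in for the search tree.
def fold(s):
    # Simplified folding: uppercase; drop spaces, underscores, hyphens.
    return "".join(c for c in s.upper() if c not in " _-")

substrings = ["BOOK", "_STORE", "S-U"]
substrings.sort(key=fold)            # folded order: BOOK, STORE, SU
folded_keys = [fold(s) for s in substrings]

def find_original(query):
    """Binary-search by folded key; return the original substring."""
    key = fold(query)
    i = bisect.bisect_left(folded_keys, key)
    if i < len(folded_keys) and folded_keys[i] == key:
        return substrings[i]
    return None

assert find_original("store") == "_STORE"  # original spelling recovered
assert find_original("s-u") == "S-U"
```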
sffc commented 11 months ago

Thanks for the detailed response!

I was mainly thinking about another application of this in time zone names. "Foo/Bar" and "Foo/BAZ" are both valid entries in the ZeroTrie. If I store them case-folded, I can't retrieve the original case-canonical form, and if I store them as case-canonical, I can't do case-insensitive lookup since "B-a" and "B-A" go to different nodes in the trie.

I will read more about IBiS; thanks for the pointer!