Open clarfonthey opened 6 years ago
Many systems, especially regex engines, allow escaping Unicode characters in the form of `\N{dot operator}` or `\u{dot operator}`. IIRC, there's already a GH issue filed for Rust for this.
In the meantime, I guess we can add this to `unic-ucd-name` as an optional feature.
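For anyone wondering what `\N{...}` expansion could look like at runtime, here is a minimal sketch. The three-entry `NAMES` table and the `char_from_name` and `expand_named_escapes` functions are hypothetical and are not part of `unic-ucd-name` or any existing crate. A real implementation would generate the table from the Unicode Character Database.

```rust
/// Hypothetical lookup table with a few names, sorted by name for binary search.
/// A real table would be generated from the Unicode Character Database.
const NAMES: &[(&str, char)] = &[
    ("DOT OPERATOR", '\u{22C5}'),
    ("GREEK SMALL LETTER ALPHA", '\u{03B1}'),
    ("SNOWMAN", '\u{2603}'),
];

/// Look up a character by name, ignoring ASCII case.
fn char_from_name(name: &str) -> Option<char> {
    let upper = name.to_ascii_uppercase();
    NAMES
        .binary_search_by(|(n, _)| (*n).cmp(upper.as_str()))
        .ok()
        .map(|i| NAMES[i].1)
}

/// Replace every `\N{name}` in `input` with the named character.
/// An unknown name or an unterminated escape is reported as an error.
fn expand_named_escapes(input: &str) -> Result<String, String> {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find("\\N{") {
        // Copy everything before the escape unchanged.
        out.push_str(&rest[..start]);
        let after = &rest[start + 3..];
        let end = after
            .find('}')
            .ok_or_else(|| "unterminated \\N{ escape".to_string())?;
        let name = &after[..end];
        let c = char_from_name(name)
            .ok_or_else(|| format!("unknown character name: {}", name))?;
        out.push(c);
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

Calling `expand_named_escapes("a \\N{dot operator} b")` yields the string with U+22C5 substituted, while a misspelled name comes back as an `Err` instead of silently producing the wrong character.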
unicode_names provides this today.
As part of this Reddit thread, the author of said crate mentioned its compression algorithm. Given #199 and the already shaky compile times for unic/ucd/name, I've been considering basically bringing in unicode_names's table structure, and thus the bidirectionality. I think the reverse-direction data can be optional. Once that exists, it would be trivially possible to create a proc-macro that uses that crate.
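To make the bidirectional idea concrete, here is a rough sketch of one record table with a separate reverse index. It does not reproduce the compression scheme used by `unicode_names`. The `BY_CHAR` and `BY_NAME` tables and both function names are hypothetical and only illustrate how the reverse direction could be kept as separable data.

```rust
/// Hypothetical miniature of a generated table: records sorted by code point.
const BY_CHAR: &[(char, &str)] = &[
    ('\u{03B1}', "GREEK SMALL LETTER ALPHA"),
    ('\u{22C5}', "DOT OPERATOR"),
    ('\u{2603}', "SNOWMAN"),
];

/// Reverse index: positions into BY_CHAR, ordered by name.
/// This is the part that could sit behind an optional Cargo feature.
const BY_NAME: &[u16] = &[1, 0, 2];

/// Forward direction: char -> name, via binary search on the code point.
fn name_of(c: char) -> Option<&'static str> {
    BY_CHAR
        .binary_search_by_key(&c, |&(ch, _)| ch)
        .ok()
        .map(|i| BY_CHAR[i].1)
}

/// Reverse direction: name -> char, via binary search through the index.
fn char_named(name: &str) -> Option<char> {
    BY_NAME
        .binary_search_by(|&i| BY_CHAR[i as usize].1.cmp(name))
        .ok()
        .map(|pos| BY_CHAR[BY_NAME[pos] as usize].0)
}
```

The reverse index costs only one small integer per record, so the names themselves are stored once and dropping `BY_NAME` removes the name->char direction without touching the forward table.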
I'm assigning this to myself, as a proper resolution to the lurking horror that is #199 will make this simple to add.
To add some context for others following this issue (see this Reddit thread for the impetus):
One possible way to provide the name->char mapping is with finite state transducers. The TL;DR is that an FST is a finite state machine that compactly represents sets of byte strings, and can also represent a simple `bytes -> u64` mapping. For all Unicode names (even including generated Hangul/Ideograph names and all aliases), the total size on disk and in memory is 230KB. An FST is searchable in its compact form.
Currently, the only required dependency of `fst` is `byteorder` (it by default requires the `memmap` crate, but that can be disabled). It also currently relies on the standard library, but if there's interest, it should definitely be possible to provide a `no_std` mode (maybe a new `fst-core` crate, not sure) that can read FSTs but cannot write them.
A key downside of FST is that its compaction comes with slower access times when compared to, say, a standard trie. I suspect the difference isn't too large though, and there's probably room for more optimizations.
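For comparison, here is a minimal sketch of the "standard trie" end of that trade-off: a byte trie used as a `bytes -> u64` map. This is not the `fst` crate and not a real FST. It shares only key prefixes, not suffixes, and does no compaction. The `State` and `ByteTrie` types are hypothetical names for illustration.

```rust
use std::collections::BTreeMap;

/// One state of the automaton: outgoing byte transitions, plus an output
/// value if some key ends at this state.
#[derive(Default)]
struct State {
    next: BTreeMap<u8, usize>,
    value: Option<u64>,
}

/// A byte trie mapping keys to u64 values. Unlike a real FST it shares only
/// prefixes, not suffixes, so it is much less compact.
struct ByteTrie {
    states: Vec<State>,
}

impl ByteTrie {
    fn new() -> Self {
        // State 0 is the root.
        ByteTrie { states: vec![State::default()] }
    }

    fn insert(&mut self, key: &[u8], value: u64) {
        let mut cur = 0;
        for &b in key {
            let existing = self.states[cur].next.get(&b).copied();
            cur = match existing {
                Some(s) => s,
                None => {
                    // Allocate a fresh state and link it from the current one.
                    let s = self.states.len();
                    self.states.push(State::default());
                    self.states[cur].next.insert(b, s);
                    s
                }
            };
        }
        self.states[cur].value = Some(value);
    }

    fn get(&self, key: &[u8]) -> Option<u64> {
        let mut cur = 0;
        for b in key {
            // A missing transition means the key is not in the map.
            cur = *self.states[cur].next.get(b)?;
        }
        self.states[cur].value
    }
}
```

Inserting `b"DOT OPERATOR"` with value `0x22C5` and `b"DOT MINUS"` with `0x2238` stores the shared `DOT ` prefix once. An FST would additionally merge common suffixes across names, which is where most of its size advantage comes from.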
any word on this?
Essentially, something like:
I've found that in some cases where I've wanted to use lesser-known Unicode characters, I've resorted to just looking up the escape codes and including them directly in my source. This is error-prone, and it'd be a lot easier to catch a name that doesn't match than an escape code that doesn't.