open-i18n / rust-unic

UNIC: Unicode and Internationalization Crates for Rust
https://crates.io/crates/unic

[unic-ucd-name] Macro for getting Unicode characters by name #210

Open clarfonthey opened 6 years ago

clarfonthey commented 6 years ago

Essentially, something like:

const MULTIPLY_DOT: char = named_char!("dot operator");

When I've wanted to use lesser-known Unicode characters, I've often resorted to looking up their escape codes and including those directly in my source. This is error-prone: a misspelled name is much easier to catch than a mistyped escape code.
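To make the request concrete, here is a minimal sketch of how such a macro could behave. It is not an existing unic API: the three-entry `NAMES` table, the `lookup` and `eq_ignore_case` helpers, and this `macro_rules!` version of `named_char!` are all hypothetical, and a real implementation would need the full Unicode name data (and might well be a proc-macro instead).

```rust
// Hypothetical sample table; a real crate would cover every Unicode name.
const NAMES: &[(&str, char)] = &[
    ("DOT OPERATOR", '\u{22C5}'),
    ("MULTIPLICATION SIGN", '\u{00D7}'),
    ("BULLET", '\u{2022}'),
];

// ASCII case-insensitive string comparison that works in const context.
const fn eq_ignore_case(a: &str, b: &str) -> bool {
    let a = a.as_bytes();
    let b = b.as_bytes();
    if a.len() != b.len() {
        return false;
    }
    let mut i = 0;
    while i < a.len() {
        if a[i].to_ascii_uppercase() != b[i].to_ascii_uppercase() {
            return false;
        }
        i += 1;
    }
    true
}

// Linear scan over the sample table, evaluated at compile time when
// called from a const item.
const fn lookup(name: &str) -> Option<char> {
    let mut i = 0;
    while i < NAMES.len() {
        if eq_ignore_case(NAMES[i].0, name) {
            return Some(NAMES[i].1);
        }
        i += 1;
    }
    None
}

// Hypothetical macro: resolves the name in a const, so an unknown name
// becomes a compile error instead of a wrong character at runtime.
macro_rules! named_char {
    ($name:literal) => {{
        const C: char = match lookup($name) {
            Some(c) => c,
            None => panic!("unknown Unicode character name"),
        };
        C
    }};
}

const MULTIPLY_DOT: char = named_char!("dot operator");
```

With this sketch, a typo such as `named_char!("dot operater")` fails during constant evaluation, which is the property the issue asks for; a mistyped escape like `'\u{22C6}'` would compile silently.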

behnam commented 6 years ago

Many systems, especially RegEx engines, allow escaping Unicode characters by name, in the form \N{dot operator} or \u{dot operator}. IIRC, there's already a GH issue filed for Rust for this.

In the meantime, I guess we can add this to unic-ucd-name, as an optional feature.

CAD97 commented 6 years ago

unicode_names provides this today.

As part of this Reddit thread, the author of said crate mentioned its compression algorithm. Given #199 and the already shaky compile times for unic/ucd/name, I've been considering basically bringing in unicode_names's table structure, and thus the bidirectionality. I think the reverse-direction data can be optional. Once that exists, it would be trivially possible to create a proc-macro that uses that crate.
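As a rough illustration of what "bidirectional with optional reverse data" could look like, the sketch below keeps one table sorted by code point and a separate index sorted by name. This is not unicode_names's actual compressed structure; `BY_CHAR`, `BY_NAME`, `name_of`, and `char_named` are hypothetical names, and the data is a three-entry sample.

```rust
// Forward table (hypothetical sample data), sorted by code point.
const BY_CHAR: &[(char, &str)] = &[
    ('\u{00D7}', "MULTIPLICATION SIGN"),
    ('\u{2022}', "BULLET"),
    ('\u{22C5}', "DOT OPERATOR"),
];

// Reverse index: positions into BY_CHAR, ordered by name
// (BULLET, DOT OPERATOR, MULTIPLICATION SIGN). This is the part that
// could be compiled in only when a reverse-lookup feature is enabled.
const BY_NAME: &[usize] = &[1, 2, 0];

// char -> name: binary search on the code point.
fn name_of(c: char) -> Option<&'static str> {
    BY_CHAR
        .binary_search_by_key(&c, |&(ch, _)| ch)
        .ok()
        .map(|i| BY_CHAR[i].1)
}

// name -> char: binary search through the name-ordered index.
// Matching here is only ASCII case-insensitive.
fn char_named(name: &str) -> Option<char> {
    let upper = name.to_ascii_uppercase();
    BY_NAME
        .binary_search_by(|&i| BY_CHAR[i].1.cmp(upper.as_str()))
        .ok()
        .map(|pos| BY_CHAR[BY_NAME[pos]].0)
}
```

The reverse index adds only one integer per entry because it reuses the name strings from the forward table, which is why making it optional costs little in the forward-only configuration.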

I'm assigning this to myself, as a proper resolution to the lurking horror that is #199 will make this simple to add.

BurntSushi commented 6 years ago

Here is some additional context for others following this issue. (See this reddit thread for the impetus.)

One possible way to provide the name->char mapping is with finite state transducers. The TL;DR is that an FST is a finite state machine that compactly represents sets of byte strings, and can also represent a simple bytes -> u64 mapping. For all Unicode names (even including generated Hangul/Ideograph names and all aliases), the total size on disk and in memory is 230KB. An FST is searchable in its compact form.

Currently, the only required dependency of fst is byteorder (it by default requires the memmap crate, but that can be disabled). It also currently relies on the standard library, but if there's interest, it should definitely be possible to provide a no_std mode (maybe a new fst-core crate, not sure) that can read FSTs but cannot write them.

A key downside of FST is that its compaction comes with slower access times when compared to, say, a standard trie. I suspect the difference isn't too large though, and there's probably room for more optimizations.
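For readers unfamiliar with the shape of the problem, the sketch below shows the same bytes -> u64 interface backed by the "standard trie" used as the comparison point above. It is deliberately not an FST: it shares prefixes only, with no suffix sharing or compact encoding, and it does not use the fst crate. `NameTrie`, `name_to_char`, and `sample_trie` are hypothetical names, and the data is a three-entry sample.

```rust
use std::collections::BTreeMap;

// One trie node: child edges keyed by byte, plus an optional value
// if a key ends here.
#[derive(Default)]
struct Node {
    children: BTreeMap<u8, Node>,
    value: Option<u64>,
}

// A plain prefix-sharing trie mapping byte strings to u64 values.
#[derive(Default)]
struct NameTrie {
    root: Node,
}

impl NameTrie {
    fn insert(&mut self, key: &[u8], value: u64) {
        let mut node = &mut self.root;
        for &b in key {
            node = node.children.entry(b).or_default();
        }
        node.value = Some(value);
    }

    fn get(&self, key: &[u8]) -> Option<u64> {
        let mut node = &self.root;
        for b in key {
            node = node.children.get(b)?;
        }
        node.value
    }
}

// Interprets the stored u64 as a code point; rejects values that are
// out of range or not valid Unicode scalar values.
fn name_to_char(trie: &NameTrie, name: &str) -> Option<char> {
    let code = trie.get(name.to_ascii_uppercase().as_bytes())?;
    if code > u64::from(u32::MAX) {
        return None;
    }
    char::from_u32(code as u32)
}

// Sample data: three names sharing the "DOT " prefix.
fn sample_trie() -> NameTrie {
    let mut trie = NameTrie::default();
    trie.insert(b"DOT OPERATOR", 0x22C5);
    trie.insert(b"DOT MINUS", 0x2238);
    trie.insert(b"DOT PLUS", 0x2214);
    trie
}
```

The three sample names store the bytes of "DOT " once. An FST would additionally share common suffixes across names and pack the states into a compact byte representation, which is where the size savings and the slower access both come from.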

devsnek commented 5 years ago

any word on this?