Port BytesTrie to ICU4X

sffc commented 4 years ago

BytesTrie is a data structure in ICU4C that maps from byte strings to integer values. This data structure is fundamental to various pieces of functionality, including case mapping and language matching. We should port this data structure to ICU4X.

zbraniecki commented 3 years ago

Prior art:

Any of those may be a good starting point?

sffc commented 3 years ago

@kpozin What are your plans on this issue?

zbraniecki commented 3 years ago

Would SetTrie be a good entry point for this - https://github.com/KaiserKarel/set-trie ?

kpozin commented 3 years ago

Would SetTrie be a good entry point for this - https://github.com/KaiserKarel/set-trie ?

Unfortunately, no, if we want to maintain backward compatibility.

The main complication is that we're trying to achieve perfect compatibility with an existing data format that does not have a formal specification. The C++/Java code is the spec. The existing code relies heavily on data and method inheritance, reuses a single set of mutable builder classes for several discrete building stages, and even mutates built Tries while traversing them. As you can imagine, these design decisions make the design exceptionally Rust-unfriendly. In writing a port, I've started the whole thing over several times, with varying degrees of Rust-iness and Java-ness in my design.

Moreover, all the logic is closely coupled together in such a way that unit testing individual methods or build stages is impractical (there do not appear to be any unit tests in the legacy code); one basically has to port the entire conglomeration, both Builder and Trie, and then port the integration tests one by one to discover what's broken. This is the stage I'm at now -- I have a full port of most of the production code, am porting tests, and learning the numerous ways in my port is utterly broken.

kpozin commented 3 years ago

I'm aware of the sunk costs (I've sunk a fair portion of them :), but as I proposed at the beginning of this project, we would be much better off finding a standard, well-specified trie format and implementing that, instead of continuing down this road.

markusicu commented 3 years ago

@sffc :

This data structure is fundamental to various pieces of functionality, including case mapping and language matching. We should port this data structure to ICU4X.

Not quite. In ICU, case mapping, normalization, collation, and character properties use “code point tries” which are heavily optimized for lookup by code point, as well as by walking through UTF-16 and UTF-8 code units. Between ca. 2001 and 2018 I came up with three versions of this. I moved some ICU code to the third version, and turned that on its own into public API. I would love to convert code that still uses older versions to the latest one, given time.

Design doc: http://site.icu-project.org/design/struct/utrie

By contrast, BytesTrie and UCharsTrie are designed for lookups from arbitrarily long sequences of bytes (8-bit units) vs. UChars (16-bit units). These are more-classic retrieval trees. You could use them for lookup from single code points if you turn those into sequences of 8-bit or 16-bit units, but these “string tries” are not especially optimized for that. We use them for BreakIterator dictionaries, in collation for contraction matching (after the first code point), and for matching Unicode character property (and value) names.

One of the design points for the “string tries” is to get a byte-serialized or 16-bit-word-serialized sequence that requires no or only trivial byte swapping for little-endian vs. big-endian machines, for use as read-only, memory-mapped data.

If you “abused” a “string trie” for pure code point lookups, I would expect them to be a lot slower than code point tries, but they may be more compact.

Design docs:

(There is another version of tries in the ICU character conversion code, customized in various ways to deal with special issues there. Both code point tries and string tries, since some charsets map multiple Unicode code points to single charset codes.)

@kpozin :

The existing code relies heavily on data and method inheritance, reuses a single set of mutable builder classes for several discrete building stages

Yes. In principle, you could write a builder from scratch, as long as you build the data structure that the runtime understands.

By the way, I would port the Java builder, not the C++ builder. Late in development, I made a significant improvement to the Java builder that I never got around to porting back to C++. (I have an old ICU ticket for that that I keep meaning to get to.)

and even mutates built Tries while traversing them.

That is either a misunderstanding or an overstatement. The built structure is immutable. But of course while you walk the structure you need a mutable iterator, and that's what the Trie classes are. Each iterator object is tiny. It just keeps a pointer into the big array and a little bit of extra state.

Moreover, all the logic is closely coupled together in such a way that unit testing individual methods or build stages is impractical (there do not appear to be any unit tests in the legacy code); one basically has to port the entire conglomeration, both Builder and Trie, and then port the integration tests one by one to discover what's broken.

Sorry. And thanks for your bug report! I intend to fix that for ICU 69; I need to think about how to unit-test the internal function that has the bogus code. I would be happy to entertain PRs for better unit testing!

markusicu commented 3 years ago

The runtime code for each of these data structures is fairly small. It might work well for ICU4X to have offline builder code calling ICU4J or ICU4C that writes Rust code with initializer lists. We already have some C++ code that writes C and Java initializers, especially for code point tries.

sffc commented 3 years ago

Summary of conclusions form a meeting with me, @kpozin, @markusicu, @echeran, @zbraniecki, @dminor:

CodePointTrie (aka UCPTrie) is the right tool for selecting enumerated properties (one code point mapping to one of several enumerated values)
BytesTrie is useful for likely subtags, etc., but probably the wrong tool for enumerated properties
Both CodePointTrie and BytesTrie have a binary representation, which I refer to as an "ArrayBuffer"
Enumerated property selection is what's needed for segmentation
We could reinvent the wheel in ICU4X by building our own data structures, but I'm convinced that we should build on the shoulders of giants and use the same data structures as ICU4C/ICU4J
It's not a good use of our time to write the BytesTrie/CodePointTrie builder code in ICU4X, which is what @kpozin was attempting to do above

Next steps:

Continue with #448, which may be useful for regular expression performance.
Manually extract the ArrayBuffers from the ICU4C/ICU4J tests as a one-time job, and use them as input to ICU4X unit tests.
Add a tool to ICU4C that creates JSON artifacts containing ArrayBuffers for enumerated property selection; the artifacts should be versioned and shipped alongside an ICU release. Assignee: undetermined

echeran commented 3 years ago

Here are the full meeting notes for yesterday's meeting.

sffc commented 3 years ago

CodePointTrie is split off into #508.

sffc commented 3 years ago

Note: Design of language matching is in #173

sffc commented 3 years ago

See https://github.com/unicode-org/icu/pull/1660 for how we're exporting CodePointTrie data from ICU4C into toml files.

sffc commented 3 years ago

Blocks https://github.com/unicode-org/icu4x/issues/842

sffc commented 3 years ago

@fabura is going to take a look at this.

sffc commented 3 years ago

Take a look at:

CodePointTrie, which is the closest thing we have in ICU4X to what you would be building: https://github.com/unicode-org/icu4x/blob/main/experimental/codepointtrie/src/codepointtrie.rs
BytesTrie documentation in ICU4C: https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1BytesTrie.html
BytesTrie impls:
- Code data structure: https://github.com/unicode-org/icu/blob/main/icu4c/source/common/bytestrie.cpp
- Builder: https://github.com/unicode-org/icu/blob/main/icu4c/source/common/bytestriebuilder.cpp
- Iterator: https://github.com/unicode-org/icu/blob/main/icu4c/source/common/bytestrieiterator.cpp

Consider writing code under experimental/bytestrie. Consider copying boilerplate (Cargo.toml, LICENSE, src/lib.rs) from codepointtrie.

makotokato commented 3 years ago

I have tiny version of ICU4C/ICU4J compatible bytetrie/uchartrie (https://github.com/makotokato/dictionary_segmenter/blob/main/src/bytes_trie.rs and https://github.com/makotokato/dictionary_segmenter/blob/main/src/uchars_trie.rs) since I need this to use dictionary segmenter of ICU4C/ICU4J.

dminor commented 2 years ago

Makoto will need this for the segmenter. We should find an owner. It should be straightforward based upon the Char16Trie implementation and Makoto's implementation mentioned above.

sffc commented 2 years ago

Makoto will need this for the segmenter.

Could you clarify? My understanding is that BytesTrie and CharsTrie are largely interchangeable. If we build our own data, then we can build it as either BytesTrie or CharsTrie; either should work.

dminor commented 2 years ago

He was concerned about data size if we used CharsTrie... if that's not a valid concern, then that is good to know.

markusicu commented 2 years ago

I co-developed BytesTrie and CharsTrie at the same time. In principle, they work almost the same.

BytesTrie works with input sequences of bytes (u8), and is itself stored as a sequence of bytes.
CharsTrie works with input sequences of 16-bit words (u16), and is itself stored as a sequence of 16-bit words. They both output the same 4-way result enum, and an optional 32-bit value.

BytesTrie is more natural when the input is already a sequence of bytes, or is naturally encoded as such. In ICU, I use it for Unicode property aliases and for Thai/Khmer/Burmese dictionaries; the latter are small scripts and use a custom mapping from Unicode code points to single bytes.

CharsTrie is more natural when the input units are larger. In ICU, I use it for CJK dictionaries and collation contraction lookup, encoding the input as UTF-16 where that's not already the case.

The storage of these "string tries" uses variable-width encodings of result values and internal offsets. The bigger the values and offsets, the more storage units need to be stored and decoded. My hunch is that for small amounts of data a BytesTrie should be smaller, and for large amounts of data a CharsTrie should be faster.

Also, using a BytesTrie for large input units means you need to convert your inputs to multiple bytes (more bytes than the number of 16-bit words you would need), which adds some encoding cost and requires a larger number of lookups.

I have not done detailed benchmarking of one vs. the other for a variety of types and volumes of data.

sffc commented 2 years ago

My preference is to start with CharsTrie, since we already have it implemented, and since I understand it to be the simpler of the two. We should use BytesTrie only if we've tried using CharsTrie and found it to be too large or slow for the use case.

sffc commented 1 year ago

See https://github.com/unicode-org/icu4x/pull/1155 for an early attempt at this.

unicode-org / icu4x

Port BytesTrie to ICU4X #131