Generate both small and fast tries as Cargo feature alternatives in baked data

Currently, the baked data we publish on crates.io uses the small type for all tries. For an app that depends on icu_normalizer with compiled_data, opting into the fast trie type for the NFD and NFKD tries is excessively complicated.

databake should generate the data for both trie types so that each data key that maps to a data struct containing a trie gets a Cargo feature for upgrading it to the fast mode.

#[cfg(not(feature = "fast_canonical_decomposition"))]
pub const SINGLETON_CANONICAL_DECOMPOSITION_DATA_V2_MARKER: /* small trie data */ ;

#[cfg(feature = "fast_canonical_decomposition")]
pub const SINGLETON_CANONICAL_DECOMPOSITION_DATA_V2_MARKER: /* fast trie data */ ;

This way apps could opt into larger but faster data while still using data published on crates.io.

Granular features make sense, because it's quite plausible that an app would want to opt for fast tries for NFD and NFKD while keeping a small trie for UTS 46. It's also plausible to want to do this on a per-property basis.

Alternative approach

ICU4X developers making the call of which tries should always be in the fast mode and which ones should always be in the small mode. Benefits: Not having to branch on trie type at run time and users of the library not needing expertise to make the choice.

Working group discussion:

@sffc We've looked at the problem space. What is the overall direction we want to go?
@echeran What does ICU4C do if you call the wrong getter? GIGO?
@hsivonen Not sure. I'm not sure what explains the performance difference between ICU4C and the current ICU4X impl. We can eliminate the culprit being the non-alignment that is handled by ZeroVec, so my next guess is it's the memory safety checks in the trie.
@Manishearth I don't think we should bake both in the way shown in the issue. If we don't want trietype to be a datagen configurable thing, then they should be different data markers.
@sffc Two use cases to optimize. Normalizer - you want the smallest code, and the fastest performance. It's not clear what achieves the fastest performance, but it's okay to compromise on code size for that, even up to 100 kb to get that. Making users who want fast performance use the fast trie is fine. The slow tries aren't that slow. It's not as noticable decrease in performance as it is between using dictionaries vs. LSTM when doing segmentation.
@sffc I'd like to see the results of experimentation to see what code changes produce what performance changes, and then see how to design APIs accordingly. I think it's more productive to discuss what the best APIs to support that would be.
@hsivonen I don't think an app would want small or fast for all components. Maybe NFK would want Fast but UTS46 would want Small.
@sffc Another thing that we could do is picking the trie type for a particular use case ahead of time and not allowing that to be configurable with any sort of toggle switch. @hsivonen's work to include this in Spidermonkey seems like the best way to find out the real world use cases.
@Manishearth I'm okay with doing that.
@sffc For the purposes of v2.0, not much sounds actionable for the Normalizer side (unless @hsivonen has a lot of time).
@hsivonen After v2.0, we can add it and discourage people from using it until it is ready.
@Manishearth Yes, we can make the doc hidden.
@hsivonen And then we can figure out a way to make some tries fast and some be small.

Overall conclusion:

Datagen could produce different trie types by data marker, either by user choice or by automatic / recommended selection
CodePointTrie can export semi-internal functions for normalizer
The branch between fast/slow can go at the top of normalize_str
This can all be done after 2.0

LGTM: @sffc @Manishearth @hsivonen

unicode-org / icu4x

Generate both small and fast tries as Cargo feature alternatives in baked data #5821

Alternative approach