unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Generate both small and fast tries as Cargo feature alternatives in baked data #5821

Open hsivonen opened 1 week ago

hsivonen commented 1 week ago

Currently, the baked data we publish on crates.io uses the small type for all tries. For an app that depends on icu_normalizer with compiled_data, opting into the fast trie type for the NFD and NFKD tries is excessively complicated.

databake should generate the data for both trie types so that each data key that maps to a data struct containing a trie gets a Cargo feature for upgrading it to the fast mode.

#[cfg(not(feature = "fast_canonical_decomposition"))]
pub const SINGLETON_CANONICAL_DECOMPOSITION_DATA_V2_MARKER: /* small trie data */ ;

#[cfg(feature = "fast_canonical_decomposition")]
pub const SINGLETON_CANONICAL_DECOMPOSITION_DATA_V2_MARKER: /* fast trie data */ ;

This way apps could opt into larger but faster data while still using data published on crates.io.
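A minimal, compilable sketch of the same pattern, assuming a hypothetical `BakedTrie` struct and `TrieType` enum in place of the real baked data types (the placeholder arrays are not real trie data). Both definitions share one name, so downstream code does not change when the feature is toggled:

```rust
/// Hypothetical stand-in for the two code point trie flavors.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrieType {
    Small,
    Fast,
}

/// Hypothetical stand-in for a baked data struct that contains a trie.
#[derive(Debug)]
pub struct BakedTrie {
    pub trie_type: TrieType,
    pub data: &'static [u16],
}

// Default build: only the compact variant is compiled in.
#[cfg(not(feature = "fast_canonical_decomposition"))]
pub const CANONICAL_DECOMPOSITION: BakedTrie = BakedTrie {
    trie_type: TrieType::Small,
    data: &[0, 1, 2, 3], // placeholder, not real trie data
};

// Feature enabled: the larger variant replaces it under the same name.
#[cfg(feature = "fast_canonical_decomposition")]
pub const CANONICAL_DECOMPOSITION: BakedTrie = BakedTrie {
    trie_type: TrieType::Fast,
    data: &[0, 1, 2, 3, 4, 5, 6, 7], // placeholder, not real trie data
};

/// Reports which variant this build selected.
pub fn active_trie_type() -> TrieType {
    CANONICAL_DECOMPOSITION.trie_type
}
```

Because exactly one of the two `cfg` conditions holds in any build, only one copy of the data ends up in the binary, so the unused variant adds nothing to binary size.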

Granular features make sense, because it's quite plausible that an app would want to opt for fast tries for NFD and NFKD while keeping a small trie for UTS 46. It's also plausible to want to do this on a per-property basis.

Alternative approach

ICU4X developers could instead decide which tries are always in the fast mode and which are always in the small mode. Benefits: no branch on trie type at run time, and users of the library do not need the expertise to make the choice.

sffc commented 1 week ago

Working group discussion:

Overall conclusion:

  1. Datagen could produce different trie types by data marker, either by user choice or by automatic / recommended selection
  2. CodePointTrie can export semi-internal functions for normalizer
  3. The branch between fast/small can go at the top of normalize_str
  4. This can all be done after 2.0

LGTM: @sffc @Manishearth @hsivonen
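
Point 3 of the conclusion can be illustrated with a rough sketch: branch on the trie type once on entry, then run a generic inner loop that is monomorphized per trie type, so the per-character path has no branch on trie type. Everything here is hypothetical (`Lookup`, `SmallTrie`, `FastTrie`, `AnyTrie`, `count_marked` and the toy `MARKED` ranges); it is not the CodePointTrie API or the actual normalizer code.

```rust
/// Toy data: "marked" code point ranges (inclusive). Not real Unicode data.
pub const MARKED: &[(u32, u32)] = &[(0xC0, 0xFF)];

fn in_ranges(ranges: &[(u32, u32)], cp: u32) -> bool {
    ranges.iter().any(|&(lo, hi)| lo <= cp && cp <= hi)
}

/// Hypothetical lookup interface shared by both toy trie flavors.
pub trait Lookup {
    fn get(&self, c: char) -> u32;
}

/// Compact flavor: scans the range list on every lookup.
pub struct SmallTrie {
    ranges: &'static [(u32, u32)],
}

impl SmallTrie {
    pub fn new(ranges: &'static [(u32, u32)]) -> Self {
        SmallTrie { ranges }
    }
}

impl Lookup for SmallTrie {
    fn get(&self, c: char) -> u32 {
        in_ranges(self.ranges, c as u32) as u32
    }
}

/// Larger flavor: direct table for code points below 256, range scan above.
pub struct FastTrie {
    low: [bool; 256],
    ranges: &'static [(u32, u32)],
}

impl FastTrie {
    pub fn new(ranges: &'static [(u32, u32)]) -> Self {
        let mut low = [false; 256];
        for cp in 0..256u32 {
            low[cp as usize] = in_ranges(ranges, cp);
        }
        FastTrie { low, ranges }
    }
}

impl Lookup for FastTrie {
    fn get(&self, c: char) -> u32 {
        let cp = c as u32;
        let hit = if cp < 256 {
            self.low[cp as usize]
        } else {
            in_ranges(self.ranges, cp)
        };
        hit as u32
    }
}

/// Trie type known only at run time.
pub enum AnyTrie {
    Small(SmallTrie),
    Fast(FastTrie),
}

/// Entry point: the only branch on trie type happens here, once per call.
pub fn count_marked(text: &str, trie: &AnyTrie) -> usize {
    match trie {
        AnyTrie::Small(t) => count_marked_inner(text, t),
        AnyTrie::Fast(t) => count_marked_inner(text, t),
    }
}

/// Generic inner loop, compiled separately for each trie type.
fn count_marked_inner<T: Lookup>(text: &str, trie: &T) -> usize {
    text.chars().filter(|&c| trie.get(c) != 0).count()
}
```

Calling `count_marked("d\u{e9}j\u{e0} vu", &AnyTrie::Fast(FastTrie::new(MARKED)))` and the same call with `AnyTrie::Small(SmallTrie::new(MARKED))` should give the same count; only the lookup cost differs.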