unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Reduce ICU4X's dependence on ICU4C data #4602

Open robertbastian opened 4 months ago

robertbastian commented 4 months ago

It would be nice to cut out the middleman and construct as much data as possible directly from "the source". The icuexportdata we currently use contains:

I think it's desirable for ICU4X to be as independent of ICU4C as possible, in order to identify and upstream any custom ICU4C behaviour.

sffc commented 4 months ago

Some background here:

The UCD is heavily pre-processed in the ICU4C data build into a form known as ppucd. A decision was made early on that it was less work, and more maintainable, to keep the UCD pre-processing code in one place, which is why we initially created icuexportdata. We then leveraged this machinery to capture similar pre-processing that happens for collation/normalization/casemap data.

One potential advantage to leveraging ICU4C for these larger property blobs is that it paves the way for us to potentially share data files for some of these structures between C and X.

So while I'm not opposed to heading in this direction, whoever takes this issue should research exactly the nature of the machinery we're using in ICU4C, study the impact on cross-compatible data files, and create more bite-sized milestones.

robertbastian commented 4 months ago

ppucd seems to be exactly the kind of code that should not be in ICU4X's data pipeline. It's an assortment of Python scripts that are tightly coupled to ICU4C.

Other Rust users are already reading the UCD, so it can't be that hard?
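For a sense of what "reading the UCD" involves: most UCD files share a simple semicolon-delimited line format. Below is a minimal sketch, assuming the common `code point or range ; value # comment` layout used by files such as DerivedCombiningClass.txt; a real datagen would need per-file field handling, and the file name here is just for illustration.

```rust
use std::fs;

/// Parses the common UCD line format: "XXXX[..YYYY] ; value # comment".
fn parse_ucd_lines(text: &str) -> Vec<(u32, u32, String)> {
    let mut entries = Vec::new();
    for line in text.lines() {
        // Strip trailing comments and skip blank lines.
        let line = line.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        let mut fields = line.split(';').map(str::trim);
        let range = fields.next().unwrap_or("");
        let value = fields.next().unwrap_or("").to_string();
        // Ranges are written "XXXX..YYYY"; single code points as "XXXX".
        let (start, end) = range.split_once("..").unwrap_or((range, range));
        let start = u32::from_str_radix(start, 16).expect("hex code point");
        let end = u32::from_str_radix(end, 16).expect("hex code point");
        entries.push((start, end, value));
    }
    entries
}

fn main() {
    let text = fs::read_to_string("DerivedCombiningClass.txt").expect("UCD file");
    for (start, end, value) in parse_ucd_lines(&text) {
        println!("{start:04X}..{end:04X} => {value}");
    }
}
```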

One potential advantage to leveraging ICU4C for these larger property blobs is that it paves the way for us to potentially share data files for some of these structures between C and X.

I don't see what this has to do with runtime representation. Neither the current text files in icuexportdata nor the UCD text files are a runtime format.

robertbastian commented 4 months ago

Can you confirm whether collation data is CLDR-derived?

Manishearth commented 4 months ago

I would be in favor of this in the long run. I'm not sure how much work it is or whether it's worth it.

hsivonen commented 4 months ago

Can you confirm whether collation data is CLDR-derived?

The root collation is built separately from the tailorings. The root is built from DUCET with LDML root refinements applied. The tool that builds it is genuca, which has its own special Bazel build instead of using the usual ICU4C build system. The root is built in four configurations, along two axes: ICU4C vs. ICU4X, and unihan vs. implicithan.

Once the root has been built, genrb can build the tailorings from CLDR data (and also write the root as TOML). These are built in an ICU4X-specific mode that omits the canonical closure and Latin fast path data.

Of the types of data mentioned in this issue, building the collation data without ICU4C would be by far the largest effort.

The second-largest effort, building the UTS 46 data into the form of a special normalization, would be much, much simpler, but still complicated.
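For readers unfamiliar with UTS 46: the reason it can be treated as "a special normalization" is that each code point's status in the UCD's IdnaMappingTable.txt either passes the code point through, replaces it, deletes it, or flags it as an error, which has the same shape as a decomposition mapping. A hypothetical sketch, with all type and function names invented for illustration:

```rust
/// Hypothetical sketch: UTS 46 mapped/disallowed handling expressed as a
/// normalization-like mapping. Statuses come from IdnaMappingTable.txt.
enum Uts46Status {
    Valid,                   // passes through unchanged
    Mapped(&'static [char]), // replaced by other code points, like a decomposition
    Ignored,                 // mapped to the empty string
    Disallowed,              // an error rather than a mapping
}

fn apply(c: char, status: &Uts46Status, out: &mut String) -> Result<(), char> {
    match status {
        Uts46Status::Valid => out.push(c),
        Uts46Status::Mapped(repl) => out.extend(repl.iter()),
        Uts46Status::Ignored => {}
        Uts46Status::Disallowed => return Err(c),
    }
    Ok(())
}

fn main() {
    // Fullwidth "Ａ" (U+FF21) is Mapped to "a" under UTS 46.
    let mut out = String::new();
    apply('Ａ', &Uts46Status::Mapped(&['a']), &mut out).unwrap();
    assert_eq!(out, "a");
}
```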

sffc commented 3 months ago

Discuss with:

Optional:

markusicu commented 3 months ago

Quick notes

Manishearth commented 3 months ago

Thanks.

For normalization especially I would somewhat prefer to rely on ppucd or directly on the UCD. The current situation is extremely suboptimal: the normalization properties are exported as part of icuexportdata, by ICU4C-using C++ code that is not particularly easy to understand. The group of people that needs to debug that code (ICU4X) is not the group of people that can easily understand it (ICU4C devs), and I've already had to spend a bunch of time fixing segfaults and other issues in it.

Still, I'm not convinced that the code will be of equal complexity if maintained by us in ICU4X datagen: the ICU4C code is able to invoke its normalizer, whereas we would not be able to invoke our own normalizer and may have to do some manual work here. I'm hoping we can keep the overall complexity the same (it has to exist somewhere), but I'm not clear enough on everything that code does to be sure.

Collation is messier and I'm less sure if we should try to reduce that ICU4C dependency yet.

sffc commented 3 months ago

Is there an opportunity to use a codepointtrie_builder-style approach, where we take the core algorithms from C++ and call them from ICU4X via WASM or FFI?
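For context, the codepointtrie_builder approach referenced here wraps ICU4C's mutable-trie code behind a small Rust API (executed via WASM by default, or native FFI behind a feature flag). A sketch following the icu_codepointtrie_builder crate's documented usage; worth double-checking against the current docs:

```rust
use icu::collections::codepointtrie::TrieType;
use icu_codepointtrie_builder::{CodePointTrieBuilder, CodePointTrieBuilderData};

fn main() {
    // Per-code-point values for U+0000..=U+0003; everything else gets the
    // default value. The heavy lifting (ICU4C's umutablecptrie) runs
    // inside the builder, invisible to the caller.
    let values_by_code_point = &[3u32, 4, 5, 6];

    let trie = CodePointTrieBuilder {
        data: CodePointTrieBuilderData::ValuesByCodePoint(values_by_code_point),
        default_value: 1,
        error_value: 2,
        trie_type: TrieType::Small,
    }
    .build();

    assert_eq!(trie.get32(0), 3);
    assert_eq!(trie.get32(4), 1); // falls back to the default value
}
```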

Manishearth commented 3 months ago

I don't think so, because the ICU4C normalizer will rely on ICU4C normalizer data.

(and that's the main "core algorithm" of consequence)

hsivonen commented 3 months ago

ICU4C-using C++ code that is not particularly easy to understand.

This arises from getting the data into the form that the ICU4X normalizer expects and, potentially, from a certain lack of polish. It doesn't arise from C++ or ICU4C.

The group of people that needs to debug that code (ICU4X) is not the group of people that can easily understand it (ICU4C devs)

I wrote that code, so while we may have a truck number problem, I don't think that analysis of who debugs and who understands accurately describes the situation.

That code needs 3 things from ICU4C:

  1. The code point trie builder.
  2. A normalizer that already works (recursive canonical and compatibility decompositions, non-recursive canonical decomposition, canonical composition); see the sketch below.
  3. A normalizer that does UTS 46 mapped/disallowed handling as a normalization.

The third item would take the most effort to replicate from UCD data files without ICU4C.
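To make item 2 concrete (the sketch referenced above), here are those operations as exposed by ICU4X's own runtime normalizer, assuming the icu crate's compiled-data constructors. The catch discussed in this thread is that datagen cannot call this API while producing the very data it loads, and the internal non-recursive canonical decomposition has no direct public equivalent:

```rust
use icu::normalizer::{ComposingNormalizer, DecomposingNormalizer};

fn main() {
    // Recursive canonical decomposition (NFD).
    let nfd = DecomposingNormalizer::new_nfd();
    assert_eq!(nfd.normalize("\u{00E9}"), "e\u{0301}"); // é -> e + combining acute

    // Recursive compatibility decomposition (NFKD).
    let nfkd = DecomposingNormalizer::new_nfkd();
    assert_eq!(nfkd.normalize("\u{FB01}"), "fi"); // ﬁ ligature -> "fi"

    // Canonical composition (NFC).
    let nfc = ComposingNormalizer::new_nfc();
    assert_eq!(nfc.normalize("e\u{0301}"), "\u{00E9}");
}
```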

Overall, I think it would be ideal if ICU4X were self-hosted, but as a practical matter, I think we should put engineering effort into reaching ECMA-402 coverage of the ICU4X feature set rather than into decoupling the data pipeline from ICU4C at this time.

The current situation, where Unicode 16's introduction of characters with novel normalization behaviors delays ICU4C's data update, which in turn blocks ICU4X's data update, makes the whole thing look scarier than it is in the usual case.

Is there an opportunity to use a codepointtrie_builder-style approach, where we take the core algorithms from C++ and call them from ICU4X via WASM or FFI?

No. WASM wouldn't solve the underlying problem, which is that the ICU4C normalizer didn't anticipate the novel normalization behaviors now blocking the normalization data update.

Manishearth commented 3 months ago

This arises from getting the data into the form that the ICU4X normalizer expects and, potentially, from a certain lack of polish. It doesn't arise from C++ or ICU4C.

When I was previously fixing bugs here it did involve chasing down ICU4C APIs to understand their nuances.

Like, my experience with this code is precisely that I needed to be an ICU4C expert to fix it.

And there is very little reason for ICU4C devs to be looking at this code; it is almost always going to be ICU4X code.

The current situation, where Unicode 16's introduction of characters with novel normalization behaviors delays ICU4C's data update, which in turn blocks ICU4X's data update

The tricky thing here is not just that it blocks our update: it's that Unicode expects implementors to have trouble with this update, and having time to fix things is crucial.

The third item would take the most effort to replicate from UCD data files without ICU4C.

Good news: the third item is not necessary to fix the problem we're facing. We can continue to do UTS 46 mappings via icuexportdata but move item 2 over to datagen, since we only need the single-character recursive stuff (and some other things), which can be done directly from data. (@eggrobin helped with this observation.)
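A minimal sketch of what that "single-character recursive stuff ... done directly from data" could look like, assuming a Decomposition_Mapping table parsed out of UnicodeData.txt (hand-built excerpt below; Hangul and canonical reordering are deliberately out of scope):

```rust
use std::collections::HashMap;

/// Recursively expands a code point's canonical decomposition using only
/// the per-character Decomposition_Mapping field, with no runtime normalizer.
fn decompose_recursive(c: char, mappings: &HashMap<char, Vec<char>>, out: &mut Vec<char>) {
    match mappings.get(&c) {
        Some(parts) => {
            for &part in parts {
                decompose_recursive(part, mappings, out);
            }
        }
        None => out.push(c),
    }
}

fn main() {
    // Hand-built excerpt of UnicodeData.txt decomposition mappings:
    // U+01D5 (Ǖ) -> U+00DC U+0304, and U+00DC (Ü) -> U+0055 U+0308.
    let mut mappings = HashMap::new();
    mappings.insert('\u{01D5}', vec!['\u{00DC}', '\u{0304}']);
    mappings.insert('\u{00DC}', vec!['\u{0055}', '\u{0308}']);

    let mut out = Vec::new();
    decompose_recursive('\u{01D5}', &mappings, &mut out);
    assert_eq!(out, vec!['\u{0055}', '\u{0308}', '\u{0304}']);
}
```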

hsivonen commented 3 months ago

We don't need UTS 46 to do alpha testing on normalization, but the ICU4X normalizer data merges auxiliary tables for UTS 46 and the K normalizations, so for actual deployment, all the normalizations need to come from the same builder.

sffc commented 3 months ago

(@markusicu and @hsivonen deep dive on normalization data pipeline)

Conclusions:

The above bullet points can be actioned, but are subject to the normal ICU4X prioritization process.

LGTM: @sffc, @eggrobin, @markusicu, @Manishearth, @robertbastian

Furthermore, making ICU4X fully independent of ICU4C, and vice-versa, should be our long-term goal. Both projects should read directly from UCD or other shared sources, and those sources should ship data useful for clients.

LGTM: @robertbastian, @Manishearth, @eggrobin (@sffc, @hsivonen, @markusicu in principle)