Open robertbastian opened 4 months ago
Some background here:
The UCD is heavily pre-processed in the ICU4C data build into a form known as ppucd. A decision was made early on that it was less work and more maintainable to maintain the UCD pre-processing code in one place, which is why we initially created icuexportdata. We then leveraged this machinery to capture similar pre-processing that happens for collation/normalization/casemap data.
One potential advantage to leveraging ICU4C for these larger property blobs is that it paves the way for us to potentially share data files for some of these structures between C and X.
So while I'm not opposed to heading in this direction, whoever takes this issue should research exactly the nature of the machinery we're using in ICU4C, study the impact on cross-compatible data files, and create more bite-sized milestones.
ppucd seems to be exactly the kind of code that should not be in ICU4X's data pipeline. It's an assortment of Python scripts that are tightly coupled to ICU4C.
Other Rust users are already reading the UCD, so it can't be that hard?
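Reading the raw UCD directly really is mostly splitting semicolon-delimited fields. A minimal sketch in Rust (the sample record is taken from UnicodeData.txt; the function name and field selection are illustrative, and a real reader would also handle comments and code point ranges):

```rust
// Parse one semicolon-delimited record from UnicodeData.txt,
// extracting the code point, character name, and General_Category.
fn parse_record(line: &str) -> Option<(u32, String, String)> {
    let mut fields = line.split(';');
    let cp = u32::from_str_radix(fields.next()?, 16).ok()?;
    let name = fields.next()?.to_string();
    let gc = fields.next()?.to_string(); // General_Category, e.g. "Lu"
    Some((cp, name, gc))
}

fn main() {
    let line = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;";
    let (cp, name, gc) = parse_record(line).unwrap();
    assert_eq!(cp, 0x41);
    assert_eq!(name, "LATIN CAPITAL LETTER A");
    assert_eq!(gc, "Lu");
}
```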
> One potential advantage to leveraging ICU4C for these larger property blobs is that it paves the way for us to potentially share data files for some of these structures between C and X.

I don't see what this has to do with runtime representation. Neither the current text files in icuexportdata nor the UCD text files are a runtime format.
Can you confirm whether collation data is CLDR-derived?
I would be in favor of this in the long run. I'm not sure how much work it is and if it's worth it.
> Can you confirm whether collation data is CLDR-derived?
The root collation is built separately from the tailorings. The root is built from DUCET with LDML root refinements applied. The tool that builds it is genuca, which has its own special Bazel build instead of using the usual ICU4C build system. The root is built in four configurations, along two axes: ICU4C vs. ICU4X and unihan vs. implicithan.
Once the root has been built, genrb can build the tailorings from CLDR data (and also write the root as TOML). These are built in an ICU4X-specific mode that omits the canonical closure and Latin fast path data.
Of the types of data mentioned in this issue, building the collation data without ICU4C would be by far the largest effort.
The second-largest effort would be much simpler, but still complicated: building the UTS 46 data into the form of a special normalization.
Quick notes
Thanks.
For normalization especially I would somewhat prefer to rely on ppucd or directly on UCD. The current situation is extremely suboptimal: the normalization properties are exported as a part of icuexportdata, ICU4C-using C++ code that is not particularly easy to understand. The group of people that needs to debug that code (ICU4X) is not the group of people that can easily understand it (ICU4C devs), and I've already had to spend a bunch of time fixing segfaults and other issues in it.
Still, I'm not convinced that the code will be of equal complexity if maintained by us in ICU4X datagen: the ICU4C code is able to invoke ICU4C's normalizer, whereas we would not be able to invoke our own normalizer and may have to do some manual work here. I'm hoping we can maintain the same complexity (it has to exist somewhere), but I'm not fully clear on everything that code does to be sure.
Collation is messier and I'm less sure if we should try to reduce that ICU4C dependency yet.
Is there an opportunity to use a codepointtrie_builder-style approach, where we take the core algorithms from C++ and call them from ICU4X via WASM or FFI?
I don't think so, because the ICU4C normalizer will rely on ICU4C normalizer data.
(and that's the main "core algorithm" of consequence)
> ICU4C-using C++ code that is not particularly easy to understand.
This arises from getting the data into the form that the ICU4X normalizer expects and, potentially, from a certain lack of polish. It doesn't arise from C++ or ICU4C.
> The group of people that needs to debug that code (ICU4X) is not the group of people that can easily understand it (ICU4C devs)
I wrote that code, so while we may have a truck number problem, I don't think that analysis of who debugs and who understands accurately describes the situation.
That code needs 3 things from ICU4C:
The third item would take the most effort to replicate from UCD data files without ICU4C.
Overall, I think it would be ideal if ICU4X was self-hosted, but as a practical matter, I think we should put engineering effort into reaching ECMA-402 coverage of the ICU4X feature set instead of putting engineering effort into decoupling the data pipeline from ICU4C at this time.
The current situation, in which Unicode 16's introduction of characters with novel normalization behaviors delays ICU4C's data update, which in turn blocks ICU4X's data update, makes the whole thing look scarier than it is in the usual case.
> Is there an opportunity to use a codepointtrie_builder-style approach, where we take the core algorithms from C++ and call them from ICU4X via WASM or FFI?
No. Wasm wouldn't solve the problem that the ICU4C normalizer didn't anticipate the novel normalization behaviors that are now blocking the normalization data update.
> This arises from getting the data into the form that the ICU4X normalizer expects and, potentially, from a certain lack of polish. It doesn't arise from C++ or ICU4C.
When I was previously fixing bugs here it did involve chasing down ICU4C APIs to understand their nuances.
Like, my experience with this code is precisely that I needed to be an ICU4C expert to fix it.
And there is very little reason for ICU4C devs to be looking at this code; it is almost always going to be ICU4X code.
> The current situation with Unicode 16 introducing characters with novel normalization behaviors delaying ICU4C's data update and that blocking ICU4X's data update
The tricky thing here is not just that it blocks our update: it's that Unicode expects implementors to have trouble with this update, and having time to fix things is crucial.
> The third item would take the most effort to replicate from UCD data files without ICU4C.
Good news: the third item is not necessary to fix the problem we're facing. We can continue to do UTS 46 mappings via icuexportdata but move (2) over to datagen, since we only need the single-character recursive stuff (and some other things), which can be done directly from data. (@eggrobin helped with this observation)
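The "single-character recursive stuff" can indeed come straight from UnicodeData.txt. A minimal sketch, assuming the canonical decomposition mappings have already been loaded into a map (the two entries below are real UCD canonical decompositions; a real builder would load the whole file and also track canonical ordering):

```rust
use std::collections::HashMap;

// Recursively expand a character's canonical decomposition until no
// further decompositions apply, appending the result to `out`.
fn decompose(map: &HashMap<u32, Vec<u32>>, cp: u32, out: &mut Vec<u32>) {
    match map.get(&cp) {
        // Decomposable: recurse, since a decomposition may itself
        // contain decomposable characters.
        Some(parts) => {
            for &p in parts {
                decompose(map, p, out);
            }
        }
        // Not decomposable: the character is its own full decomposition.
        None => out.push(cp),
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert(0x01FA, vec![0x00C5, 0x0301]); // Ǻ → Å + combining acute
    map.insert(0x00C5, vec![0x0041, 0x030A]); // Å → A + combining ring
    let mut out = Vec::new();
    decompose(&map, 0x01FA, &mut out);
    assert_eq!(out, vec![0x0041, 0x030A, 0x0301]);
}
```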
We don't need UTS 46 to do alpha testing on normalization, but the ICU4X normalizer data merges auxiliary tables for UTS 46 and the K normalizations, so for actual deployment, all the normalizations need to come from the same builder.
ppucd consolidates all the properties into a single text file, and the C/C++ tools that parse the UCD actually parse the ppucd, which makes this easier. So you could definitely re-write this with reasonable effort to read the UCD proper, especially if you have a library that helps you read the data. Maybe we could collaborate on the Python script. But there are other pieces of data that are more interesting or tricky to derive or produce; maybe that applies to normalization. The most interesting thing to replace would be the builder for normalization data plus CLDR tailorings. Some properties could be reasonably straightforward. So in summary, some of these steps might be easy and others might be hard.

The @missing syntax has been regularized, and the derived files are much more pleasant to interact with than the older files. So I think writing a parser these days is easier than before, and I know that because Unicode Tools still needs to parse the old stuff. I would agree with not doing it all in one go, but there is something that seems to be blocking testing in ICU4X, which is the actual normalization side.

There is ucd_parse, which already has parsers for many of the UCD files, with types that represent a single row in the respective txt file. So there is already a crate for parsing the UCD; getting the properties into ICU4X that way wouldn't be that complicated. The four main normalization forms are not so complicated either. What the builder wants to have is an already-working normalizer that it can query: given this input character, what is its recursive canonical or compatibility composition/decomposition? So that's for the normalization.

For UTS 46... ICU4X doesn't use it much, but I'm currently working on moving Firefox to use ICU4X instead of ICU4C for UTS 46, so I would very much like us to keep around the UTS 46 data. That is generated assuming that there is the ICU4C normalizer that has UTS 46 normalization, and that normalization is queried in almost the same way as the other normalizers, which makes it harder to replicate. For Unicode 16, we could do the main normalizations, but if we want something that we use for ICU4X releases, we really should include UTS 46 as well. For example, we have tables that cross-reference data tables between the normalization forms, so we should generate them in the same place. For the collator, at this point I think it would make more sense to put effort into extending the ICU4X feature set than re-building the data builder.

(@markusicu and @hsivonen deep dive on normalization data pipeline)
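As one concrete illustration of how the regularized @missing syntax simplifies parsing, here is a hypothetical reader for the `# @missing` default-value lines in the derived files (the sample line follows the format used in DerivedNormalizationProps.txt; the function name is illustrative):

```rust
// Parse a "# @missing: <lo>..<hi>; <fields>" line, returning the code
// point range and the default property value(s) for unlisted code points.
fn parse_missing(line: &str) -> Option<(u32, u32, String)> {
    let rest = line.strip_prefix("# @missing:")?.trim();
    let mut fields = rest.split(';').map(str::trim);
    let range = fields.next()?;
    let (lo, hi) = range.split_once("..")?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    // Re-join the remaining fields (property name and/or value).
    Some((lo, hi, fields.collect::<Vec<_>>().join("; ")))
}

fn main() {
    let line = "# @missing: 0000..10FFFF; NFD_QC; Yes";
    let (lo, hi, value) = parse_missing(line).unwrap();
    assert_eq!((lo, hi), (0, 0x10FFFF));
    assert_eq!(value, "NFD_QC; Yes");
}
```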
Conclusions:
unicode-org, which is also where we could publish the ML models.

The above bullet points can be actioned, but are subject to the normal ICU4X prioritization process.
LGTM: @sffc @eggrobin @markusicu @Manishearth @robertbastian
Furthermore, making ICU4X fully independent of ICU4C, and vice-versa, should be our long-term goal. Both projects should read directly from UCD or other shared sources, and those sources should ship data useful for clients.
LGTM: @robertbastian @Manishearth, @eggrobin (@sffc, @hsivonen, @markusicu in principle)
It would be nice to cut out the middle man and construct as much data as possible directly from "the source". The icuexportdata we currently use contains:

(ucd_parse crate) and generate the data from it.

I think it's desirable for ICU4X to be as independent of ICU4C as possible, in order to identify and upstream any custom ICU4C behaviour.