Open robertbastian opened 4 months ago
Some background here:
The UCD is heavily pre-processed in the ICU4C data build into a form known as ppucd. A decision was made early on that it was less work and more maintainable to maintain the UCD pre-processing code in one place, which is why we initially created icuexportdata. We then leveraged this machinery to capture similar pre-processing that happens for collation/normalization/casemap data.
One potential advantage to leveraging ICU4C for these larger property blobs is that it paves the way for us to potentially share data files for some of these structures between C and X.
So while I'm not opposed to heading in this direction, whoever takes this issue should research exactly the nature of the machinery we're using in ICU4C, study the impact on cross-compatible data files, and create more bite-sized milestones.
ppucd seems to be exactly the kind of code that should not be in ICU4X's data pipeline. It's an assortment of Python scripts that are tightly coupled to ICU4C.
Other Rust users are already reading the UCD, so it can't be that hard?
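Reading the raw UCD directly really is mostly splitting semicolon-delimited fields. A minimal sketch in Rust (the sample record is taken from UnicodeData.txt; the function name and field selection are illustrative, and a real reader would also handle comments and code point ranges):

```rust
// Parse one semicolon-delimited record from UnicodeData.txt,
// extracting the code point, character name, and General_Category.
fn parse_record(line: &str) -> Option<(u32, String, String)> {
    let mut fields = line.split(';');
    let cp = u32::from_str_radix(fields.next()?, 16).ok()?;
    let name = fields.next()?.to_string();
    let gc = fields.next()?.to_string(); // General_Category, e.g. "Lu"
    Some((cp, name, gc))
}

fn main() {
    let line = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;";
    let (cp, name, gc) = parse_record(line).unwrap();
    assert_eq!(cp, 0x41);
    assert_eq!(name, "LATIN CAPITAL LETTER A");
    assert_eq!(gc, "Lu");
}
```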
> One potential advantage to leveraging ICU4C for these larger property blobs is that it paves the way for us to potentially share data files for some of these structures between C and X.

I don't see what this has to do with runtime representation. Neither the current text files in icuexportdata nor the UCD text files are a runtime format.
Can you confirm whether collation data is CLDR-derived?
I would be in favor of this in the long run. I'm not sure how much work it is and if it's worth it.
> Can you confirm whether collation data is CLDR-derived?
The root collation is built separately from the tailorings. The root is built from DUCET with LDML root refinements applied. The tool that builds it is genuca, which has its own special Bazel build instead of using the usual ICU4C build system. The root is built in four configurations, along two axes: ICU4C vs. ICU4X and unihan vs. implicithan.
Once the root has been built, genrb can build the tailorings from CLDR data (and also write the root as TOML). These are built in an ICU4X-specific mode that omits the canonical closure and Latin fast path data.
Of the types of data mentioned in this issue, building the collation data without ICU4C would be by far the largest effort.
The second-largest effort would be much simpler, but still complicated: building the UTS 46 data into the form of a special normalization.
Quick notes
Thanks.
For normalization especially I would somewhat prefer to rely on ppucd or directly on UCD. The current situation is extremely suboptimal: the normalization properties are exported as a part of icuexportdata, ICU4C-using C++ code that is not particularly easy to understand. The group of people that needs to debug that code (ICU4X) is not the group of people that can easily understand it (ICU4C devs), and I've already had to spend a bunch of time fixing segfaults and other issues in it.
Still, I'm not convinced that the code will be of equal complexity if maintained by us in ICU4X datagen: the ICU4C code is able to invoke ICU4C's normalizer, whereas we would not be able to invoke our own normalizer and may have to do some manual work here. I'm hoping we can maintain the same complexity (it has to exist somewhere), but I'm not fully clear on everything that code does to be sure.
Collation is messier and I'm less sure if we should try to reduce that ICU4C dependency yet.
Is there an opportunity to use a codepointtrie_builder-style approach, where we take the core algorithms from C++ and call them from ICU4X via WASM or FFI?
I don't think so, because the ICU4C normalizer will rely on ICU4C normalizer data.
(and that's the main "core algorithm" of consequence)
> ICU4C-using C++ code that is not particularly easy to understand.
This arises from getting the data into the form that the ICU4X normalizer expects and, potentially, from a certain lack of polish. It doesn't arise from C++ or ICU4C.
> The group of people that needs to debug that code (ICU4X) is not the group of people that can easily understand it (ICU4C devs)
I wrote that code, so while we may have a truck number problem, I don't think that analysis of who debugs and who understands accurately describes the situation.
That code needs 3 things from ICU4C:
The third item would take the most effort to replicate from UCD data files without ICU4C.
Overall, I think it would be ideal if ICU4X was self-hosted, but as a practical matter, I think we should put engineering effort into reaching ECMA-402 coverage of the ICU4X feature set instead of putting engineering effort into decoupling the data pipeline from ICU4C at this time.
The current situation, in which Unicode 16's introduction of characters with novel normalization behaviors delays ICU4C's data update, which in turn blocks ICU4X's data update, makes the whole thing look scarier than it is in the usual case.
> Is there an opportunity to use a codepointtrie_builder-style approach, where we take the core algorithms from C++ and call them from ICU4X via WASM or FFI?
No. Wasm wouldn't solve the problem that the ICU4C normalizer didn't anticipate the novel normalization behaviors that are now blocking the normalization data update.
> This arises from getting the data into the form that the ICU4X normalizer expects and, potentially, from a certain lack of polish. It doesn't arise from C++ or ICU4C.
When I was previously fixing bugs here it did involve chasing down ICU4C APIs to understand their nuances.
Like, my experience with this code is precisely that I needed to be an ICU4C expert to fix it.
And there is very little reason for ICU4C devs to be looking at this code; it is almost always going to be ICU4X code.
> The current situation with Unicode 16 introducing characters with novel normalization behaviors delaying ICU4C's data update and that blocking ICU4X's data update
The tricky thing here is not just that it blocks our update: it's that Unicode expects implementors to have trouble with this update, and having time to fix things is crucial.
> The third item would take the most effort to replicate from UCD data files without ICU4C.
Good news: the third item is not necessary to fix the problem we're facing. We can continue to do UTS 46 mappings via icuexportdata but move (2) over to datagen, since we only need the single-character recursive stuff (and some other things), which can be done directly from data. (@eggrobin helped with this observation)
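The "single-character recursive stuff" can indeed come straight from UnicodeData.txt. A minimal sketch, assuming the canonical decomposition mappings have already been loaded into a map (the two entries below are real UCD canonical decompositions; a real builder would load the whole file and also track canonical ordering):

```rust
use std::collections::HashMap;

// Recursively expand a character's canonical decomposition until no
// further decompositions apply, appending the result to `out`.
fn decompose(map: &HashMap<u32, Vec<u32>>, cp: u32, out: &mut Vec<u32>) {
    match map.get(&cp) {
        // Decomposable: recurse, since a decomposition may itself
        // contain decomposable characters.
        Some(parts) => {
            for &p in parts {
                decompose(map, p, out);
            }
        }
        // Not decomposable: the character is its own full decomposition.
        None => out.push(cp),
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert(0x01FA, vec![0x00C5, 0x0301]); // Ǻ → Å + combining acute
    map.insert(0x00C5, vec![0x0041, 0x030A]); // Å → A + combining ring
    let mut out = Vec::new();
    decompose(&map, 0x01FA, &mut out);
    assert_eq!(out, vec![0x0041, 0x030A, 0x0301]);
}
```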
We don't need UTS 46 to do alpha testing on normalization, but the ICU4X normalizer data merges auxiliary tables for UTS 46 and the K normalizations, so for actual deployment, all the normalizations need to come from the same builder.
ppucd consolidates all the properties into a single text file, and the C/C++ tools that parse the UCD actually parse the ppucd, which makes this easier. So you could definitely re-write this with reasonable effort to read the UCD proper, especially if you have a library that helps you read the data. Maybe we could collaborate on the Python script. But there are other pieces of data that are more interesting or tricky to derive or produce; maybe that applies to normalization. The most interesting thing to replace would be the builder for normalization data plus CLDR tailorings. Some properties could be reasonably straightforward. So in summary, some of these steps might be easy and others might be hard.

The @missing syntax has been regularized, and the derived files are much more pleasant to interact with than the older files. So I think writing a parser these days is easier than before, and I know that because Unicode Tools still needs to parse the old stuff. I would agree with not doing it all in one go, but there is something that seems to be blocking testing in ICU4X, which is the actual normalization side.

There is ucd_parse, which already has parsers for many of the UCD files, with types that represent a single row in the respective txt file. So there is already a crate for parsing the UCD; getting the properties into ICU4X that way wouldn't be that complicated. The four main normalization forms are not so complicated either. What the builder wants to have is an already-working normalizer that it can query: given this input character, what is its recursive canonical or compatibility composition/decomposition? So that's for the normalization.

For UTS 46... ICU4X doesn't use it much, but I'm currently working on moving Firefox to use ICU4X instead of ICU4C for UTS 46, so I would very much like us to keep around the UTS 46 data. That is generated assuming that there is the ICU4C normalizer that has UTS 46 normalization, and that normalization is queried in almost the same way as the other normalizers, which makes it harder to replicate. For Unicode 16, we could do the main normalizations, but if we want something that we use for ICU4X releases, we really should include UTS 46 as well. For example, we have tables that cross-reference data tables between the normalization forms, so we should generate them in the same place. For the collator, at this point I think it would make more sense to put effort into extending the ICU4X feature set than re-building the data builder.

(@markusicu and @hsivonen deep dive on normalization data pipeline)
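As one concrete illustration of how the regularized @missing syntax simplifies parsing, here is a hypothetical reader for the `# @missing` default-value lines in the derived files (the sample line follows the format used in DerivedNormalizationProps.txt; the function name is illustrative):

```rust
// Parse a "# @missing: <lo>..<hi>; <fields>" line, returning the code
// point range and the default property value(s) for unlisted code points.
fn parse_missing(line: &str) -> Option<(u32, u32, String)> {
    let rest = line.strip_prefix("# @missing:")?.trim();
    let mut fields = rest.split(';').map(str::trim);
    let range = fields.next()?;
    let (lo, hi) = range.split_once("..")?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    // Re-join the remaining fields (property name and/or value).
    Some((lo, hi, fields.collect::<Vec<_>>().join("; ")))
}

fn main() {
    let line = "# @missing: 0000..10FFFF; NFD_QC; Yes";
    let (lo, hi, value) = parse_missing(line).unwrap();
    assert_eq!((lo, hi), (0, 0x10FFFF));
    assert_eq!(value, "NFD_QC; Yes");
}
```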
Conclusions:
unicode-org, which is also where we could publish the ML models.

The above bullet points can be actioned, but are subject to the normal ICU4X prioritization process.
LGTM: @sffc @eggrobin @markusicu @Manishearth @robertbastian
Furthermore, making ICU4X fully independent of ICU4C, and vice-versa, should be our long-term goal. Both projects should read directly from UCD or other shared sources, and those sources should ship data useful for clients.
LGTM: @robertbastian @Manishearth, @eggrobin (@sffc, @hsivonen, @markusicu in principle)
It would be nice to cut out the middle man and construct as much data as possible directly from "the source". The icuexportdata we currently use contains:

(ucd_parse crate) and generate the data from it.

I think it's desirable for ICU4X to be as independent of ICU4C as possible, in order to identify and upstream any custom ICU4C behaviour.