Consider supporting three layers of collation data for search collations

hsivonen commented 1 year ago

In sorting, there are two layers of data: The root collation and, optionally, a language-specific tailoring overlay.

In search, there are logically three layers of data: the root for sorting, a search root overlaid on that, and then, optionally, a language-specific tailoring.

However, the implementation only admits two layers, so for each language that's supposed to reuse its sort tailoring for searching, we end up generating a search tailoring that contains a merge of a copy of the search root and a copy of the sort tailoring for the language. This is obviously bad for data size.

An obvious solution would be to allow three layers: root, search root, and search tailoring. However, this would make search perform worse, since the common case would fall back twice.

(An alternative that I'm considering for Firefox in the context of ICU4C for the time being is to omit the search root when a search tailoring exists and to use the corresponding sort tailoring as-is. That is, for the Latin-script languages that have special rules about which diacritics not to ignore in diacritic-insensitive search, one would lose the fuzziness for the Arabic and Thai scripts. And modern Hangul, but I don't understand the use case for the modern Hangul bits in the search root.)

markusicu commented 1 year ago

Three layers would be easy in runtime lookups.

The trick is in the tailoring builder:

The root data is constructed with gaps for tailoring. When you sort after something, you can insert one or more new weights at any level without having to move root table weights for the order-following characters and their CEs.
The root data includes a table of CEs so that the tailoring builder can find the gaps.
The tailoring builder uses these gaps and optimizes for short weights. It does not leave further gaps.
For an intermediate search tailoring, we would need to subdivide gaps and still leave gaps for further tailoring. Somehow.
- Maybe for one intermediate layer: If we need to put N new weights into a gap, we could over-allocate by a factor of F, and actually use only N evenly space weights. We would also need to output a modified CEs table for finding further gaps for third-level tailorings.

@macchiati FYI

hsivonen commented 1 year ago

The issue of gaps in the builder phase is a problem in the general case in principle but, ignoring Hangul, not a present problem on CLDR trunk, right?

That is, the search root on CLDR trunk involves Thai-like scripts, the Arabic script, the Hangul script, and one symbol (Why aren't = and ≠ primary-different in the sort root?), and the search tailorings involve the Latin script or Hangul.

So ignoring Hangul, I'd expect the way the Latin-related tailorings use gaps not to collide with how Thai-like script, the Arabic script, or the one symbol use gaps in the root.

For Hangul, this comes back to what the use case for the Hangul bits in the search root is. If the search root didn't have tailorings for Hangul, it seems that the Korean search tailorings wouldn't try to use the same gaps.

markusicu commented 1 year ago

Outside of ICU4X we usually try to make code & data work according to the algorithms, not according to what the known data looks like right now. ICU4C/J allow users to build custom tailorings at build time and at runtime. It should be possible to tailor relative to something that is tailored in the intermediate root search.

hsivonen commented 1 year ago

The data size for search tailorings is pretty bad and ICU4X doesn't allow run-time tailoring. While I agree that in principle allowing arbitrary tailoring relative to the search root would be proper, I think it's relevant to consider what can be reasonably guessed about the future direction of CLDR and how the data size could be reduced.

I infer that historically sort tailorings for Latin-script languages that analyze technically accented letters as semantically base letters started being used for search and then later the search root was added mainly to make diacritic-insensitive search also insensitive to certain Arabic marks.

It seems to me that historically, script-specific things to into the search root, and the Latin-script is the script where things are language-specific. The two exceptions are: the Hebrew script (punctuation only; now hoisted to the sort root and, therefore, no longer a concern for search tailoring) and the Hangul script. I haven't seen the minutes for the 2010-09-29 CLDR meeting where the Hangul split seems to have been decided, but I suspect that the split of modern Hangul in the root and archaic Hangul in the Korean tailoring is a size-related compromise that breaks the principle that script-level things go into the search root and that leaves the search root in a weird state where there is a size cost to the root but the part that remains in the root doesn't serve a useful use case.

Furthermore, it looks like for the Latin-script languages, with the exception of Catalan and Slovak, the search collation is a combination of the search root and one of the sort tailorings. If the run-time part allowed three levels of data, it would make sense to use the exact sort tailoring data as one of the layers without moving things within the sort root gaps.

Therefore, I think it's relevant to speculate about the future of CLDR: Can it be expected that (apart from the Hangul split that doesn't seem to follow a principle of putting script-level things in the search root and that at least to me so for seems questionable in terms of use case backing) in future CLDR

Script-level search tailoring goes into the search root (with cross-script symbols treated as a script for the purpose of this item).
There won't be script-level search tailoring for the Latin script.
Apart from searchjl, language-level search tailoring will remain a Latin-script-only phenomenon.

?

Although the past isn't a guarantee of the future, so far it seems reasonable to expect this. If this can be expected, it would be a notable data size win to have up to three run-time layers of data, to have one copy of the search root, to build search-specific tailoring data exclusive of the search root data for Catalan and Slovak, for other Latin-script languages to load a sort tailoring as the search language tailoring data layer, and to figure out something Hangul-specific (possibly changing the semantics relative to current CLDR unless use case backing for the current state is shown).

hsivonen commented 1 year ago

Also, if the data design is such that it's possible to revert a given data entry to the current scheme (two layers: root + merged search root and search tailoring), the risk of the three-layer approach reduces to going back to larger data. That is, if a future version of CLDR had a weight allocation conflict between the search root and a language-specific tailoring, then that case could be built as current-style data that merges a copy of the search root and the language-specific tailoring.

That would still allow the data saving for the cases where such conflict doesn't arise.

sffc commented 1 year ago

Seems like Priority Backlog because we would like to reduce data redundancy as much as possible.

hsivonen commented 1 year ago

The data size for search tailorings is pretty bad

From the Gecko bug: That's a 152 KB reduction for unshipping the root inheritance and a 200 KB additional reduction for unshipping Korean search. 352 KB total.

That doesn't looks like much on today's desktop, but considering what behavior delta that size buys, it's a lot.

unicode-org / icu4x

Consider supporting three layers of collation data for search collations #3178