Consider exposing sort keys

Lucretiel commented 2 years ago

It would be nice if icu_collator had exported sort key types, in addition to Collator::compare. This would allow sort keys to be computed once (via sort_by_key or similar utilities), after which they only need to be compared (which I understand to be much simpler than the computation step, and also not locale-sensitive).

markusicu commented 2 years ago

There are pros and cons.

compare() can bail out as soon as there is a difference between the two strings.
Sort keys have to be computed for the whole string and all requested levels (according to the strength), and since they tend to be stored semi-permanently, they want to be compressed to be small.
As a result, computing sort keys is relatively slow, and gets amortized only when you sort really long lists of strings.
And sort keys need to be recomputed whenever the locale or the Unicode version or the tailoring data changes.

See

Lucretiel commented 2 years ago

That does all make sense. The main use case I'm interested isn't exactly a huge list of strings as much as a frequently sorted list of strings (essentially a React style UI system where elements are semi-frequently recomputed). However, I definitely understand that it may not be worth it (especially for strings that are more commonly dissimilar and therefore usually don't need a full sort key). I'll probably start by forking icu4x and doing some private benchmarking to see if there's any benefit before I re-open the proposal.

mpalmer commented 1 year ago

Any chance this could be considered for re-opening? I've got a use case where I can't do direct comparisons of the strings, and so sort keys are the only available solution.

markusicu commented 1 year ago

I've got a use case where I can't do direct comparisons of the strings, and so sort keys are the only available solution.

Could you please say more about your use case and why you “can't do direct comparisons of the strings”?

mpalmer commented 1 year ago

tl;dr: encryption.

Deep nerd dive:

I can't do direct comparisons of the strings because they're encrypted, and the service performing the comparison doesn't have the key to decrypt the strings in order to do a direct comparison (beyond the fact that decryption-for-comparison would be somewhat of a performance buzz-kill). Thus, just using compare isn't an option.

Of course, just encrypting the string and leaving the sort key unencrypted would be a massive information leak (because the sort key is a fairly close representation of the source string), but I'm also using an encryption scheme (order-revealing encryption) that can encrypt the numeric values in the sort key and allow them to be compared without revealing directly what the numbers are, which provides the ability to order the strings without directly knowing what they are.

Using order-revealing encryption on the source strings naturally won't work, because the numeric values of the octets in the string (which is all that ORE is able to work with) don't correspond to the desired sort order. The transformation provided by generating the sort key is precisely the operation I need to turn "unicode string" into "sortable value".

I've got a proof-of-concept of all this working, using the get_sort_key function from the rust_icu bindings, but it would be a lot simpler if I could stick to native Rust code, and avoid the need for users to manage having a compatible version of libicu on hand -- doubly so if they're using pre-built binaries.

(This is all to add orderable strings to the Enquo Project, in case you're after even more of a deep-dive into what I'm up to)

sffc commented 1 year ago

Re-opening to seek additional feedback from @markusicu and @hsivonen

iantbutler01 commented 1 year ago

I'd also like to weigh in, sort keys are useful in data structures where you need a binary comparable key. I'm currently writing an adaptive radix tree implementation as an underlying index for a full text search database that needs just that to properly work on unicode strings and I'll have to use the rust_icu bindings for now which exposes the ability to generate the sort key.

markusicu commented 1 year ago

Re-opening to seek additional feedback from @markusicu and @hsivonen

I don’t have much to add to the discussion. Sort keys have a cost (extra code, processing time, storage) but some users need or want them.

We want to discourage the desire for sort keys for performance reasons alone, especially without the benefit of a performance test, but if there is real demand for use cases beyond “I have a hunch that sort keys would be faster” then we could consider them.

The comment about “an adaptive radix tree implementation” sounds like a request for something fancier than just “compute the complete sort key for this string”, as in ICU4C ucol_nextSortKeyPart().

If & when ICU4X adds APIs and code for sort keys, it would be nice for it to generate the same sort keys as ICU4C/J, at least for whole-string-to-complete-sort-key functions.

iantbutler01 commented 1 year ago

Nope I just need the ability to compute a sort key from a string. To be precise I need the functionality equivalent to ucol_getSortKey for this purpose. Not having that rules out the use of this library in favor of the C/C++ bindings for me.

mpalmer commented 1 year ago

If & when ICU4X adds APIs and code for sort keys, it would be nice for it to generate the same sort keys as ICU4C/J

Undoubtedly compatibility with ICU4C/J would be useful, as I'm sure ICU4X is generally trying to be as compatible with ICU4C/J as possible. For my purposes, though, at least, I don't require compatibility between implementations, so my vote (for whatever little it may be worth) would be that if compatibility makes implementation harder, drop compatibility.

While that would, in general, make future compatibility more difficult (re-aligning sort keys to match ICU4C/J-generated keys would break comparison with keys previously generated by ICU4X), using a separate ICU4X-specific interface for ICU4X-specific sort keys would make this approach less fraught than it would otherwise be.

hsivonen commented 1 year ago

Re-opening to seek additional feedback from @markusicu and @hsivonen

We shouldn't add sort keys at this time on performance grounds, as there is an obviously-imaginable unimplemented performance improvement that we should implement but haven't yet: Code unit-level identical prefix skipping (hopefully without new data structures): #2712 (The ICU4X collator trades away performance relative to ICU4C in the interest of size on three points: no separate data structure for identical prefix skipping, no canonical closure in data, and no separate "fast Latin" data.)
The easiest way to implement sort keys would to avoid inventing a new scheme and to instead directly port the ICU4C algorithm for creating a collation key from a sequence of collation elements. It would still be prudent not to advertise compatibility: See the next point.
We should be extremely clear that updates to ICU4X collator code and/or to the collation data invalidate previously-created and stored collation keys. (And even without collation keys, any kind of collation-based storage order has to be considered invalidated by an ICU4X/data update. That is, even without collation keys, building and storing a search tree based on the collation order is problematic. This isn't specific to ICU4X: This also applies to ICU4C and glibc, but evidence suggests that people still build stuff on top of ICU4C and glibc as if the order was frozen across updates.)
For FFI purposes, we should consider if the last byte of collation keys should be zero to have C strcmp compatibility without an additional copy even if the additional zero byte would be unidiomatic and wasteful from a strict Rust perspective.
If there are non-performance use cases (or if there remain performance use cases after implementing identical prefix skipping), I'm not opposed to a contribution to add sort keys (by porting the algorithm from ICU4C), but I don't volunteer to do the implementation myself at this time.

sffc commented 1 year ago

@markusicu @hsivonen Do you consider the use cases raised in this thread to be compelling?

@iantbutler01: "sort keys are useful in data structures where you need a binary comparable key. I'm currently writing an adaptive radix tree implementation as an underlying index for a full text search database that needs just that to properly work on unicode strings"

@mpalmer: "I can't do direct comparisons of the strings because they're encrypted, and the service performing the comparison doesn't have the key to decrypt the strings in order to do a direct comparison (beyond the fact that decryption-for-comparison would be somewhat of a performance buzz-kill). Thus, just using compare isn't an option."

sffc commented 1 year ago

I will point out that it is error-prone to use collation ordering as a key in a database. You need to be ready to rebuild your indexes whenever upgrading the Unicode/ICU4X data version and ensure that it's not possible for a client to access the database with the wrong Unicode version.

markusicu commented 1 year ago

I will point out that it is error-prone to use collation ordering as a key in a database. You need to be ready to rebuild your indexes whenever upgrading the Unicode/ICU4X data version and ensure that it's not possible for a client to access the database with the wrong Unicode version.

Yes, that's one of the costs. Some people are evidently willing to live with the costs.

iantbutler01 commented 1 year ago

I'm not sure it's intended but so far this conversation comes off a bit derisive towards me and or my use case. I discussed it with someone else working with me and they agreed, and at least from my perspective It's really not a good look.

If you have alternative suggestions to correctly handle unicode, I'm all ears. I'm happy to point out to you the spot in the paper that derived the ART where the unicode sort key and collation is mentioned directly, perhaps they were mistaken, but I don't have an alternative to go on at the moment.

For now I'm going to be using the c/c++ bindings regardless, but saw the issue and thought I would offer a perspective.

markusicu commented 1 year ago

Just to be clear, from my perspective:

We try to steer people away from sort keys if they don't really need them.
For some use cases, you do need sort keys.
I have reimplemented collation, and sort key generation, for ICU4C and ICU4J some years ago.
I don't question a full-featured collation service providing functions to generate sort keys. (There are various other ICU4C/J collation APIs that I would have liked to drop if we didn't have stable APIs.)
I don't currently have the bandwidth to volunteer working on an ICU4X port.

faassen commented 1 year ago

I can throw in another use case, I think. This is part of the XPath standard:

https://www.w3.org/TR/xpath-functions-31/#func-collation-key

See the notes, which discuss using this for the purposes of lookup.

If a "collation key" is indeed the same as a sort key, then if ICU4X is used to implement XPath, access to sort keys would be very helpful in order to fully support the spec. In fact this XPath scenario is not hypothetical -- I'm actively working on this.

Note that the standard is explicit that sort keys only exist during a single execution scope, i.e. they're invalidated each run.

hsivonen commented 1 year ago

Do you consider the use cases raised in this thread to be compelling?

Sorry about the late reply. I'm OK with taking a contribution to add sort key exposure taking the points from my previous comment and @markusicu 's previous comment into account

I'm not sure it's intended but so far this conversation comes off a bit derisive towards me and or my use case.

I'm sorry. I didn't mean to suggest anything like that—just that 1) I don't volunteer to implement this at this time, 2) for performance, there's a higher-priority issue to address, and 3) we should be extremely clear in the documentation about considerations why it might not be appropriate for an app to use sort keys.

This is part of the XPath standard

Thank you for pointing this out. I was quite concerned that I had missed a use by Firefox when I saw your comment, but fortunately browsers only support XPath 1.

I'm quite skeptical of processes that use XPath having the kind of lifetimes and numbers of comparisons that computing a sort key is justified, but whether or not exposing sort keys in XPath is a good idea, it's good to know that XPath has this dependency.

faassen commented 1 year ago

Sorry to cause you concern about XPath support in browsers!

I'm quite skeptical of processes that use XPath having the kind of lifetimes and numbers of comparisons that computing a sort key is justified, but whether or not exposing sort keys in XPath is a good idea, it's good to know that XPath has this dependency.

I think the XPath spec (the library portion) has been influenced by the capabilities of ICU4J.

The motivation for this facility is described in the "notes" section:

https://www.w3.org/TR/xpath-functions-31/#func-collation-key

and is basically to use this as a collation-dependent hashmap key. I can't judge myself how useful that is, so I'll defer to your skepticism. I'll note however that this same specification also provides the function library available to XQuery, and with XQuery the lifetimes and numbers of comparisons are likely to be much bigger.

I'm just trying to create a compliant XPath implementation and this function is one of the stumbling blocks.

sffc commented 10 months ago

Based on the above discussion, it sounds like we would accept a pull request exposing sort keys with the caveats that @hsivonen and @markusicu listed above. I have added the "help wanted" label to communicate this.

unicode-org / icu4x

Consider exposing sort keys #2689