unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.36k stars 174 forks source link

Consider exposing sort keys #2689

Open Lucretiel opened 2 years ago

Lucretiel commented 2 years ago

It would be nice if icu_collator had exported sort key types, in addition to Collator::compare. This would allow sort keys to be computed once (via sort_by_key or similar utilities), after which they only need to be compared (which I understand to be much simpler than the computation step, and also not locale-sensitive).

markusicu commented 2 years ago

There are pros and cons.

See

Lucretiel commented 2 years ago

That does all make sense. The main use case I'm interested isn't exactly a huge list of strings as much as a frequently sorted list of strings (essentially a React style UI system where elements are semi-frequently recomputed). However, I definitely understand that it may not be worth it (especially for strings that are more commonly dissimilar and therefore usually don't need a full sort key). I'll probably start by forking icu4x and doing some private benchmarking to see if there's any benefit before I re-open the proposal.

mpalmer commented 1 year ago

Any chance this could be considered for re-opening? I've got a use case where I can't do direct comparisons of the strings, and so sort keys are the only available solution.

markusicu commented 1 year ago

I've got a use case where I can't do direct comparisons of the strings, and so sort keys are the only available solution.

Could you please say more about your use case and why you “can't do direct comparisons of the strings”?

mpalmer commented 1 year ago

tl;dr: encryption.

Deep nerd dive:

I can't do direct comparisons of the strings because they're encrypted, and the service performing the comparison doesn't have the key to decrypt the strings in order to do a direct comparison (beyond the fact that decryption-for-comparison would be somewhat of a performance buzz-kill). Thus, just using compare isn't an option.

Of course, just encrypting the string and leaving the sort key unencrypted would be a massive information leak (because the sort key is a fairly close representation of the source string), but I'm also using an encryption scheme (order-revealing encryption) that can encrypt the numeric values in the sort key and allow them to be compared without revealing directly what the numbers are, which provides the ability to order the strings without directly knowing what they are.

Using order-revealing encryption on the source strings naturally won't work, because the numeric values of the octets in the string (which is all that ORE is able to work with) don't correspond to the desired sort order. The transformation provided by generating the sort key is precisely the operation I need to turn "unicode string" into "sortable value".

I've got a proof-of-concept of all this working, using the get_sort_key function from the rust_icu bindings, but it would be a lot simpler if I could stick to native Rust code, and avoid the need for users to manage having a compatible version of libicu on hand -- doubly so if they're using pre-built binaries.

(This is all to add orderable strings to the Enquo Project, in case you're after even more of a deep-dive into what I'm up to)

sffc commented 1 year ago

Re-opening to seek additional feedback from @markusicu and @hsivonen

iantbutler01 commented 1 year ago

I'd also like to weigh in, sort keys are useful in data structures where you need a binary comparable key. I'm currently writing an adaptive radix tree implementation as an underlying index for a full text search database that needs just that to properly work on unicode strings and I'll have to use the rust_icu bindings for now which exposes the ability to generate the sort key.

markusicu commented 1 year ago

Re-opening to seek additional feedback from @markusicu and @hsivonen

I don’t have much to add to the discussion. Sort keys have a cost (extra code, processing time, storage) but some users need or want them.

We want to discourage the desire for sort keys for performance reasons alone, especially without the benefit of a performance test, but if there is real demand for use cases beyond “I have a hunch that sort keys would be faster” then we could consider them.

The comment about “an adaptive radix tree implementation” sounds like a request for something fancier than just “compute the complete sort key for this string”, as in ICU4C ucol_nextSortKeyPart().

If & when ICU4X adds APIs and code for sort keys, it would be nice for it to generate the same sort keys as ICU4C/J, at least for whole-string-to-complete-sort-key functions.

iantbutler01 commented 1 year ago

Nope I just need the ability to compute a sort key from a string. To be precise I need the functionality equivalent to ucol_getSortKey for this purpose. Not having that rules out the use of this library in favor of the C/C++ bindings for me.

mpalmer commented 1 year ago

If & when ICU4X adds APIs and code for sort keys, it would be nice for it to generate the same sort keys as ICU4C/J

Undoubtedly compatibility with ICU4C/J would be useful, as I'm sure ICU4X is generally trying to be as compatible with ICU4C/J as possible. For my purposes, though, at least, I don't require compatibility between implementations, so my vote (for whatever little it may be worth) would be that if compatibility makes implementation harder, drop compatibility.

While that would, in general, make future compatibility more difficult (re-aligning sort keys to match ICU4C/J-generated keys would break comparison with keys previously generated by ICU4X), using a separate ICU4X-specific interface for ICU4X-specific sort keys would make this approach less fraught than it would otherwise be.

hsivonen commented 1 year ago

Re-opening to seek additional feedback from @markusicu and @hsivonen

sffc commented 1 year ago

@markusicu @hsivonen Do you consider the use cases raised in this thread to be compelling?

@iantbutler01: "sort keys are useful in data structures where you need a binary comparable key. I'm currently writing an adaptive radix tree implementation as an underlying index for a full text search database that needs just that to properly work on unicode strings"

@mpalmer: "I can't do direct comparisons of the strings because they're encrypted, and the service performing the comparison doesn't have the key to decrypt the strings in order to do a direct comparison (beyond the fact that decryption-for-comparison would be somewhat of a performance buzz-kill). Thus, just using compare isn't an option."

sffc commented 1 year ago

I will point out that it is error-prone to use collation ordering as a key in a database. You need to be ready to rebuild your indexes whenever upgrading the Unicode/ICU4X data version and ensure that it's not possible for a client to access the database with the wrong Unicode version.

markusicu commented 1 year ago

I will point out that it is error-prone to use collation ordering as a key in a database. You need to be ready to rebuild your indexes whenever upgrading the Unicode/ICU4X data version and ensure that it's not possible for a client to access the database with the wrong Unicode version.

Yes, that's one of the costs. Some people are evidently willing to live with the costs.

iantbutler01 commented 1 year ago

I'm not sure it's intended but so far this conversation comes off a bit derisive towards me and or my use case. I discussed it with someone else working with me and they agreed, and at least from my perspective It's really not a good look.

If you have alternative suggestions to correctly handle unicode, I'm all ears. I'm happy to point out to you the spot in the paper that derived the ART where the unicode sort key and collation is mentioned directly, perhaps they were mistaken, but I don't have an alternative to go on at the moment.

For now I'm going to be using the c/c++ bindings regardless, but saw the issue and thought I would offer a perspective.

markusicu commented 1 year ago

Just to be clear, from my perspective:

faassen commented 1 year ago

I can throw in another use case, I think. This is part of the XPath standard:

https://www.w3.org/TR/xpath-functions-31/#func-collation-key

See the notes, which discuss using this for the purposes of lookup.

If a "collation key" is indeed the same as a sort key, then if ICU4X is used to implement XPath, access to sort keys would be very helpful in order to fully support the spec. In fact this XPath scenario is not hypothetical -- I'm actively working on this.

Note that the standard is explicit that sort keys only exist during a single execution scope, i.e. they're invalidated each run.

hsivonen commented 1 year ago

Do you consider the use cases raised in this thread to be compelling?

Sorry about the late reply. I'm OK with taking a contribution to add sort key exposure taking the points from my previous comment and @markusicu 's previous comment into account

I'm not sure it's intended but so far this conversation comes off a bit derisive towards me and or my use case.

I'm sorry. I didn't mean to suggest anything like that—just that 1) I don't volunteer to implement this at this time, 2) for performance, there's a higher-priority issue to address, and 3) we should be extremely clear in the documentation about considerations why it might not be appropriate for an app to use sort keys.

This is part of the XPath standard

Thank you for pointing this out. I was quite concerned that I had missed a use by Firefox when I saw your comment, but fortunately browsers only support XPath 1.

I'm quite skeptical of processes that use XPath having the kind of lifetimes and numbers of comparisons that computing a sort key is justified, but whether or not exposing sort keys in XPath is a good idea, it's good to know that XPath has this dependency.

faassen commented 1 year ago

Sorry to cause you concern about XPath support in browsers!

I'm quite skeptical of processes that use XPath having the kind of lifetimes and numbers of comparisons that computing a sort key is justified, but whether or not exposing sort keys in XPath is a good idea, it's good to know that XPath has this dependency.

I think the XPath spec (the library portion) has been influenced by the capabilities of ICU4J.

The motivation for this facility is described in the "notes" section:

https://www.w3.org/TR/xpath-functions-31/#func-collation-key

and is basically to use this as a collation-dependent hashmap key. I can't judge myself how useful that is, so I'll defer to your skepticism. I'll note however that this same specification also provides the function library available to XQuery, and with XQuery the lifetimes and numbers of comparisons are likely to be much bigger.

I'm just trying to create a compliant XPath implementation and this function is one of the stumbling blocks.

sffc commented 10 months ago

Based on the above discussion, it sounds like we would accept a pull request exposing sort keys with the caveats that @hsivonen and @markusicu listed above. I have added the "help wanted" label to communicate this.