Almost all functions in FO that must process multiple string items, can have as a parameter only a single collation

dnovatchev commented 2 months ago

The problem:

At present the only XPath 4 function (that I am aware of) that can process multiple strings and use multiple collations (a specific collation for a specific string comparison) is fn:sort.

Some very important functions, such as fn:deep-equal and fn:compare, can accept only one collation as a parameter.

This means that when we are comparing sequences of items which contain multiple strings each of which could need to be handled in a specific collation, we are not able to provide all such collations (but are providing just a single collation) to the comparing function - fn:deep-equal or fn:compare.

The end result is that all string comparisons will be done using that single collation and may not produce the correct result (that would be produced if the particular comparison was done with the particular collation).

Possible solutions.

It is difficult to provide a solution to this problem and the list below is open ended:

1. Add a collation property to the type xs:string. Then we would specify the type as (xs:string, collation-name?)
2. Make fn:deep-equal and fn:compare accept not a single collation but a sequence of collation-names. In this case a pair of strings will be compared once for every collation that is specified. The idea is that the sequence of collations would be provided ordered by decreasing specificity. The first result that is produced at least twice in this process (something like voting) would be the result of the comparison. In case of a tie, the comparison done with the collation that is earliest (supposed to be more specific) will have higher priority.
3. Leave this as it is at present, but add to the specification a warning to the user that specifying a single collation-name may not be what they want.
4. Remove from fn:sort the multiple-collations parameter and allow only a single collation.

ndw commented 2 months ago

My understanding of fn:sort is that it provides a list of collations only so that you can specify different collations for different sort keys: sort by last-name using the EN-GB collation, then sort by employee-id using the Unicode code-point collation. It doesn't consider more than one collation for any given comparison.
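
A minimal sketch of that reading, in Python rather than XPath: each sort key gets its own comparison rule, and no single comparison ever consults more than one rule. The two key functions and the `Employee` record are stand-ins invented for illustration; they are not the fn:sort API and not real collations.

```python
from typing import NamedTuple


class Employee(NamedTuple):
    last_name: str
    employee_id: str


def case_insensitive_key(s: str) -> str:
    # Toy stand-in for a language collation such as EN-GB (assumption).
    return s.casefold()


def codepoint_key(s: str) -> str:
    # Python compares str values by Unicode code point already.
    return s


def sort_employees(staff: list[Employee]) -> list[Employee]:
    # One "collation" per sort key: last name first, then employee id.
    return sorted(
        staff,
        key=lambda e: (
            case_insensitive_key(e.last_name),
            codepoint_key(e.employee_id),
        ),
    )


staff = [
    Employee("smith", "b2"),
    Employee("Jones", "a9"),
    Employee("Smith", "B1"),
]
result = sort_employees(staff)
```

Here "smith" and "Smith" tie on the first key, and the tie is broken by code-point order on the id alone, so the two rules never compete over the same pair of values.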

I can't quickly construct a scenario where I think "compare these two items with these three collations and consider them a match if they are the same with any one of those collations" is likely to be comprehensible behavior.

I think it's sufficient to provide one collation to go with any operation that requires one. If your needs are more sophisticated than that, and perhaps sometimes they are, you'll have to do the sophisticated work by hand. I think that's consistent with other design choices that we've made.

With respect to the options you outlined, I think option 1 would not only be a complete redesign of string handling across the whole family of specifications, we also have strong evidence that experts in internationalization would be opposed to it. I don't think option 2 is viable for the reasons I outlined above. And I think option 4 is based on a misunderstanding of how multiple collations are used in fn:sort.

I think the status quo is fine, and I don't think it's necessary to add any warnings. If there is a consensus to add a warning, I hope that we can craft one carefully so that it doesn't cause more confusion than clarification. If I have some sophisticated use case where I think I need to consider multiple collations and I have a function that only accepts a single collation, I think my natural reaction is going to be "huh, I guess I have to do a little more work here" not ... something else

michaelhkay commented 2 months ago

It would be useful if you could supply (a) some use cases where this is needed, and (b) some examples of how specific functions (say index-of) would work.

rhdunn commented 2 months ago

Collations define the ordering of characters/codepoints when performing string comparisons for sorting or comparing strings. They apply to the set of strings in the same group (e.g. family names) and context (e.g. head words in a dictionary). As such, it doesn't make sense to associate a string with a collation.

Microsoft's "Collation, sorting, and string comparison" page [1] has various examples of different collations and their use cases, including:

Sorting can even change within one language depending on the context. For example, in Germany phonebook ordering is different than dictionary ordering.

So here you have a case where the same string can have a different collation. Likewise for strings that consist of Chinese, Japanese, or Korean characters.

You may also have cases where you want a case-sensitive ordering/matching and others where you want a case-insensitive ordering/matching. Those apply to the context, not to the specific strings.

As has been noted, fn:sort is designed to allow different collations on different contexts (sort keys) not strings.

[1] https://learn.microsoft.com/en-us/globalization/locale/sorting-and-string-comparison

dnovatchev commented 2 months ago

It would be useful if you could supply (a) some use cases where this is needed, and (b) some examples of how specific functions (say index-of) would work.

Here is a sketch of how to construct an example if one knows the exact collations to use (I don't):

$seq1 contains two strings - one is a German word that has two different spellings - like Gruen and Grün, and the second string is one with similar properties (has two different spellings) but in another language (say French or Spanish, or ...)

$seq2 contains the same two strings as $seq1 but with their different spellings, that are considered equal under the respective collation - German and (say French or Spanish, or ...).

We want to perform deep-equal($seq1, $seq2) and want the result to be true().

We can specify a $collation1 under which the first strings will be equal - but the second strings will not be equal.

Alternatively, we can specify a $collation2 under which the second strings will be equal - but the first strings will not be equal.

If we could specify two collations instead of one, then the first strings will be equal using $collation1 and the second strings will be equal using $collation2.

Thus, regardless of which collation we provide as parameter to deep-equal, the result of the comparison will be false(), but we want it to be true().
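
The scenario can be sketched in Python. The two "collations" here are toy normalisation functions made up for illustration (real German or French collations are not this simple), and "cœur"/"coeur" is an assumed example for the second language. The last function shows one way of doing the per-position work by hand.

```python
from typing import Callable

Collation = Callable[[str], str]  # maps a string to a comparison key


def toy_german(s: str) -> str:
    # Hypothetical rule: treat "ü" and "ue" as the same.
    return s.casefold().replace("ü", "ue")


def toy_french(s: str) -> str:
    # Hypothetical rule: treat the ligature "œ" and "oe" as the same.
    return s.casefold().replace("œ", "oe")


def deep_equal(seq1: list[str], seq2: list[str], collation: Collation) -> bool:
    # Single collation for every pair, as fn:deep-equal allows today.
    return len(seq1) == len(seq2) and all(
        collation(a) == collation(b) for a, b in zip(seq1, seq2)
    )


def deep_equal_by_position(
    seq1: list[str], seq2: list[str], collations: list[Collation]
) -> bool:
    # Hand-written alternative: one collation per position.
    return len(seq1) == len(seq2) == len(collations) and all(
        c(a) == c(b) for a, b, c in zip(seq1, seq2, collations)
    )


seq1 = ["Grün", "cœur"]
seq2 = ["Gruen", "coeur"]

with_german = deep_equal(seq1, seq2, toy_german)
with_french = deep_equal(seq1, seq2, toy_french)
by_position = deep_equal_by_position(seq1, seq2, [toy_german, toy_french])
```

With these toy rules, each single-collation call fails on one of the two pairs, while the per-position variant succeeds. Note that it only works because the caller already knows which collation belongs to which position.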

As someone wanted to search for key-names of a map using a collation, this would only be usable if all key-names are comparable under the same collation.

Imagine the names of scientists in an international organization, for a practical example. We may have problems using their names as keys of a map if we want to be able to look up a name by any of its possible spellings.

rhdunn commented 2 months ago

How do you identify whether a string contains German, French, etc. when extracting it from e.g. a text node?
How do you deal with cases where a text string contains mixed words (e.g. in a book title, wikipedia/TEI/JATS/etc. article about a person, or thing in a different language, an article/book/etc. with authors from different countries (esp. if they have married someone from another country and have e.g. a Spanish given name and a Finnish family name), etc.) [1]?
How does comparison/sorting work when the collations are different as there is then no stable order to the characters/codepoints -- see e.g. [2]: "Finnish ⟨w⟩ is generally regarded as equivalent to ⟨v⟩ (in a multilingual context it may, however, be collated separately after ⟨v⟩, as in English)."?
How do you sort the head words in e.g. an English dictionary where some of those words are foreign in origin (e.g. zeitgeist, schadenfreude, hygge, etc.) but have entered common usage?

[1] Works like those of Edgar Allan Poe contain a mix of words, phrases, etc. in different languages like Greek, French, Italian, German as well as being predominantly in English. It is also common for news articles to reference place names, people, and other words in the language of the country they are talking about -- especially in wars, earthquakes, volcanoes (like the ones in Iceland), etc.

[2] https://en.wikipedia.org/wiki/Finnish_orthography#Collation_order

rhdunn commented 2 months ago

Another issue with this is that it will make integrating with or using databases as a backend impossible. See e.g. [1]. Databases generally:

allow you to specify a single collation to a column;
allow you to optionally specify a collation on an index to e.g. allow different sorting semantics, although this is not supported by all databases;
allow you to specify a collation when sorting on a given column;
require you to specify the collation when joining columns with different collations.

Here, fn:sort is like sorting on different columns (sort keys) and providing a collation for each of those. It is not providing multiple collations for the same column (sort key, etc.).

[1] https://www.red-gate.com/simple-talk/databases/sql-server/t-sql-programming-sql-server/questions-sql-server-collations-shy-ask/

dnovatchev commented 2 months ago

How do you identify whether a string contains German, French, etc. when extracting it from e.g. a text node?

How do you deal with cases where a text string contains mixed words (e.g. in a book title, wikipedia/TEI/JATS/etc. article about a person, or thing in a different language, an article/book/etc. with authors from different countries (esp. if they have married someone from another country and have e.g. a Spanish given name and a Finnish family name), etc.) [1]?

How does comparison/sorting work when the collations are different as there is then no stable order to the characters/codepoints -- see e.g. [2]: "Finnish ⟨w⟩ is generally regarded as equivalent to ⟨v⟩ (in a multilingual context it may, however, be collated separately after ⟨v⟩, as in English)."?

How do you sort the head words in e.g. an English dictionary where some of those words are foreign in origin (e.g. zeitgeist, schadenfreude, hygge, etc.) but have entered common usage?

[1] Works like those of Edgar Allan Poe contain a mix of words, phrases, etc. in different languages like Greek, French, Italian, German as well as being predominantly in English. It is also common for news articles to reference place names, people, and other words in the language of the country they are talking about -- especially in wars, earthquakes, volcanoes (like the ones in Iceland), etc.

[2] https://en.wikipedia.org/wiki/Finnish_orthography#Collation_order

Yes Reece,

All the comments above are just a good argument why comparing strings using a single collation may not in general produce the expected result.

So either we do not specify a collation at all, or provide a set of collations and a rule, maybe such as: "Two strings are equal if there is at least one collation of the specified, under which they are equal."

dnovatchev commented 2 months ago

Another issue with this is that it will make integrating with or using databases as a backend impossible. See e.g. [1].

Yes.

And we are not talking about databases here - just about (internal, during single execution) deep-equal() results.

In a related thread @michaelhkay was considering even using collation-key as the key-name and this is also not persistable outside of the lifetime of a single execution, because different implementations (and even different versions of the same implementation) produce different collation-keys for the same string

dnovatchev commented 2 months ago

With respect to the options you outlined, I think option 1 would not only be a complete redesign of string handling across the whole family of specifications, we also have strong evidence that experts in internationalization would be opposed to it.

As far as I understood the statement of @cmsmcq, the experts are against the idea for a string to have a collation-property.

What is proposed in option 1 is not to give a specific string a collation as a permanent property, but to be able to attach a collation to a specific string just for the lifetime/scope of a particular (sub)expression.

This means that the same string can have different collations attached to it in different sub-expressions. Could be quite useful.

Of course, saying that "Gruen" has a permanent collation property of "SomeGermanCollation" would be very restricting and wrong in general, and this is not what option 1 is about.

michaelhkay commented 2 months ago

So either we do not specify a collation at all, or provide a set of collations and a rule, maybe such as: "Two strings are equal if there is at least one collation of the specified, under which they are equal."

In general that will give you a non-transitive equality operation, which will break many functions such as distinct-values(). With sorting, a collation that doesn't implement a total ordering is very likely to cause non-termination. This is territory where angels should fear to tread.

dnovatchev commented 2 months ago

So either we do not specify a collation at all, or provide a set of collations and a rule, maybe such as: "Two strings are equal if there is at least one collation of the specified, under which they are equal."

In general that will give you a non-transitive equality operation, which will break many functions such as distinct-values(). With sorting, a collation that doesn't implement a total ordering is very likely to cause non-termination. This is territory where angels should fear to tread.

OK, so we do recognize the problem, and we have not yet found a good solution.
Even this is a good step forward.

I think if we could have a union of collations, then equality would be transitive.

dnovatchev commented 2 months ago

In general that will give you a non-transitive equality operation, which will break many functions such as distinct-values().

If we define the equality operation as:

Two strings are equal if there is at least one collation in the specified set of collations, using which the two strings are equal.

Then can you provide an example of this alleged non-transitiveness?

michaelhkay commented 2 months ago

Sure. If Gruen and Grün are equal under German collation, while Gruen and Grven are equal under MedievalLatin collation, it may well be the case that there is no collation under which Grün and Grven are equal.
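
This counterexample can be checked mechanically. A minimal sketch in Python, with two toy normalisation functions standing in for the German and MedievalLatin collations (both are assumptions for illustration only), plus a naive distinct-values built on the "equal under at least one collation" rule:

```python
def toy_german(s: str) -> str:
    # Hypothetical rule: "ü" and "ue" compare equal.
    return s.casefold().replace("ü", "ue")


def toy_medieval_latin(s: str) -> str:
    # Hypothetical rule: "v" and "u" compare equal.
    return s.casefold().replace("v", "u")


COLLATIONS = [toy_german, toy_medieval_latin]


def any_equal(a: str, b: str, collations=COLLATIONS) -> bool:
    # "Equal if there is at least one collation under which they are equal."
    return any(c(a) == c(b) for c in collations)


def distinct_values(items: list[str], collations=COLLATIONS) -> list[str]:
    # Naive de-duplication: keep an item unless it equals one already kept.
    kept: list[str] = []
    for item in items:
        if not any(any_equal(item, k, collations) for k in kept):
            kept.append(item)
    return kept


first = distinct_values(["Grün", "Gruen", "Grven"])
second = distinct_values(["Gruen", "Grün", "Grven"])
```

Under these toy rules "Grün" equals "Gruen" and "Gruen" equals "Grven", but "Grün" does not equal "Grven". As a result `first` keeps two values and `second` keeps one, so the answer depends on input order.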

dnovatchev commented 2 months ago

Sure. If Gruen and Grün are equal under German collation, while Gruen and Grven are equal under MedievalLatin collation, it may well be the case that there is no collation under which Grün and Grven are equal.

Great.

This proves that string equality defined this way is not transitive.

How about we define string equality as:

"Two strings s1 and s2 are equal when compared with a set of collations COL = {c1, c2, ..., cn} if there exists a sequence of strings (sm1, sm2, ..., smk) and a sequence of collations (cm1, cm2, ..., cm(k+1)) from COL such that:

equal(s1, sm1, cm1) and equal(sm1, sm2, cm2) and ... and equal(smk, s2, cm(k+1))"

This is essentially the problem of "word ladders", which has a good and efficient graph-search based solution, and which I had the pleasure of implementing in the past in pure XSLT: "Word Ladders, or How to go from “angry” to “happy” in 20 steps".

With this definition we will have :

equal("Grün", "Gruen", "German") and equal("Gruen", "Grven", "Medieval")

thus, this means:

equal("Grün", "Grven", COL)

In other words: we define "chain-equality" as the transitive closure of "immediate equality".

We already have a function in FO for computing a transitive closure, don't we?
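
A rough sketch of chain-equality in Python, using union-find to take the transitive closure of "immediate equality". Two assumptions are made here that the definition above does not state: intermediate strings are drawn only from the finite input set, and the collations are toy normalisation functions rather than real ones.

```python
from itertools import combinations


def toy_german(s: str) -> str:
    # Hypothetical rule: "ü" and "ue" compare equal.
    return s.casefold().replace("ü", "ue")


def toy_medieval(s: str) -> str:
    # Hypothetical rule: "v" and "u" compare equal.
    return s.casefold().replace("v", "u")


def chain_equal_classes(strings, collations):
    """Group strings into classes of the transitive closure of
    'equal under at least one collation'."""
    parent = {s: s for s in strings}

    def find(s):
        # Follow parent links to the class representative,
        # shortening the path along the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(strings, 2):
        if any(c(a) == c(b) for c in collations):
            parent[find(a)] = find(b)  # merge the two classes

    classes = {}
    for s in strings:
        classes.setdefault(find(s), set()).add(s)
    return list(classes.values())


words = ["Grün", "Gruen", "Grven", "Blau"]
classes = chain_equal_classes(words, [toy_german, toy_medieval])
```

The closure is transitive by construction, but it also merges "Grün" and "Grven", which no single collation in the set considers equal. Whether that is the desired result is the open question in this thread.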

rhdunn commented 2 months ago

Equality and transitivity are given by the collation. With a single collation there is a well-defined ordering and comparison of codepoints.

The issue is when dealing with multiple collations where the ordering differs -- such as a diacritic-insensitive collation that orders diacritic versions of letters alongside the same letter (ignoring the diacritic) and e.g. a Finnish collation that defines where the Finnish diacritic variants of the letter A are placed. I.e. you have a disagreement on the placement of the diacritic A variants, so you don't know how to select which collation takes precedence. And if you did, you would break the collations, e.g. if an author said "I want this table to be in Finnish collation ordering" by specifying a default or explicit collation.

Let's say that F is a Finnish collation and DI is a diacritic-insensitive collation. If strings A and B are in F and C and D are in DI, then how do you order those strings? With multiple collations you could get different answers comparing (A, C), (A, B), (B, C), etc. such that no unique ordering can be found.
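
One way to see the problem concretely, as a sketch in Python. Everything here is an assumption made for illustration: the two key functions are toy versions of F and DI, the tags assigning strings to collations are invented, and the pairwise rule (use F when both strings are tagged F, otherwise fall back to DI) is just one hypothetical way of choosing a collation per comparison.

```python
import unicodedata
from itertools import combinations, permutations

A_UMLAUT = "\u00e4"  # precomposed "ä"


def finnish_key(s: str) -> str:
    # Toy rule: "ä" sorts after "z" ("{" is the code point after "z").
    return s.casefold().replace(A_UMLAUT, "{")


def diacritic_insensitive_key(s: str) -> str:
    # Toy rule: drop combining marks, so "ä" sorts with "a".
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical assignment of strings to collations.
TAGS = {A_UMLAUT: "F", "z": "F", "b": "DI"}


def less_than(a: str, b: str) -> bool:
    # Hypothetical rule: F if both strings are tagged F, otherwise DI.
    both_finnish = TAGS[a] == "F" and TAGS[b] == "F"
    key = finnish_key if both_finnish else diacritic_insensitive_key
    return key(a) < key(b)


def consistent_orderings(items):
    # Every arrangement in which each earlier item is less than each later one.
    return [
        p
        for p in permutations(items)
        if all(less_than(x, y) for x, y in combinations(p, 2))
    ]


orderings = consistent_orderings([A_UMLAUT, "b", "z"])
```

Under this rule "ä" < "b" and "b" < "z" (both via DI), yet "z" < "ä" (via F), so the comparisons form a cycle and no arrangement of the three strings satisfies all of them.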

dnovatchev commented 2 months ago

The issue is when dealing with multiple collations where the ordering differs

We are not talking about ordering here (yet) - just about equality.

I provided a definition of equality that is transitive over a set of collations.

rhdunn commented 2 months ago

Collations are used for sorting as well as equality, as per fn:sort, etc.

dnovatchev commented 2 months ago

Collations are used for sorting as well as equality, as per fn:sort, etc.

Yes. Here we are discussing equality with collations, and a probable improvement of functions such as deep-equal, index-of, etc.

rhdunn commented 2 months ago

But whatever approach we choose for equality needs to work with sorting/ordering. That's because collations are used for both of these and thus needs to work for both.

An implementation may use things like an index to provide the functionality.

dnovatchev commented 2 months ago

But whatever approach we choose for equality needs to work with sorting/ordering. That's because collations are used for both of these and thus needs to work for both.

An implementation may use things like an index to provide the functionality.

Yes, but we may use different approaches for equality and for sorting.

dnovatchev commented 2 months ago

But whatever approach we choose for equality needs to work with sorting/ordering. That's because collations are used for both of these and thus needs to work for both. An implementation may use things like an index to provide the functionality.

Yes, but we may use different approaches for equality and for sorting.

Actually, we don't have any problems with sorting at all.

qt4cg / qtspecs

Almost all functions in FO that must process multiple string items, can have as a parameter only a single collation #1305