Improve usability and predictability of sorting

kasei commented 5 years ago

Why?

There are several cases where the current spec does not provide a total ordering over RDF terms, and therefore causes challenges for accessing data predictably (e.g. when paging results with LIMIT+OFFSET). SPARQL 1.1 §15.1 says, in part:

SPARQL does not define a total ordering of all possible RDF terms. Here are a few examples of pairs of terms for which the relative order is undefined:

"a" and "a"@en_gb (a simple literal and a literal with a language tag)

"a"@en_gb and "b"@en_gb (two literals with language tags)

The second point here is especially interesting, as it means that it is difficult to portably work with any RDF data that heavily uses language-tagged literals.

Previous work

Many implementations already seem to produce a consistent ordering over data for which SPARQL ordering is undefined.

Proposed solution

I believe that the SPARQL spec should add text stating that ORDER BY over values with a (currently) undefined order SHOULD cause results to have consistent ordering, even if that order is not explicitly defined by SPARQL. This will allow clients to use LIMIT/OFFSET paging over such data. This might also be paired with a Service Description Feature indicating support for such consistent sorting.

Possible (partial) alternatives include:

Define an ordering of all language-tagged strings (e.g. using Unicode codepoint collation; simple, but likely to cause unexpected results in some languages)
Defining a total ordering over plain literals, language-tagged literals, and literals typed with with datatypes that are supported explicitly by the SPARQL language (a large amount of work, and also likely to cause unexpected results)
Allow systems declare (via Service Description) the specific collation used to compare terms (there is already support for collations in XPath's fn:compare on which SPARQL ordering depends, and they are identified by URIs)
Allow queries to specify a collation to be used for term comparison (larger implementation burden, but flexible and could declare via Service Description which collations are available)
Define a way to explicitly request consistent ordering specifically for paging (e.g. ORDER BY CONSISTENT ?name) without requiring any particular ordering

Considerations for backward compatibility

This is a suggestion to include SHOULD normative language about ordering data in cases where currently no ordering is defined. This should not have any effect on backwards compatibility.

kasei commented 5 years ago

As an aside, I'll note that the examples in the quoted text from §15.1 use invalid syntax for the language tags. That should probably be fixed, too.

cygri commented 5 years ago

(The invalid language tags are already mentioned on our errata page.)

cygri commented 5 years ago

Why not have the spec define a simple total ordering (e.g., alphabetically by language tag, and then by Unicode codepoint within each language tag), and state that processors MAY deviate from this ordering where doing so improves i18n?

namedgraph commented 5 years ago

Dydra provides a collation-based sorting. It uses language tags only, no extensions required. @lisp has probably much more to say about this.

dbooth-boston commented 5 years ago

. . . and state that processors MAY deviate from this ordering where doing so improves i18n?

How would such a deviation improved i18n? Can you give an example? If we can remove functional variability between implementations I would much prefer it.

cygri commented 5 years ago

@dbooth-boston Well, just saying “i18n” is a bit narrow actually. I should have said “MAY deviate where it better matches user expectations” or something like that. Examples include:

Case folding—sort “A” and “a” together
Language-specific sorting rules (collations)—e.g., in German “ü” is sorted as “ue”
Sorting @en together with @en-uk

Some processors want to go all in here and support language-specific collations and so on, but for other implementers this would be a very heavy burden. Hence my proposal of simple sort-by-language-tag followed by sort-by-lexical-form-codepoints, while allowing processors to do something fancier if they want.

dbooth-boston commented 5 years ago

Perhaps the collating order should be controlled by locale environment variables, just as it is for other standard tools such as sort.

namedgraph commented 5 years ago

Re. collations, I think as much as possible should be reused from XPath and XSLT: Sorting Using Collations

kasei commented 5 years ago

Why not have the spec define a simple total ordering (e.g., alphabetically by language tag, and then by Unicode codepoint within each language tag), and state that processors MAY deviate from this ordering where doing so improves i18n?

My concern here is twofold:

Ordering by codepoint is going to be undesirable for some languages
I think using MAY language would lead to a situation where users might unintentionally rely on the widely-supported codepoint ordering, and then be very surprised when they run across an implementation that does it another way. In the worst case, that surprise might take the form of things breaking.

kasei commented 5 years ago

Perhaps the collating order should be controlled by locale environment variables, just as it is for other standard tools such as sort.

I'm primarily concerned here with making sure that paging works predictably. Adding requirements to support collation would, as @cygri notes, be a significant burden on some implementations.

cygri commented 5 years ago

@kasei

I think using MAY language would lead to a situation where users might unintentionally rely on the widely-supported codepoint ordering, and then be very surprised when they run across an implementation that does it another way. In the worst case, that surprise might take the form of things breaking.

Another possibility would be to strictly mandate codepoint ordering. Implementations that want to offer collation support can do so by using a pseudo-function:

ORDER BY ex:collation("en_EN", ?name)

(ex:collation would not be a real function, but a marker that requests a particular collation by mildly abusing function syntax. One could think of it as being similar to DESC(?x) which looks like a function but isn't.)

dbooth-boston commented 5 years ago

@namedgraph , it looks like W3C has only defined two collation URIs: codepoint and html-ascii-case-insensitive. And the latter does not define a sorting order -- it only defines case-insensitive equality: ". . . any use of this collation for sorting strings produces implementation-defined results." Are there more standard XPath collation URIs defined elsewhere? If so, where and what?

Regarding codepoint ordering versus full collation support, I do not have a strong preference either way, as long as there is a reasonably smooth future path for collation support, since that is probably where we will eventually want to go, in order to be consistent with other standard tools such as sort.

SIDE COMMENT: Personally, I have always hated the fact that sort by default uses the locale instead of codepoint, because it means that unless I am careful to remember to explicitly set the locale in my scripts, the sort results are unpredictable from one installation to another. Grrr.

kasei commented 5 years ago

Regarding codepoint ordering versus full collation support, I do not have a strong preference either way, as long as there is a reasonably smooth future path for collation support, since that is probably where we will eventually want to go, in order to be consistent with other standard tools such as sort.

This seems sensible. I personally think attempting to address full collation is too much for 1.2 (and possibly too much of an implementation burden in general), but that we shouldn't preclude its possible inclusion in the future.

I also want to be sensitive to the fact that as an English speaker I might not be fully aware of all the ways in which enforcing codepoint ordering might be a problem for other people with other datasets (e.g. CJK data).

lisp commented 5 years ago

doing this half-way makes little sense. that is, either

provide a total order - even for "incomparables" when sorting
support language-specific collation
do not require/permit configuration

or the retitle the issue to leave out "usability" and "predictability".

lisp commented 5 years ago

This seems sensible. I personally think attempting to address full collation is too much for 1.2 (and possibly too much of an implementation burden in general), but that we shouldn't preclude its possible inclusion in the future.

we use a very thin wrapper over ucol_strcollUTF8. if a recommendation were to require wider support, that would entail more work. for that limited support, if one can use that library and tolerates that dependency, the test itself is about 20 lines of code more than treating the strings as incomparable.

kasei commented 5 years ago

or the retitle the issue to leave out "usability" and "predictability".

@lisp The primary issue I'm trying to improve here is paging, and that is currently impossible to use predictably. I believe that a small change to the language of the spec could make a big usability improvement here, by allowing users to use paging on data whose sorting is currently undefined (e.g. all language-tagged literals).

We could certainly talk about adding full collation, and that would subsume this issue, but that was not the point of this issue.

we use a very thin wrapper over ucol_strcollUTF8 ... for that limited support, if one can use that library and tolerates that dependency, the test itself is about 20 lines of code more than treating the strings as incomparable.

The implementation experience is good to know about, but ICU is a significant dependency and may not be appropriate for all implementations.

lisp commented 5 years ago

@lisp The primary issue I'm trying to improve here is paging, and that is currently impossible to use predictably. I believe that a small change to the language of the spec could make a big usability improvement here, by allowing users to use paging on data whose sorting is currently undefined (e.g. all language-tagged literals).

i would resist ratifying a limited change. were it the recommendation, an accurate sort would then either be nonconformant, which i would not recognize as an advantage, or require declaration/configuration, which would be source of its own problems.

kasei commented 5 years ago

@lisp as described, wouldn't your implementation already conform with the proposed language?

lisp commented 5 years ago

not as i understand https://www.w3.org/TR/xpath-functions-31/#codepoint-collation .

kasei commented 5 years ago

@lisp The proposal here is to add text about providing a consistent ordering, not necessarily any particular ordering. As described, I think your implementation already does that.

Codepoint collation was mentioned as a possible alternative that might address the same use case, but at a higher cost (both implementation cost and by running the risk of working against implementations that already use some other collation).

VladimirAlexiev commented 5 years ago

What would be a sensible order when numbers and dates are mixed with strings? Graphdb Workbench has a feature "click on col header to sort" and I think we put numbers before dates before strings or some such

lisp commented 5 years ago

What would be a sensible order when numbers and dates are mixed with strings?

as long as it is consistent across implementations, is one order better than another. in any case, it should cover all types which are eventually added via #93.

lisp commented 5 years ago

about providing a consistent ordering, not necessarily any particular ordering

if consistent is to be across implementations then the string element matters.

kasei commented 5 years ago

if consistent is to be across implementations then the string element matters.

My concern isn't consistency across implementations, but about consistency across requests to the same implementation.

lisp commented 5 years ago

by that criteria, the implementation could sort according to internal term identifiers without regard to the lexical forms.

kasei commented 5 years ago

by that criteria, the implementation could sort according to internal term identifiers without regard to the lexical forms.

So long as that sorting also respected the partial ordering specified by SPARQL, yes, I'd be fine with that. In reality, it's nearly impossible to assign internal IDs so that they also respect SPARQL ordering, but you can get part-way there by partitioning different term types using high bits of the ID integer.

lisp commented 5 years ago

the partial order specified by sparql follows from type only. by that standard, criteria which completely ignore the nature of the respective value domain should be a reasonable approach for sparql?

kasei commented 5 years ago

the partial order specified by sparql follows from type only. by that standard, criteria which completely ignore the nature of the respective value domain should be a reasonable approach for sparql?

No, because the value space of numeric terms has to be respected.

lisp commented 5 years ago

that is "the value space of numeric terms has to be respected", but for string terms the order with respect to other types should follow the partial order in 15.1, while that within their value domain could be according to any stable surrogate values?

kasei commented 5 years ago

If I understand you correctly, then yes, that sounds right. 15.1 must be respected which includes the partial order of term type and then some specific datatype ordering such as numerics and xsd:string (anything in the operator mapping table). Beyond that, internal identifiers could be used to provide a fallback comparable value for cases where 15.1 is silent.

VladimirAlexiev commented 5 years ago

I didn't read the above very carefully. Could someone confirm or deny that given a variety of numbers, they'll all sort in a global order? eg

"1"^^xsd:integer
"1"^^xsd:int
"1.1"^^xsd:decimal
"1.11"^^xsd:float
"2"^^xsd:int

lisp commented 5 years ago

if your question is whether the '<' operator defines the relation between

"1"^^xsd:integer "1"^^xsd:int

i do not understand 17.3 to define their order.

afs commented 5 years ago

@VladimirAlexiev That is a higher hurdle than predictability/"consistent order" from a specific endpoint.

The motivation of paging only needs the same ordering at a given endpoint, not a specific ordering nor the same ordering everywhere.

Paging maybe better done other, more complicated ways such as being part of the protocol, e.g. #49 (which notes the inefficiency of repeat requests with moving slice) or #84 (a chance to design a protocol for specifically for SPARQL).

Consistent ordering, either with or without streaming, provides the ability to page results.

So the spec issue could be as little as "Implementations SHOULD order RDF terms into the same order for all requests".

FWIW: Jena sorts into a consistent order - by value (spec required), then by term kind: "Blank nodes < URIs < Literals" and then by RDF term detail (literals by lexical form and then datatype then language tag). It gets quite arbitrary but it is the same each time.

There is no single "right" answer for all use cases once you get passed values. (e.g. a single fixed order vs collation by language can't have both at the same time).

lisp commented 5 years ago

if the sentiment is to refocus this topic to limit it to the "paging" use case, then please rewrite the description to eliminate, for example, the issue, that

difficult to portably work with any RDF data that heavily uses language-tagged literals

if the issue is limited to reproducible paging, it may not even be the case that any form of sorting is involved.

kasei commented 5 years ago

if the sentiment is to refocus this topic to limit it to the "paging" use case, then please rewrite the description to eliminate, for example, the issue, that

difficult to portably work with any RDF data that heavily uses language-tagged literals

@lisp – I included the example of language-tagged literals because it perfectly illustrates the problem with paging. If your data is not comparable w.r.t. SPARQL (as in the case of all language-tagged literals), then paging cannot be used reliably, because there are simply no guarantees that the order will be consistent.

lisp commented 5 years ago

where incomparable terms lead to order variations between repeated, identical query expressions, incomparability alone is not the cause. will the result not be consistent, if the respective combination operators are deterministic and any sort is stable?

kasei commented 5 years ago

where incomparable terms lead to order variations between repeated, identical query expressions, incomparability alone is not the cause. will the result not be consistent, if the respective combination operators are deterministic and any sort is stable?

I'm afraid I don't understand this. Can you expand or rephrase?

lisp commented 5 years ago

the issue cites

challenges for accessing data predictably (e.g. when paging results with LIMIT+OFFSET)

and describes that the incomparability of language-tagged literals "means that it is difficult to portably work with any RDF data" which uses them.

i would expect that, if the initial statement pattern match/scan operators and all combination operators are deterministic, then an unsorted query should be repeatable. if any sort operators are stable, incomparable values should not move in results and the sorted query should also be repeatable. in other words, comparability is not itself the issue.

i suggest that, if something is to be changed about sorting, it should be to make the result more useful, which includes at least to define the relation among all possible string values, if not to provide the kind of complete ordering described wrt jena, above.

kasei commented 5 years ago

i would expect that, if the initial statement pattern match/scan operators and all combination operators are deterministic, then an unsorted query should be repeatable.

Operators are not always going to be deterministic in the order they produce results. For example, changes to indexes and randomized in-memory structures are two reasons that query results might be produced in different orders.

if any sort operators are stable, incomparable values should not move in results and the sorted query should also be repeatable. in other words, comparability is not itself the issue.

SPARQL doesn't say anything about sort stability being required.

i suggest that, if something is to be changed about sorting, it should be to make the result more useful, which includes at least to define the relation among all possible string values, if not to provide the kind of complete ordering described wrt jena, above.

If by "all possible string values," you are including language-tagged strings, I would think that is too large and fraught a task for this CG. This is an issue which will be implementation/deployment specific. However, it could be proposed as a separate issue (which if accepted would subsume this one) so that it could be discussed on its own merits.

lisp commented 5 years ago

Operators are not always going to be deterministic in the order they produce results. For example, changes to indexes and randomized in-memory structures are two reasons that query results might be produced in different orders.

yes. by which assessment you are proposing to change one operator to correct an issue with others.

TallTed commented 5 years ago

The original issue was written about queries with ORDER BY clauses, which may unexpectedly produce incompletely and inconsistently ordered solution sets, even from the same instance due to the incompletely specified sorting in the original spec. Improving on this seems worthy.

That said, this is one of the areas which has long been problematic in the SQL RDBMS world, both across engines and across instances of the same engine, which may be set to different collations among other LOCALE-associated settings. It will be important to consider and understand the issues that have come up there, in order to apply the lessons learned here -- instead of repeating mistakes.

That said, the last few comments here seem to be asking that the order of the solution set for a query that lacks an ORDER BY clause should be identical every time it's run on the same instance, whether within seconds, minutes, weeks, or longer.

Solution discovery and hence delivery order may change for a number of reasons, in both short and long time frames. Default solution output order is intentionally undefined, and I believe it should remain so. Forcing an ordering in the way suggested may be a VERY heavy load, and in fact, requires that a partial solution set never be delivered except after the full solution set is found. The choice to add an ORDER BY clause and its attendant load is (or should be) an informed choice by the user, not an imposition by the SPARQL standard.

lisp commented 5 years ago

Forcing an ordering in the way suggested may be a VERY heavy load, and in fact, requires that a partial solution set never be delivered except after the full solution set is found.

please reiterate: which "way suggested"?

"repeatable" is not the same as "defined".
where one sorts the result as a workaround, because the solution sequence order may change order from one request to the next, this will necessarily materialize that sequence. if one can arrange that the order be reproducible, one need not materialize.

TallTed commented 5 years ago

@lisp

A DBMS may discover solutions of a query in any order, which order of discovery may change between executions of the same query due to many factors including (but not limited to) --

re-partitioning of data
re-indexing
INSERTs, DELETEs, UPDATEs and other actions which change the order of the data within the store
query execution plan (which may vary as the statistics of the stored data change)

In other words -- the discovery order is not inherently repeatable.

Because there is no viable way to force the discovery of solutions to always occur in the same order, there is no way to force the delivery of those solutions to always occur in the same order -- except by declaring that every query processor must always apply an ORDER BY -- even if that ORDER BY is internally defined by the processor, even on a per-query basis -- before delivering the solutions -- which by necessity means before delivering any solutions.

lisp commented 5 years ago

where "that ORDER BY [which] is internally defined by the processor" causes solutions to always occur in the same order, even where the terms which are subject to comparison between two respective solutions happen to be the same term?

TallTed commented 5 years ago

@lisp - I do not understand your comment.

lisp commented 5 years ago

this issue discusses solution ordering guarantees and how to achieve them. the initial description reports deficiencies in the current specification for ORDER BY, in particular with respect to rdf:langString terms, and proposes that minimal changes would suffice to permit paging. a later comment suggests that, as it is impossible to "force the delivery of those solutions to always occur in the same order", in order to provide a guarantee, "every query processor must always apply an ORDER BY".

i called into question the value of a change which improves the partial order, but does not make it complete. i also question whether a complete order would resolve the issue.

consider in the following queries, what ordering guarantees are expected, why, and what role does sort play in them?

select ?string ?number
where {
  values (?string ?number) {("a" 1) ("a" 2)}
} order by ?string

https://dydra.com/james/foaf/@query#ex-with-order (the sort is not stable)

select ?string ?number
where {
  values (?string ?number) {("a" 1) ("a" 2)}
}

https://dydra.com/james/foaf/@query#ex-without-order (delivery should repeat)

whereby, a VALUES clause is not the only source for which a repeatable discovery order is viable.

TallTed commented 5 years ago

@lisp - I don't think I understand your point.

I think your sample queries remind us that ORDER BY only has impact to the extent that the ORDER BY clause includes the variables in the SELECT -- i.e., a query that SELECTs variables that are not included in the ORDER BY clause will not necessarily have its solutions delivered in predictable nor consistent order.

I think your queries also remind us that solution discovery order does not necessarily impact the order of solution delivery.

But how does this relate to the basic point of this issue, which remains a wish to improve (if not complete) the specification of relative ordering within ORDER BY?

lisp commented 5 years ago

the values bound to ?string in the example are equal. the sort operators of which i am aware require a consistent precedence relation between respective values in order to produce a predictable result. under those conditions, the equal values will be treated the same as incomparable values. the effect is that, in the general case, that is given equal values, changes to the ORDER BY operator will not resolve the problem at issue.

TallTed commented 5 years ago

@kasei - I wonder whether your concern might be (partially) addressed by advising users who need to care about such things that they might consider using constructs like --

ORDER BY ASC ( LANG ( ?langtaggedvar ) ) 
         ASC ( STR ( ?langtaggedvar ) )

-- or --

ORDER BY ASC ( STR ( ?langtaggedvar ) ) 
         ASC ( LANG ( ?langtaggedvar ) )

Clearly, this doesn't get into internationalized collation ordering ... and at first I thought perhaps that could be addressed by a new optional argument to the ASC/DESC, as --

ORDER BY ASC ( STR ( ?langtaggedvar ), collation_identifier )

-- and then I went looking for previous work, and oh my what a mess.

RFC 4790, Internet Application Protocol Collation Registry, made a start in 2007, but the IETF Collation Registry for Internet Application Protocols was last updated in 2013 and has all of three defined collations.
the Unicode Common Locale Data Repository Collation Guidelines are beyond complex
MySQL Collation docs
Snowflake Collation docs
Transact-SQL (Microsoft SQL Server) Collation docs
...endless others...

TallTed commented 5 years ago

@lisp -- Are you arguing with my points, or agreeing?

@kasei's original post could have been clearer in the issue description (the Why?), by noting that it was focused only on ORDER BY -- but I think this is made clear by the Proposed solution --

ORDER BY over values with a (currently) undefined order SHOULD cause results to have consistent ordering

Note -- the above says nothing about values outside the ORDER BY, on which it appears you are focused.

w3c / sparql-dev