kasei commented 5 years ago

Why?

There are several cases where the current spec does not provide a total ordering over RDF terms, and therefore causes challenges for accessing data predictably (e.g. when paging results with LIMIT+OFFSET). SPARQL 1.1 §15.1 says, in part:

SPARQL does not define a total ordering of all possible RDF terms. Here are a few examples of pairs of terms for which the relative order is undefined:

"a" and "a"@en_gb (a simple literal and a literal with a language tag)

"a"@en_gb and "b"@en_gb (two literals with language tags)

The second point here is especially interesting, as it means that it is difficult to portably work with any RDF data that heavily uses language-tagged literals.

Previous work

Many implementations already seem to produce a consistent ordering over data for which SPARQL ordering is undefined.

Proposed solution

I believe that the SPARQL spec should add text stating that ORDER BY over values with a (currently) undefined order SHOULD cause results to have consistent ordering, even if that order is not explicitly defined by SPARQL. This will allow clients to use LIMIT/OFFSET paging over such data. This might also be paired with a Service Description Feature indicating support for such consistent sorting.
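To illustrate why a consistent order matters for paging, here is a minimal Python sketch, not SPARQL engine code. The `page` helper, the sample rows, and both sort keys are hypothetical. It simulates LIMIT/OFFSET by sorting and slicing, and assumes the engine may scan the data in a different order on each request.

```python
def page(rows, key, limit, offset):
    """Simulate ORDER BY + LIMIT/OFFSET: sort, then slice."""
    return sorted(rows, key=key)[offset:offset + limit]

# Hypothetical (value, language tag) pairs standing in for language-tagged literals.
rows = [("a", "en"), ("a", "fr"), ("a", "de"), ("b", "en")]

# Comparing only the lexical value leaves the three "a" rows tied, so their
# relative order depends on the order in which the rows were scanned.
partial_key = lambda r: r[0]
# Also comparing the language tag gives a total order over these rows.
total_key = lambda r: (r[0], r[1])

scan_1 = rows                   # order seen by the first request
scan_2 = list(reversed(rows))   # a different scan order on the second request

# Page 1 from one scan, page 2 from the other.
unstable = page(scan_1, partial_key, 2, 0) + page(scan_2, partial_key, 2, 2)
stable = page(scan_1, total_key, 2, 0) + page(scan_2, total_key, 2, 2)

print(unstable)  # one row appears twice and one is missing
print(stable)    # every row appears exactly once
```

Any deterministic tie-breaker would do here; the point is only that the order is the same on every request, not that it is a particular order.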

Possible (partial) alternatives include:

Define an ordering of all language-tagged strings (e.g. using Unicode codepoint collation; simple, but likely to cause unexpected results in some languages)
Define a total ordering over plain literals, language-tagged literals, and literals typed with datatypes that are supported explicitly by the SPARQL language (a large amount of work, and also likely to cause unexpected results)
Allow systems to declare (via Service Description) the specific collation used to compare terms (there is already support for collations in XPath's fn:compare on which SPARQL ordering depends, and they are identified by URIs)
Allow queries to specify a collation to be used for term comparison (larger implementation burden, but flexible and could declare via Service Description which collations are available)
Define a way to explicitly request consistent ordering specifically for paging (e.g. ORDER BY CONSISTENT ?name) without requiring any particular ordering
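As a small illustration of the first alternative, Python compares strings by Unicode code point, so it can show what "unexpected results in some languages" might look like. The word list is made up for the example.

```python
# Python's default string comparison is by Unicode code point.
words = ["zebra", "Äpfel", "apple", "Zoo"]
by_codepoint = sorted(words)

# Uppercase ASCII sorts before lowercase ASCII, and "Ä" (U+00C4) sorts after "z".
print(by_codepoint)  # ['Zoo', 'apple', 'zebra', 'Äpfel']

# The result does not depend on the input order, which is what paging needs.
assert sorted(reversed(words)) == by_codepoint
```

A German reader would likely expect "Äpfel" next to "apple", which is the kind of surprise a language-aware collation avoids. The order is still fully deterministic, though.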

Considerations for backward compatibility

This is a suggestion to include SHOULD normative language about ordering data in cases where currently no ordering is defined. This should not have any effect on backwards compatibility.

lisp commented 5 years ago

i call attention to the "values bound to ?string", which are the sort key for the order operation in the example, which demonstrate that, "in the general case, [] given equal values, changes to the ORDER BY operator will not resolve the problem at issue."

TallTed commented 5 years ago

@lisp - Again, YES, variables in the SELECT which are not in the ORDER BY will not be ordered, will not affect the order of the solution set. This is known, and clear, and I believe this to be a different concern than this issue.

VladimirAlexiev commented 4 years ago

FWIW, here's how GraphDB (and I presume rdf4j) do ordering.

Scalar Ordering

Here are some details about how scalars are ordered. These details come from the GraphDB SPARQL query processor, and you can check them with a query like this (also try changing the direction to DESC()):

```
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix my: <http://example.org/>
base <http://example.org/>
select * {
  values ?x {
    "0001-01-01T00:00:00"^^xsd:dateTime "0001-01-01"^^xsd:date
    "z" "1" "2"
    "z"@en "z"@en-GB "z"@fr "1"@en "1"@en-GB "1"@fr
    undef
    002 2 002.000 2.0 "002.000"^^xsd:float "2.0"^^xsd:float
    001 1 001.000 1.0 "001.000"^^xsd:float "1.0"^^xsd:float
    "1"^^my:foo "1"^^my:bar "2"^^my:baz
    <foo> <http://example.org/bar> my:baz <mailto:foo@example.org> <geo:42.68,23.21> <urn:uuid:1234> <urn:isbn:4567>
  }
} order by ASC(?x)
```

Values are grouped by kind. The ordering of these groups is as follows in the ASC (ascending) direction:

Null (undef): when the ordering field is null for some objects
IRIs, ordered alphabetically (prefixed and relative IRIs are expanded), eg:

```
<geo:42.68,23.21> <http://example.org/bar> <http://example.org/baz> <http://example.org/foo> <mailto:foo@example.org>
```

Numeric values, ordered numerically. (Note: the shortcut literals 002 and 002.000 mean "002"^^xsd:integer and "002.000"^^xsd:decimal respectively):

```
"001"^^xsd:integer "1"^^xsd:integer "001.000"^^xsd:decimal "1.0"^^xsd:decimal "001.000"^^xsd:float "1.0"^^xsd:float
"002"^^xsd:integer "2"^^xsd:integer "002.000"^^xsd:decimal "2.0"^^xsd:decimal "002.000"^^xsd:float "2.0"^^xsd:float
```

Dates, ordered chronologically:

```
"000001-01-01"^^xsd:date "0001-01-01"^^xsd:date "0001-01-02"^^xsd:date
```

Datetimes, ordered chronologically (please note these are not comparable to dates):
```
"0001-01-01T00:00:00"^^xsd:dateTime "000001-01-01T00:00:00"^^xsd:dateTime
```
Datatyped literals other than numbers, dates and datetimes, ordered first by datatype then by value:
```
"1"^^my:bar "2"^^my:baz "1"^^my:foo
```
langStrings, ordered first by language then by value:
```
"1"@en "z"@en "1"@en-GB "z"@en-GB "1"@fr "z"@fr
```

Plain strings:

```
"1"^^xsd:string "2"^^xsd:string "z"^^xsd:string
```

The order is stable, i.e. equal values of the same kind are emitted in the same order as encountered.
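The grouping described above can be approximated with a sort key that ranks the kind first and the value second. This is a rough Python sketch, not GraphDB's implementation: the `(kind, qualifier, value)` tuple representation, `KIND_RANK`, and `sort_key` are all assumptions made for illustration, and dates are left out of the sample data because this sketch would compare them as plain strings.

```python
# Rank of each kind in ascending order, following the groups listed above.
KIND_RANK = {"undef": 0, "iri": 1, "numeric": 2, "date": 3, "dateTime": 4,
             "typed": 5, "lang": 6, "string": 7}

def sort_key(term):
    """term is a hypothetical (kind, qualifier, value) tuple.

    qualifier is the datatype for "typed", the language tag for "lang",
    and "" for every other kind.
    """
    kind, qualifier, value = term
    # Numbers compare by numeric value; everything else compares as a string.
    value_key = float(value) if kind == "numeric" else value
    return (KIND_RANK[kind], qualifier, value_key)

terms = [
    ("string", "", "z"),
    ("lang", "fr", "1"),
    ("lang", "en", "z"),
    ("lang", "en", "1"),
    ("typed", "my:foo", "1"),
    ("typed", "my:bar", "1"),
    ("numeric", "", "2.0"),
    ("numeric", "", "001"),
    ("iri", "", "http://example.org/foo"),
    ("iri", "", "geo:42.68,23.21"),
    ("undef", "", ""),
]

ordered = sorted(terms, key=sort_key)
for term in ordered:
    print(term)
```

Because the rank is the first element of the key, values of different kinds are never compared with each other directly, which is what makes the order total over this sample.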

If you use DESC (descending) the order is reversed, so eg nulls come last. But stability is preserved, which means that ASC and DESC are not complete inverses of each other.
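Python's built-in sort happens to be stable in both directions too, so it can show the effect. This is only an analogy, assuming equal numeric values count as ties; the labelled rows are made up.

```python
# Each row is (lexical form, label); "1" and "1.0" are numerically equal.
rows = [("1", "first"), ("1.0", "second"), ("2", "third")]
key = lambda r: float(r[0])

asc = sorted(rows, key=key)
desc = sorted(rows, key=key, reverse=True)

# Ties keep their input order in both directions, so DESC is not ASC reversed.
print(asc)                  # [('1', 'first'), ('1.0', 'second'), ('2', 'third')]
print(desc)                 # [('2', 'third'), ('1', 'first'), ('1.0', 'second')]
print(list(reversed(asc)))  # [('2', 'third'), ('1.0', 'second'), ('1', 'first')]
```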

w3c / sparql-dev

Improve usability and predictability of sorting #88
