w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/

Improve usability and predictability of sorting #88

Open kasei opened 5 years ago

kasei commented 5 years ago

Why?

There are several cases where the current spec does not define a total ordering over RDF terms, which makes it hard to access data predictably (e.g. when paging results with LIMIT+OFFSET). SPARQL 1.1 §15.1 says, in part:

SPARQL does not define a total ordering of all possible RDF terms. Here are a few examples of pairs of terms for which the relative order is undefined:

  • "a" and "a"@en_gb (a simple literal and a literal with a language tag)
  • "a"@en_gb and "b"@en_gb (two literals with language tags)

The second point here is especially interesting, as it means that it is difficult to portably work with any RDF data that heavily uses language-tagged literals.

Previous work

Many implementations already seem to produce a consistent ordering over data for which SPARQL ordering is undefined.

Proposed solution

I believe that the SPARQL spec should add text stating that ORDER BY over values with a (currently) undefined order SHOULD cause results to have consistent ordering, even if that order is not explicitly defined by SPARQL. This will allow clients to use LIMIT/OFFSET paging over such data. This might also be paired with a Service Description Feature indicating support for such consistent sorting.
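To illustrate why a consistent order matters for paging, here is a minimal Python sketch. It is not taken from any SPARQL implementation: the `LITERALS` data, the `total_order_key` tie-breaking rule (lexical form, then language tag), and the helper names are all assumptions made for the example. The store is reshuffled before every request to mimic an engine whose internal iteration order changes between queries.

```python
import random

# Hypothetical literals as (lexical form, language tag) pairs; "" means no tag.
LITERALS = [("a", ""), ("a", "en-GB"), ("b", "en-GB"), ("a", "fr"), ("b", "")]


def total_order_key(literal):
    """Assumed tie-breaking rule (NOT defined by SPARQL): lexical form, then tag."""
    lexical, lang = literal
    return (lexical, lang)


def page(rows, limit, offset):
    """Emulate ORDER BY ... LIMIT limit OFFSET offset over an unordered store."""
    return sorted(rows, key=total_order_key)[offset:offset + limit]


def fetch_all(limit, seed):
    """Fetch every page, reshuffling the store before each request."""
    rng = random.Random(seed)
    collected, offset = [], 0
    while True:
        rows = LITERALS[:]
        rng.shuffle(rows)  # storage order differs on every request
        chunk = page(rows, limit, offset)
        if not chunk:
            return collected
        collected.extend(chunk)
        offset += limit
```

Because the key distinguishes every pair of terms, `fetch_all` returns each literal exactly once, whatever order the store happens to be in. If the key compared only the lexical form, `("a", "")`, `("a", "en-GB")` and `("a", "fr")` would tie, and a client paging with LIMIT/OFFSET could see one of them twice and miss another.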

Possible (partial) alternatives include:

Considerations for backward compatibility

This is a suggestion to add normative SHOULD language about ordering data in cases where no ordering is currently defined. This should not have any effect on backwards compatibility.

lisp commented 5 years ago

i call attention to the "values bound to ?string", which are the sort key for the order operation in the example, which demonstrate that, "in the general case, [] given equal values, changes to the ORDER BY operator will not resolve the problem at issue."

TallTed commented 5 years ago

@lisp - Again, YES, variables in the SELECT which are not in the ORDER BY will not be ordered, and will not affect the order of the solution set. This is known, and clear, and I believe this to be a different concern than this issue.

VladimirAlexiev commented 4 years ago

FWIW, here's how GraphDB (and I presume rdf4j) do ordering.

Scalar Ordering

Here are some details about how scalars are ordered. These details come from the GraphDB SPARQL query processor, and you can check them with a query like this (also try changing the direction to DESC()):

prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix my: <http://example.org/>
base <http://example.org/>
select * {
  values ?x {
    "0001-01-01T00:00:00"^^xsd:dateTime "0001-01-01"^^xsd:date 
    "z" "1" "2"
    "z"@en "z"@en-GB "z"@fr "1"@en "1"@en-GB "1"@fr
    undef 
    002 2 002.000 2.0 "002.000"^^xsd:float "2.0"^^xsd:float
    001 1 001.000 1.0 "001.000"^^xsd:float "1.0"^^xsd:float
    "1"^^my:foo "1"^^my:bar "2"^^my:baz
    <foo> <http://example.org/bar> my:baz <mailto:foo@example.org> <geo:42.68,23.21> <urn:uuid:1234> <urn:isbn:4567>
  }
} order by ASC(?x)

Values are grouped by kind. The ordering of these groups is as follows in the ASC (ascending) direction:

The order is stable, i.e. equal values of the same kind are emitted in the same order as encountered.

If you use DESC (descending), the order of the groups is reversed, so e.g. nulls come last. But stability is preserved, which means that ASC and DESC are not complete inverses of each other.
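The last point can be shown with a small Python sketch. It uses Python's built-in `sorted`, which is also stable in both directions; it is only an analogy and says nothing about how GraphDB implements sorting. The `rows` data and `by_value` helper are made up for the example.

```python
# Hypothetical rows: (label, sort value). "first" and "second" have equal keys.
rows = [("first", 1), ("second", 1), ("third", 2)]


def by_value(row):
    """Sort key: only the numeric value, so the labels never break ties."""
    return row[1]


# Stable ascending sort: ties keep their encounter order.
ascending = sorted(rows, key=by_value)

# Stable descending sort: ties STILL keep their encounter order.
descending = sorted(rows, key=by_value, reverse=True)

# A true mirror image of the ascending result flips the ties as well.
mirrored = list(reversed(ascending))

print(ascending)
print(descending)
print(mirrored)
```

Here `descending` places `("first", 1)` before `("second", 1)`, while `mirrored` places them the other way round, so the descending result is not simply the ascending result read backwards.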