joshsh / sesametools

A collection of utilities for use with OpenRDF Sesame (as of recently, Eclipse RDF4J)
Other
50 stars 11 forks source link

ValueComparator and SPARQL 1.1 ORDER BY #18

Closed ansell closed 13 years ago

ansell commented 13 years ago

Our current ValueComparator basically matches the SPARQL 1.1 ORDER BY precedence rules.

http://www.w3.org/TR/sparql11-query/#modOrderBy

The current implementation follows it quite closely already, using NULL>BNode>URI>Literal , but the rules for Literal Datatype precedence are a little more complicated than the textual sort on the datatype URI followed by lexical sort that we are using so far.

Basically, the Operator Mapping table (link below) is sorted from top to bottom by precedence, and the actual values are sorted based on the native XML Schema datatypes, with other datatypes sorted last based, possibly, on lexical comparison of the datatype URIs.

http://www.w3.org/TR/sparql11-query/#OperatorMapping

We could have an extension method in ValueComparator that could be overriden to provide other sorting for literals with unknown datatypes, and by default sort by lexical comparison in the base ValueComparator for unknown datatypes.

For instance, we could move the current implementation down to another method that may look like:

// Note: Override me to provide different behaviour ValueComparator.sortUnknownDatatype(Literal literal1, Literal literal2) { // By default sort based on literal1.getDatatype().stringValue().compareTo(...) and then literal1.stringValue().compareTo(literal2.stringValue()) if the datatypes are equal }

Alternatively, we could add a method sortLiterals(Literal literal1, Literal literal2) and subclass the current ValueComparator to SparqlCompatibleValueComparator to override sortLiterals and provide compatible behaviour with SPARQL 1.1.

joshsh commented 13 years ago

Ah, yes it makes sense that there is a SPARQL total order for literals. In this case, shouldn't an implementation of that order must be present in Sesame, as a compliant implementation of SPARQL 1.1? It seems like a lot of work to reimplement it here, but I'm not sure where you're going with StatementComparator. Is it more than a handy way to create Statement sets?

ansell commented 13 years ago

Originally, I looked for a StatementComparator in sesame and didn't find one that handled nulls the way we wanted to handle them in RDF/JSON, so I created StatementComparator. When I created ValueComparator I never thought to look at Sesame to see if they had one already. Apparently they do, although their implementation doesn't exactly match SPARQL 1.1, and looks very messy right now.

It is split across two files:

http://repo.aduna-software.org/svn/org.openrdf/sesame/branches/2.6/core/queryalgebra/evaluation/src/main/java/org/openrdf/query/algebra/evaluation/util/ValueComparator.java

and,

http://repo.aduna-software.org/svn/org.openrdf/sesame/branches/2.6/core/queryalgebra/evaluation/src/main/java/org/openrdf/query/algebra/evaluation/util/QueryEvaluationUtil.java

The type precedence comment in QueryEvaluationUtil.compareLiterals() doesn't match the OperatorMapping table

    // type precendence:
    // - simple literal
    // - numeric
    // - xsd:boolean
    // - xsd:dateTime
    // - xsd:string
    // - RDF term (equal and unequal only)

In SPARQL 1.1, the precendence should be:

    // type precendence:
    // - numeric
    // - simple literal
    // - xsd:string
    // - xsd:boolean
    // - xsd:dateTime
    // - RDF term (equal and unequal only)

Instead of reimplementing this, it would be best to file a bug report with Sesame to get the order updated. They will have to update both files, as they assume the simple literal precedence is first in ValueComparator before they get to QueryEvaluationUtil.

After the fix it, I would suggest using their ValueComparator inside our StatementComparator to ensure compatibility.

joshsh commented 13 years ago

Yes, perhaps we can use the Sesame comparator when it is ready. However, I'm still not clear on the need for it. It's an extra dependency, after all, or a lot of new code. It's likely to be less CPU-efficient than our current lexical order, as well. Neither RDF/JSON nor JSONLD appear to require ordering of objects or statements. I thought StatementComparator was a workaround for Sesame's context-indifferent Statement hashCode. If we anticipate certain use cases involving graph normalization, then perhaps a standards-compliant comparator makes sense. Please let me know if I'm missing something.

ansell commented 13 years ago

StatementComparator was a way to automatically sort a TreeSet so that it could be directly output, based on the order of Statements in the Set, to RDF/JSON without needing another sort at that point. ValueComparator was created to support the process by encapsulating the Value sorting process in another, reusable, class.

RDF/JSON only needs objects to be ordered if Graphs are in use, otherwise it only needs subjects and predicates to be ordered next to each other, the actual order doesn't matter as long as similar statements are guaranteed to be localised together. You can't really rely on HashCode's for strict localisation of similar subject/predicate/object combinations, even if they do support null contexts the way we need. If the HashSet is small enough that collisions occur, then the statement insertion order will have an effect on the output order, which would affect the ability to serialise the RDF/JSON graph directly from the order of Statements in the Set.

So in reality, the ValueComparator doesn't need to follow SPARQL standard ordering unless we need to mimic SPARQL. For other purposes the current implementation is fine as RDF is not ordered itself. As long as the StatementComparator can be used to order a TreeSet to make RDF/JSON rendering efficient it is doing its job in my opinion.

This was more a suggested improvement than a need. From the Sesame point of view, if they want to strictly follow the standard for SPARQL ORDER BY, then they need to update their ValueComparator implementation. Their implementation already localises equivalent values next to each other for the RDF/JSON requirements.

joshsh commented 13 years ago

Ah, that's right. Thanks for clarifying. As far as this issue is concerned, I think we agree that the current lexical ordering of literals is acceptable w.r.t. current requirements. In addition, it appears that Sesame's SPARQL implementation should use the total order defined by SPARQL 1.1, but does not yet. At some point in the future, the Sesame comparator might serve as a SPARQL-compliant alternative to the current lexical ValueComparator, for applications which need it.