Revise distinction between simple literals and xsd:string literals

rubensworks commented 4 years ago

SPARQL 1.1 treats simple literals (e.g. "string") as distinct from literals with the xsd:string datatype (e.g. "string"^^xsd:string).

RDF 1.1 defines the following:

Simple literals are syntactic sugar for abstract syntax literals with the datatype IRI http://www.w3.org/2001/XMLSchema#string

SPARQL 1.1 predates the RDF 1.1 spec and is based on RDF 1.0, which considers these two forms non-equal.

This means that in SPARQL 1.1, the following two queries can produce different results:

SELECT * where { ?s ?p "string" }

SELECT * where { ?s ?p "string"^^xsd:string }

This inconsistency leads to some unfortunate problems when handling SPARQL using RDF 1.1 tools, such as those following RDF/JS.
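
The difference can be illustrated with a minimal Python sketch. It assumes a toy model in which a literal is a (lexical form, datatype, language) tuple and a triple pattern matches by plain term equality. The `Literal` type, the `match_object` helper, and the sample data are hypothetical and are not taken from any SPARQL engine or from RDF/JS.

```python
from typing import NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


class Literal(NamedTuple):
    lexical: str
    datatype: Optional[str] = None   # None models an RDF 1.0 simple literal
    language: Optional[str] = None


# Hypothetical RDF 1.0-style data that mixes both forms.
triples = [
    ("ex:a", "ex:p", Literal("string")),
    ("ex:b", "ex:p", Literal("string", XSD_STRING)),
]


def match_object(data, obj):
    """Return the subjects whose object is term-equal to obj (a toy `?s ?p <obj>` pattern)."""
    return [s for s, _p, o in data if o == obj]


simple_hits = match_object(triples, Literal("string"))
typed_hits = match_object(triples, Literal("string", XSD_STRING))
print(simple_hits)  # ['ex:a']
print(typed_hits)   # ['ex:b']
```

Under this RDF 1.0-style reading each query sees only one of the two triples, which is the behaviour the issue describes.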

(This is according to my understanding of the SPARQL 1.1 spec. If any of the original SPARQL 1.1 authors could clarify the original intention, that would be great.)

Previous work

As far as I know, no solutions exist yet. But this problem occurs in the wild.

Proposed solution

In line with RDF 1.1, I propose to make simple literals syntactical sugar for xsd:string-typed literals. As such, a SPARQL parser would consider the following forms equivalent:

SELECT * where { ?s ?p "string" }

SELECT * where { ?s ?p "string"^^xsd:string }

Considerations for backward compatibility

Applications that depend on this distinction may break, which may be acceptable considering the breakage that already occurs due to RDF 1.1 inconsistency.

Jamie-SA commented 4 years ago

We recently reviewed this after some RDFJS code changes. We have seen inconsistency across Triplestores, both in the results of the above two queries and in the serialized output. At least one popular Triplestore does not treat the two as different and only returned results in one form (as if, internally, they were the same thing). At least one other gave different results for the two queries, but some of its output serializations emitted only one of the two forms no matter what the input form was.

Our application code had been trying to maintain the distinction between the two forms but was running into problems between different Triplestore implementations and libraries like RDFJS. After reviewing the RDF 1.1 spec, it seems to me that RDF 1.1 is saying they are equivalent, and therefore the same literal. We decided the proper handling is to treat the two as identical/interchangeable and that the output form does not have to match the input form.

It seems having SPARQL officially declare "string" and "string"^^xsd:string as the same would be a good change that would make it more consistent with RDF 1.1.

And, if the 2 forms are treated the same in all places it would be nice to specifically state that the original form during input is not required to be maintained. It took a couple reads of the RDF 1.1 spec to come to the understanding that is essentially what they were saying.

I agree with Ruben's assessment on backwards compatibility. It is possible it will break existing code but already it seems existing Triplestore implementations are inconsistent. Having the query language specification inconsistent with the underlying RDF specification will only lead to implementation issues and more bugs. Having it clearly stated and in line with RDF 1.1 would prevent more problems going forward.

bcogrel commented 4 years ago

In Ontop, since our move to RDF 1.1, plain literals without lang tags (simple literals) have been systematically replaced by xsd:string, both in the RDF graph and in the SPARQL query itself.

In our understanding, in RDF 1.1 the notion of plain literal has simply disappeared and it is now replaced either by xsd:string or by rdf:langString (in the presence of a language tag). We would therefore expect an RDF 1.1 processor not to be exposed to plain literals anymore.

This is what the SPARQL parser of RDF4J does since they moved to RDF 1.1. It basically applies the rule introduced in Turtle 1.1, where "myString" is now officially a shortcut for "myString"^^xsd:string: «The literal has a lexical form of the first rule argument, String. If the '^^' iri rule matched, the datatype is iri and the literal has no language tag. If the LANGTAG rule matched, the datatype is rdf:langString and the language tag is LANGTAG. If neither matched, the datatype is xsd:string and the literal has no language tag.» (https://www.w3.org/TR/turtle/#sec-parsing-terms).
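
The quoted Turtle 1.1 rule can be sketched as a small Python function. This is only an illustration of the rule as quoted above, not RDF4J's or Ontop's actual code; the function name `resolve_literal` and the tuple representation are assumptions made for the example.

```python
from typing import Optional, Tuple

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


def resolve_literal(
    lexical: str,
    datatype_iri: Optional[str] = None,
    langtag: Optional[str] = None,
) -> Tuple[str, str, Optional[str]]:
    """Return (lexical form, datatype IRI, language tag) following the quoted rule."""
    if datatype_iri is not None:
        # '^^' iri matched: use that datatype, no language tag.
        return (lexical, datatype_iri, None)
    if langtag is not None:
        # LANGTAG matched: datatype is rdf:langString.
        return (lexical, RDF_LANGSTRING, langtag)
    # Neither matched: datatype is xsd:string, no language tag.
    return (lexical, XSD_STRING, None)
```

With this rule, `resolve_literal("myString")` and `resolve_literal("myString", XSD_STRING)` yield the same term, so a parser applying it never produces a separate "plain literal".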

From the next SPARQL specification, I would expect all the occurrences of plain literal to be removed and this Turtle 1.1 rule to be included.

In terms of backwards compatibility with SPARQL 1.1 + RDF 1.0, SPARQL 1.1 + RDF 1.1 would return more results (in the absence of negation) than the former. At the moment I cannot think of a case where the old distinction between simple literals and xsd:string was valuable. Instead, this distinction was clearly annoying for a lot of users, not to mention implementers. Even if this change may hypothetically break something, I would still consider it fair.

bcogrel commented 4 years ago

Another way to look at it is the following: if I am using SPARQL 1.1 to query an RDF 1.1 graph, what is the point of treating the constants in the SPARQL query differently, as RDF 1.0 constants? Simple literals would never match when used in triple/quad patterns.
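
A rough Python sketch of this point, under the assumption that an RDF 1.1 graph stores every string literal with xsd:string and that pattern matching is plain term equality. The `graph`, `match_object`, and `normalise` names are hypothetical, and language tags are left out for brevity.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# Toy RDF 1.1 graph: literals are (lexical form, datatype) pairs and
# every string literal already carries xsd:string.
graph = [("ex:a", "ex:p", ("string", XSD_STRING))]


def match_object(data, obj):
    """Return the subjects whose object is term-equal to obj."""
    return [s for s, _p, o in data if o == obj]


def normalise(literal):
    """Map a datatype-less query constant to xsd:string, as RDF 1.1 does."""
    lexical, datatype = literal
    return (lexical, XSD_STRING if datatype is None else datatype)


rdf10_constant = ("string", None)  # query constant kept as an RDF 1.0 simple literal

unnormalised_hits = match_object(graph, rdf10_constant)
normalised_hits = match_object(graph, normalise(rdf10_constant))
print(unnormalised_hits)  # []
print(normalised_hits)    # ['ex:a']
```

Keeping the constant as an RDF 1.0 simple literal can only lose matches against such a graph; normalising it in the query restores them.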

afs commented 4 years ago

While SPARQL 1.1 came before RDF 1.1, the change was anticipated (and was, in fact, discussed during SPARQL 1.0: xsd:string and simple literals have the same value space, and the WG tried to make things as similar as possible). SPARQL does not have its own, different data model.

"Simple literal" is terminology invented by the first SPARQL WG because there was no such terminology in RDF (1.0).

https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal uses the phrase "simple literal" to mean the syntax with no datatype written, which denotes an xsd:string literal. That reuse of the same terminology was intentional. The syntax rule of "no datatype" can be applied to SPARQL (query and update). You could even argue that it is already covered by "concrete syntaxes".

The SPARQL 1.1 "Argument Compatibility Rules" for string-related functions treat simple literal and xsd:string the same and the kind of return type matches the primary argument.
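
As a hedged illustration, the compatibility check could look like the Python sketch below. It follows my reading of the "Argument Compatibility Rules" in the SPARQL 1.1 Query spec; the `Literal` type and the `is_plain_string` and `compatible` helpers are hypothetical names, and language tags are compared with plain string equality for simplicity.

```python
from typing import NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


class Literal(NamedTuple):
    lexical: str
    datatype: Optional[str] = None   # None models a simple literal
    language: Optional[str] = None


def is_plain_string(lit: Literal) -> bool:
    """True for a simple literal or an xsd:string literal (no language tag)."""
    return lit.language is None and lit.datatype in (None, XSD_STRING)


def compatible(arg1: Literal, arg2: Literal) -> bool:
    """Toy version of the argument compatibility check for string functions."""
    # Both are simple literals or xsd:string literals, in any combination.
    if is_plain_string(arg1) and is_plain_string(arg2):
        return True
    # Both carry a language tag: the tags must be identical.
    if arg1.language is not None and arg2.language is not None:
        return arg1.language == arg2.language
    # Language-tagged first argument with a simple or xsd:string second argument.
    return arg1.language is not None and is_plain_string(arg2)
```

For example, `compatible(Literal("abc"), Literal("b", XSD_STRING))` is true, because the check never distinguishes a simple literal from an xsd:string one.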

It is something for the next revision of SPARQL to complete.

FWIW Apache Jena changed at Jena 3.0.0 (July 2015). The only migration issue was that database data that used both simple literals and xsd:string together had to be reloaded. If data was one OR the other, it should have been OK (though a reload was advised).

SPARQL 1.1 did define DATATYPE to return rdf:langString for langtag literal - again, joint discussion between the working groups.

Language-tagged literals getting a datatype has caused more feedback, but not much overall.

lisp commented 4 years ago

given

Applications that depend on this distinction may break,

and

FWIW Apache Jena changed at Jena 3.0.0 (July 2015). The only migration issue was that database data that used both simple literals and xsd:string together had to be reloaded. If data was one OR the other, it should have been OK (though a reload was advised).

is any background available on cases where the distinction had been made on purpose?

afs commented 4 years ago

I don't recall any arising; I can't think of any mention of mixed use.

There were some questions about why e.g. "^^xsd:string" had disappeared from Turtle output. An answer saying it was because of RDF 1.1 didn't get pushback.

gkellogg commented 4 years ago

IIRC, there were a couple of tests affected by merging plain literals and xsd:string literals. Also, the general advice on using either xsd:string or rdf:langString is to omit them when serializing. Indeed, neither Turtle, N3, nor JSON-LD can represent datatype and language simultaneously. Furthermore, the i18n namespace introduced to describe both BCP47 language and text direction is a datatype.

rubensworks commented 1 year ago

This will be resolved in SPARQL 1.2: https://github.com/w3c/sparql-query/pull/57
