RDF graphs with value-space literals

w3c / rdf-star-wg

RDF-star Working Group

RDF graphs with value-space literals #136

Open pfps opened 1 week ago

pfps commented 1 week ago

It appears that some RDF implementations build RDF graphs where literals with recognized datatypes are represented as if they were members of the value space instead of in their lexical form. This does not appear to be sanctioned by the RDF recommendations. So "1.99999999999999999999999999999"^^xsd:float is stored as the IEEE floating point number 2 and "2"^^xsd:byte, "2"^^xsd:short, "2"^^xsd:int, and "2"^^xsd:long are all stored as the integer 2.
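To make the described behavior concrete, here is a minimal Python sketch of a value-space representation. The `TO_VALUE` table and the `value_of` helper are hypothetical names chosen for illustration; they are not taken from any particular implementation, and range checks (for example on xsd:byte) are left out.

```python
import struct

XSD = "http://www.w3.org/2001/XMLSchema#"

def float32(lexical: str) -> float:
    """Round a decimal string to an IEEE 754 single-precision value.

    Simplification: the string is first parsed as a double and then
    narrowed, which is close enough for this illustration.
    """
    return struct.unpack("!f", struct.pack("!f", float(lexical)))[0]

# Hypothetical lexical-to-value table for a few recognized datatypes.
# No range checking is done (e.g. xsd:byte is not limited to -128..127).
TO_VALUE = {
    XSD + "float": float32,
    XSD + "byte": int,
    XSD + "short": int,
    XSD + "int": int,
    XSD + "long": int,
}

def value_of(lexical: str, datatype: str):
    """Return the value a store of this kind would keep for the literal."""
    return TO_VALUE[datatype](lexical)

as_float = value_of("1.99999999999999999999999999999", XSD + "float")
as_ints = {value_of("2", XSD + t) for t in ("byte", "short", "int", "long")}
```

Under these assumptions the float literal is stored as 2.0, and the four integer-typed literals collapse into the single value 2, so neither the original lexical form nor the original datatype can be recovered from the stored value.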

Would it be possible to liberalize the treatment of literals with recognized datatypes in RDF to support this? SPARQL entailment regimes already legitimize something along these lines for SPARQL.

afs commented 1 week ago

SPARQL Query does not say anything about how the dataset came into existence - a dataset is given to the query execution. That dataset may be "produced" from another: for example, by converting to canonicalized lexical form as part of the parsing process. This is not covered by SPARQL (this is intentional). SPARQL does have to deal with expression results which are values; some cases are not simple: MONTH("2024-01-02"^^xsd:date) can be expected to be "01"^^xsd:integer despite not being canonical.

> So "1.99999999999999999999999999999"^^xsd:float is stored as the IEEE floating point number 2 and "2"^^xsd:byte, "2"^^xsd:short, "2"^^xsd:int, and "2"^^xsd:long are all stored as the integer 2.

Allowing value-centric interpretation of RDF (including syntax to indicate values, for example "+012"~~xsd:integer) should go on the features for future consideration.
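The conversion to canonical lexical form at parse time mentioned above could look roughly like the following Python sketch for xsd:integer. The function name `canonical_integer` is an assumption made for illustration; the check uses the xsd:integer lexical pattern (optional sign followed by decimal digits).

```python
import re

# Lexical space of xsd:integer: optional sign, then one or more digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")

def canonical_integer(lexical: str) -> str:
    """Map any valid xsd:integer lexical form to its canonical form.

    Raises ValueError for strings outside the lexical space, so an
    ill-typed literal is reported rather than silently rewritten.
    """
    if INTEGER_LEXICAL.fullmatch(lexical) is None:
        raise ValueError(f"not a valid xsd:integer lexical form: {lexical!r}")
    # int() drops the leading '+' and leading zeros; str() renders the rest.
    return str(int(lexical))

canonical_plus = canonical_integer("+012")
canonical_month = canonical_integer("01")
```

A parser that applies such a step produces a dataset that differs, term by term, from the one in the input document, which is the kind of dataset "production" that SPARQL deliberately leaves unspecified.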

mkroetzsch commented 1 week ago

I have recently encountered the same issue: the standard reserves the term "graph" for the RDF abstract syntax, but does not foresee any way of interpreting an RDF document as a (semantic) graph. However, the abstract syntax RDF graph is almost semantic as it is: IRIs do not get interpreted anyway (RDF assumes simple string equality), and bnodes in the abstract syntax are already ID-free. The only syntactic bits left are the syntactically encoded literals. A graph structure where these literals would be replaced by the data values they stand for would be very easy to define. RDF already has the necessary interpretation definitions ready, even for cases where a type is unsupported -- someone just needs to give that graph-based semantics a name so people can refer to it when describing what they already do.

Indeed, practitioners often want to use RDF to represent graphs semantically rather than viewing graphs only as a syntactic abstraction on the way to more complex model theories. Most people in practice already seem to think RDF is a standard for exchanging graphs. Such a fully semantic view would also be in harmony with the proposal in rdf concepts issue #60.

The semantic view would also be helpful for training and teaching. I saw major confusion in students when trying to explain that RDF graphs, while being abstract in some sense, still contain concrete syntactic elements that are further interpreted only later on. It would be easier to say that RDF is a standard for representing graphs that make connections between IRIs, bnodes, and concrete data values, instead of having to introduce datatypes and literal syntax first. In particular datatypes have a lot of technical baggage that is not essential to understanding what RDF data means, e.g., all the subtypes of xsd:decimal that merely impose syntactic constraints but lead to indistinguishable values in a common number domain. To explain the semantic view on graphs, one would merely have to say that RDF graphs can contain decimal numbers (with integers as a special case), without talking about lexical representations and parsing (aka lexical-value mapping) yet.

pchampin commented 1 week ago

@mkroetzsch I sympathize with the notion that literals are "overly syntactic", and with the proposal of this issue in general.

However, I'm not comfortable with considering that "graphs that make connections between IRIs, bnodes, and concrete data values" would be a "semantic graph", and ultimately more "homogeneous" (on the syntax-semantics spectrum) than RDF graphs are...

The triple dbr:Tim_Berners-Lee dbp:birthDate "1955-06-08"^^xsd:date does not relate the date "8 June 1955" to the IRI dbr:Tim_Berners-Lee, it relates that date to the person denoted by that IRI. Both IRIs and literals are (in that example) in the domain of syntax, while persons and dates are in the domain of discourse.

lisp commented 1 week ago

given that we are among the implementations that do this, i sympathize with the intent. i suggest, however, that it is out of scope, as proper consideration would require more effort than the current time constraints allow. among the issues which would not likely be trivial to resolve,

would this take the form of a capability which is declared by an implementation or would the recommendation just allow it?
would the representation be the consequence of just d-entailment or could it involve further canonicalization or normalization? we, for instance, normalize all temporal locations to zulu time, which is a step beyond d-entailment.
does it matter that it would create a mutual dependency between the concepts recommendation and the entailment recommendation?
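The normalization to zulu time mentioned in the second question can be sketched in Python as follows. This is only an illustration of the idea, not the commenter's implementation; the `to_zulu` helper is a hypothetical name, and fractional seconds are dropped for brevity.

```python
from datetime import datetime, timezone

def to_zulu(lexical: str) -> str:
    """Rewrite a timezoned dateTime lexical form as the same instant in UTC.

    Assumes an explicit numeric offset such as +02:00 in the input.
    Fractional seconds are not preserved in this sketch.
    """
    dt = datetime.fromisoformat(lexical)
    if dt.tzinfo is None:
        raise ValueError("dateTime without a timezone cannot be normalized")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

from_paris = to_zulu("2024-01-02T10:00:00+02:00")
from_new_york = to_zulu("2024-01-02T03:00:00-05:00")
```

Both inputs name the same instant and end up with the same stored form, so the original offset is no longer available once the data has been loaded.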

gkellogg commented 1 week ago

This would also require a value-to-lexical description for datatypes, which would probably be their canonical representation, but it would be much more challenging to define for rdf:HTML, rdf:XMLLiteral, and rdf:JSON datatypes. Right now, datatype descriptions describe the lexical-to-value mapping, but not the inverse.
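For a simple datatype the two directions are easy to write down. The following Python sketch pairs a lexical-to-value mapping with a value-to-lexical mapping for xsd:boolean, using its canonical forms "true" and "false"; the function names are assumptions made for illustration.

```python
# Lexical space of xsd:boolean and the values each form maps to.
LEXICAL_TO_VALUE = {"true": True, "1": True, "false": False, "0": False}

def boolean_value(lexical: str) -> bool:
    """Lexical-to-value mapping (what datatype descriptions define today)."""
    return LEXICAL_TO_VALUE[lexical]

def boolean_canonical(value: bool) -> str:
    """Value-to-lexical mapping, here chosen to be the canonical form."""
    return "true" if value else "false"

# Round trip: "1" comes back as "true", so the original spelling is lost.
round_tripped = boolean_canonical(boolean_value("1"))
```

The inverse direction is a choice among several valid lexical forms, and for datatypes such as rdf:HTML or rdf:JSON it is much less obvious which form to pick.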

mkroetzsch commented 1 week ago

@pchampin I completely agree with your view that "semantic" seems to be the wrong term here. There are many stages of interpretation to get from a sequence of bytes to some open-world model theory; trying to make do with just two adjectives "syntactic" and "semantic" is bound to be confusing ;-) The distinction between IRI and resource is clear to me. Having a graph view that avoids literal syntax while using IRIs should not be interpreted as an attempt to identify resources with IRIs (thus introducing a kind of unique name assumption). The graph structure is really just the structure that tools with datatype support "see", before applying whatever RDF semantics they want to use further on.

So what we are discussing here is essentially "abstract syntax with datatype support". The representation depends on which types are recognised. The current abstract syntax is what you get if no datatype is recognized (using the set of RDF literals as the fallback value space for unknown types, as usual). If further datatypes are supported, then tools can just replace them by their values already during parsing (which is the next thing simple D-entailment would do anyway).
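A minimal Python sketch of "abstract syntax with datatype support" might look like this. The `Literal` class, the `RECOGNIZED` table, and the `interpret` function are hypothetical names; handling of ill-typed literals is deliberately left out.

```python
from dataclasses import dataclass
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

@dataclass(frozen=True)
class Literal:
    """An RDF literal as in the current abstract syntax."""
    lexical: str
    datatype: str

# Hypothetical set of recognized datatypes with their value parsers.
RECOGNIZED = {XSD + "integer": int, XSD + "decimal": Decimal}

def interpret(literal: Literal, recognized: dict):
    """Replace a literal by its value if its datatype is recognized.

    Unrecognized datatypes fall back to the literal itself, so an empty
    `recognized` table reproduces the current abstract syntax.
    """
    parse = recognized.get(literal.datatype)
    return parse(literal.lexical) if parse is not None else literal

known = interpret(Literal("042", XSD + "integer"), RECOGNIZED)
unknown = interpret(Literal("x", "http://example.org/dt"), RECOGNIZED)
```

Here `known` is the integer 42, while `unknown` stays a `Literal`, which mirrors the use of the set of RDF literals as the fallback value space for unknown types.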

Allowing graph representations that use values for some literals could also remove possible confusion in typical RDF applications. For example, Turtle syntax supports expressions like 42 and +42 to denote integers, but it is not clear to me (from the spec) if the abstract syntax graph obtained from a Turtle file should make a distinction between the two, or between other writings like "42"^^xsd:byte. It is difficult to explain to practitioners (and students) that Turtle support somehow requires tools to "know" integers and still does not allow them to abstract from superficial syntactic details during parsing.

mkroetzsch commented 1 week ago

@lisp @gkellogg As I understand the proposal, this is not meant to provide a new representation of RDF that can somehow be turned back into literals with lexical values. Tools that use an internal representation that merely represents values would still be free to syntactically return RDF in any (not necessarily canonical) form that denotes the same values. How they represent values internally is not regulated by the standard. This is similar to the view taken in D-entailment.

@lisp Your treatment of timezones may not be fully compatible with the value space defined for xsd:date (and friends), but I believe that RDF allows you to use a datatype of your choice for xsd:date, so the modification could be accommodated within the boundaries of conformance. This also allows you to print values as you like (as long as the literal you return denotes the value you had internally). Naturally, your output would refer to your built-in datatype interpretation rather than to the official XML Schema version. The proposal made here does not impose any new requirements on such cases.

Re "mutual dependency between the concepts recommendation and the entailment recommendation". If the concepts recommendation explained how to map literals to values, then the entailment recommendation would not need to do the same again. So this editorial issue would be solved by moving content rather than by mutual references between specs. The only special case to handle is inconsistency due to ill-typed literals (it is technically not a problem to flag such inconsistency during literal parsing, but it would introduce the idea of inconsistency into RDF concepts). Maybe this needs a separate discussion, since it is also strange that the object 3,1 would be a syntax error in a Turtle triple whereas the object "3,1"^^xsd:integer would be a semantic inconsistency of the whole graph.

Antoine-Zimmermann commented 1 week ago

I am somewhat worried about what I read here. Is this issue saying that some implementations consider that the graph written in Turtle like this:

 <s> <p> "2.0"^^xsd:decimal .

is the same as what is written in Turtle like that?

<s> <p> "2"^^xsd:integer .

And, am I correct in saying that the suggestion is to make this choice explicitly allowed in the spec?
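The difference between the two notions of "the same" can be shown with a short Python sketch. Literals are modeled as plain (lexical form, datatype IRI) pairs, which is an assumption made for illustration only.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

# The two objects from the Turtle snippets above, as (lexical, datatype) pairs.
as_decimal = ("2.0", XSD + "decimal")
as_integer = ("2", XSD + "integer")

# Term equality: both lexical form and datatype IRI must match.
same_term = as_decimal == as_integer

# Value equality: compare the values the lexical forms map to.
# xsd:integer is derived from xsd:decimal, so both live in one number domain.
same_value = Decimal(as_decimal[0]) == int(as_integer[0])
```

In this sketch `same_term` is false and `same_value` is true; an implementation that stores values rather than literals would therefore treat the two graphs as identical.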

Antoine-Zimmermann commented 1 week ago

@mkroetzsch

> Turtle syntax supports expressions like 42 and +42 to denote integers, but it is not clear to me (from the spec) if the abstract syntax graph obtained from a Turtle file should make a distinction between the two, or between other writings like "42"^^xsd:byte

The Turtle spec is quite clear to me: 42 is parsed exactly as "42"^^xsd:integer and +42 as "+42"^^xsd:integer. The production rule for INTEGER says:

> The literal has a lexical form of the input string, and a datatype of xsd:integer.

Also, be careful with the use of the word "denote", which has a normative meaning in RDF Semantics. 42 in Turtle does not denote anything in this sense. It is just syntactic sugar for the literal ("42", http://www.w3.org/2001/XMLSchema#integer), which may or may not be understood as denoting the integer forty-two, depending on the entailment regime.
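The parsing rule quoted above can be sketched in Python as follows. The `integer_token_to_literal` function is a hypothetical helper, not part of any Turtle library, and it only covers the INTEGER shorthand.

```python
import re

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Shape of Turtle's INTEGER token: optional sign followed by digits.
INTEGER_TOKEN = re.compile(r"[+-]?[0-9]+")

def integer_token_to_literal(token: str) -> tuple:
    """Turn a Turtle INTEGER token into a (lexical form, datatype IRI) pair.

    The lexical form is the input string unchanged, as the rule requires.
    """
    if INTEGER_TOKEN.fullmatch(token) is None:
        raise ValueError(f"not an INTEGER token: {token!r}")
    return (token, XSD_INTEGER)

plain = integer_token_to_literal("42")
signed = integer_token_to_literal("+42")
```

The two results are different literals, ("42", xsd:integer) and ("+42", xsd:integer), even though `int("42") == int("+42")`; whether they denote the same integer is a question for the entailment regime, not for the parser.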