w3c / shacl

SHACL Community Group (Post-REC activities)

identity vs equality #13

Open VladimirAlexiev opened 2 years ago

VladimirAlexiev commented 2 years ago

(Originally posted as https://lists.w3.org/Archives/Public/public-shacl/2022May/0000.html; edited below).

Several constraint components compare values using identity (sameTerm) rather than equality (=). This is fine in many cases (URLs, strings), but not all (numbers, booleans); dates are also peculiar.

For numeric and boolean literals, identity uses the lexical space, whereas equality uses the value space. The following are true for equality but not for identity:
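A minimal sketch of the distinction, assuming a literal is modelled as a (lexical form, datatype IRI) pair and that only a handful of XSD datatypes are handled. The helper names `same_term`, `to_value` and `value_equal` are hypothetical and do not come from any SHACL or SPARQL library:

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

def same_term(a, b):
    """Identity: both the lexical form and the datatype must match exactly."""
    return a == b

def to_value(lit):
    """Map a literal to a (kind, value) pair in the value space.

    Assumption: only integer, decimal, float, double and boolean are covered.
    """
    lexical, datatype = lit
    if datatype in (XSD + "integer", XSD + "decimal"):
        return ("numeric", Decimal(lexical))
    if datatype in (XSD + "float", XSD + "double"):
        return ("numeric", float(lexical))
    if datatype == XSD + "boolean":
        return ("boolean", lexical in ("true", "1"))
    raise ValueError(f"unsupported datatype: {datatype}")

def value_equal(a, b):
    """Equality: compare in the value space, keeping booleans and numbers apart."""
    return to_value(a) == to_value(b)

one = ("1", XSD + "integer")
padded_one = ("01", XSD + "integer")
decimal_one = ("1.0", XSD + "decimal")
true_word = ("true", XSD + "boolean")
true_digit = ("1", XSD + "boolean")

for left, right in [(one, padded_one), (one, decimal_one), (true_word, true_digit)]:
    print(left[0], right[0], "identical:", same_term(left, right),
          "equal:", value_equal(left, right))
```

Each of the three pairs differs as a term but denotes the same value, which is the situation the constraint components above do not handle.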

Which means that it's unnecessarily hard or even impossible to express rules like this in standard SHACL (examples from https://transparency.ontotext.com/spec/#validation-rules):

Looking at some constraint components in the spec:

I propose these clarifications:

Furthermore, I propose to add new constraint components that use equality rather than identity. I cannot come up with better names, so suggestions are welcome. I guess it's impossible to add them to sh: so maybe add them to dash: ?

VladimirAlexiev commented 2 years ago

@HolgerKnublauch

Yes, the interpretation should be sameTerm, and the prose could sometimes be clearer on that. Blame the editor on the latter. Unfortunately there is no SHACL 1.1 WG to fix formal definitions that could lead to further controversies. Meanwhile, I guess if implementors are not sure what to do, they look at other implementations.

Ok, I can understand that.

Technically, one reason for doing sameTerm semantics was performance.

Agreed. I think that a lot of repositories (e.g. GDB) have additional "literal indexes" to handle comparisons like = and < quickly, but still, the standard spo (in particular o) index is faster.

One way to evolve your use cases could be to introduce an optional boolean flag such as sh:matchEquality true which could be a second argument to sh:equals, sh:hasValue and sh:in. Another way would be (as you say) to introduce completely new constraint components.

Ok, but this flag should be in the enclosing PropertyShape? Like this?

# Power plant shape
sh:not [sh:property [sh:path tr:installedCapacity; sh:hasValue "0"^^xsd:float; sh:matchEquality true]]

# Outage shape
sh:property [sh:path (tr:energyResource tr:installedCapacity); sh:equals tr:installedCapacity; sh:matchEquality true]
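To make the proposal concrete, here is a rough sketch of how an engine might evaluate sh:hasValue with and without such a flag. It assumes literals are (lexical form, datatype IRI) pairs and that the flag only affects numeric literals; `has_value` and `numeric_value` are hypothetical names, and sh:matchEquality itself is only a proposal in this thread, not part of SHACL:

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"
NUMERIC = {XSD + t for t in ("integer", "decimal", "float", "double")}

def numeric_value(lit):
    """Return the numeric value of a literal, or None for non-numeric datatypes."""
    lexical, datatype = lit
    return Decimal(lexical) if datatype in NUMERIC else None

def has_value(value_nodes, required, match_equality=False):
    """Sketch of sh:hasValue: sameTerm by default, value equality if flagged."""
    # Default SHACL behaviour: the required term must occur as-is.
    if required in value_nodes:
        return True
    if not match_equality:
        return False
    # Hypothetical sh:matchEquality true: fall back to numeric value comparison.
    target = numeric_value(required)
    return target is not None and any(
        numeric_value(v) == target for v in value_nodes
    )

# tr:installedCapacity of a plant, written as 0.0 rather than 0 in the data
capacities = [("0.0", XSD + "float")]
required = ("0", XSD + "float")

print(has_value(capacities, required))                       # sameTerm semantics
print(has_value(capacities, required, match_equality=True))  # equality semantics
```

Under sameTerm semantics the zero-capacity plant slips past the sh:not shape because "0.0" and "0" are different terms; with the flag it would be caught.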

@HolgerKnublauch, this sounds good, is it feasible to standardize dash:matchEquality?

@afs

Another option ("as well as", not "instead of") is to describe a validation mode. It would also cover the case where the data were to be canonicalized as some triplestores already do.

I'm less keen on this since it seems likely both modes could be needed for the same set of shapes.

afs commented 2 years ago

both modes could be needed for the same set of shapes.

But adding a triple as a flag does not work. In RDF, subgraphs can stand alone, so the shape without that triple must also work. If adding or deleting the triple changes the meaning of another triple (the actual constraint), that is a feature lost.

Use separate properties, or canonicalize the data (which may mean canonicalizing it only logically).

HolgerKnublauch commented 2 years ago

@afs I don't understand this argument. There are already some constraint components that take multiple arguments, e.g. sh:closed and sh:qualifiedValueShape.

With canonical data, some choices have already been made when the data graph triples were added, so I guess the SHACL engine would simply need to ask the graph whether it has canonicalized the values or not and then probably also canonicalize the values mentioned in the shapes graph. For such data graphs, I guess it doesn't even have the option to compare using sameTerm.
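As an illustration of the canonicalization route, the sketch below rewrites literals to a canonical lexical form so that sameTerm comparison afterwards coincides with value equality. It assumes literals are (lexical form, datatype IRI) pairs and covers only xsd:integer and xsd:boolean; everything else passes through unchanged. `canonicalize` is a hypothetical helper, not an API of any triplestore:

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Canonical lexical forms of xsd:boolean are "true" and "false".
BOOLEAN_CANONICAL = {"true": "true", "1": "true", "false": "false", "0": "false"}

def canonicalize(lit):
    """Return the literal with a canonical lexical form where this sketch knows one."""
    lexical, datatype = lit
    if datatype == XSD + "integer":
        # Drops leading zeros and an explicit "+" sign, e.g. "+01" -> "1".
        return (str(int(lexical)), datatype)
    if datatype == XSD + "boolean":
        return (BOOLEAN_CANONICAL.get(lexical, lexical), datatype)
    # Assumption: other datatypes are left untouched in this sketch.
    return lit

data_value = canonicalize(("01", XSD + "integer"))
shape_value = canonicalize(("+1", XSD + "integer"))

# After canonicalizing both graphs, plain term comparison is enough.
print(data_value == shape_value)
```

The same function would have to be applied to the values in the shapes graph, as noted above, or the two sides could still differ as terms.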

afs commented 2 years ago

In this case, the new property says "ignore how the other property is defined, do value matching". That is different to sh:pattern/sh:flags, where sh:flags is part of the description of the regex, because the definition of sh:pattern mentions sh:flags.

it has canonicalized the values

Or the data is presented to the SHACL engine as canonicalized. It can remain in term form. Yes, comparing using sameTerm is then not available.

While RDF is term-centric, it is going to be a bit convoluted to get right.