w3c / data-shapes

RDF Data Shapes WG repo
87 stars 33 forks source link

Improve validation rules for XSD integer datatypes #142

Open tpluscode opened 2 years ago

tpluscode commented 2 years ago

What

With SHACL 1.0 the sh:datatype constraint mandates exact match. That is, given a property shape with sh:datatype xsd:int, it will fail the check against the a value ""10"^^xsd:integer even though the type xsd:int is subtype of xsd:integer

As a reminder, here's the XSD types hierarchy from https://www.w3.org/TR/xmlschema-2/

type-hierarchy

Why

This is problematic because it turns out that multiple triplestore implementations will proactively "normalise" integer types on insert. As above, inserting "10"^^xsd:int will in fact store "10"^^xsd:integer in the database. I tried Fuseki (TDB1), Stardog and Allegrograph. It seems that all integer and derived datatypes will treated this way as long as their lexical form matches the given datatype. "-10"^^xsd:negativeInteger becomes "-10"^^xsd:integer etc.

This makes it impossible to roundtrip values created by a shape as shown below because when the resource comes back from the store, the type of schema:age will no longer match that of the property shape

<>
  a sh:property [
    sh:path schema:age ;
    sh:datatype xsd:nonNegativeInteger ;
  ] .

Proposal

I propose to prescribe to shacl processors that they should also "normalise" the property shapes but deconstructing xsd integer datatypes to xsd:integer and appropriate min/max as defined by XML Schema (unless explicitly provided by the shape itself)

sh:datatype default sh:minInclusive default sh:maxInclusive
xsd:long -9223372036854775808 9223372036854775807
xsd:int -2147483648 2147483647
xsd:short -32768 32767
xsd:byte -128 127
xsd:nonPositiveInteger none 0
xsd:negativeInteger none -1
xsd:nonNegativeInteger 0 none
xsd:positiveInteger 1 none
xsd:unsignedLong 0 18446744073709551615
xsd:unsignedInt 0 4294967295
xsd:unsignedShort 0 65535
xsd:unsignedInt 0 255
HolgerKnublauch commented 2 years ago

Thinking out loud, I expect it would be hard for SHACL 1.1 to break SHACL 1.0 semantics. So for this to work without breaking the existing logic I guess SHACL 1.1 could introduce a second, optional flag for sh:DatatypeConstraintComponent to also accept, e.g. xsd:int when xsd:integer was requested.

So maybe sh:datatype xsd:integer ; sh:datatypeNormalization true ;

Furthermore, triple stores could indicate whether they can round-trip datatypes correctly and if not the SHACL engine can set the above flag true as the default.

tpluscode commented 2 years ago

I don't necessarily agree that it's breaking 1.0 semantics for you cannot break what's already broken.

I think what stores do make sense from the perspective of XSD and robustness principle. All types derived from xsd:integer are essentially that plus range constraints. That is pretty much XSD's definition. Being "liberal in what we accept" means a validator should treat "10"^^xsd:integer as a valid xsd:byte

Thus, I conclude that the current behaviour of SHACL is flawed and deserves as minor version fix.

Furthermore, triple stores could indicate whether they can round-trip datatypes correctly and if not the SHACL engine can set the above flag true as the default.

I don't think this is practical. A web app which dereferenced the resource and shape has no idea whatsoever about backing triplestores

afs commented 2 years ago

Useful table.

A second flag to modify the constraint is better IMO. sh:datatypeByValue ("normalization" is usually more like a change to canonical form "0001" and "1" but same datatype).

As this does work today in many places, if not all, and some systems may currently wish to validate the exact datatype used (xsd:nonNegativeInteger is meaningful as it isn't due to the arbitrary 32 bit/64 bit issues), I don't think it can be classified as "broken".

The majority user feedback on TDB1 was to keep the datatype. Normalizing the lexicial part was just accepted.

tpluscode commented 2 years ago

I think I will stand by "broken" since the base semantics of datatypes is defined by XSD spec. The stores will turn derived literal with XSD types to xsd:integer because that is what they are (the literals, not store). Of course, not everyone necessarily uses triple stores at all but it will be a major use case. And TDB is not the only implementation which does that.

Like I said, if I create a graph and it does not validate after round tripping from my database then this is a bug, not a feature. But I hear you, if the concern is breaking users depending on the current spec, then I'm probably proposing a change for SHACL 2.0 :)

Ah, and you make a good point about comparing the lexical form too. We could consider more holistic improvements, for example also to sh:hasValue. Consider https://s.zazuko.com/2boDM, where sh:hasValue 1 fails to validate "0001"^^xsd:integer

HolgerKnublauch commented 2 years ago

@tpluscode well, if the round-tripping does not work then it's a problem that the graph stores need to solve, not languages like SHACL downstream. I guess other systems like OWL (with its xsd:nonNegativeIntegers everywhere) would have similar issues. Fix it at the lowest possible level.

I like the suggested sh:datatypeByValue property.

TallTed commented 2 years ago

[@tpluscode] given a property shape with sh:datatype xsd:int, it will fail the check against the a value ""10"^^xsd:integer even though the type xsd:int is subtype of xsd:integer

I'm surprised no-one else flagged this, but I think your example is backwards.

That is, even were subtype inferences active, it would be correct for ""10"^^xsd:integer to fail against sh:datatype xsd:int, because while all xsd:int are xsd:integer, some xsd:integer are not xsd:int, as that is the definition of a subtype.

On the other hand, subtype inferencing might enable ""10"^^xsd:int to succeed against sh:datatype xsd:integer.

afs commented 2 years ago

@TallTed I see your point but it can also be seen as a use case for validation by value.

F&O does not treat derived types of integers specially and any operation returns xsd:integers, "0"^^xsd:int+"10"^^xsd:int is "10"^^xsd:integer.