w3c / data-shapes

RDF Data Shapes WG repo

Where should my ontology go? Data graph versus shapes graph #155

Closed by wouterbeek 4 months ago

wouterbeek commented 4 months ago

Observation

According to the SHACL standard, two graphs are relevant for validation: the data graph and the shapes graph. The ontology should be part of the data graph:

The data graph is expected to include all the ontology axioms related to the data and especially all the rdfs:subClassOf triples in order for SHACL to correctly identify class targets and validate Core SHACL constraints.

This seems counter-intuitive to me, since I associate the ontology more with the shapes graph. For example, a shapes graph can owl:imports an ontology.

Example

To illustrate my unease, let's take the following data graph:

prefix id: <https://example.com/>
prefix foaf: <http://xmlns.com/foaf/0.1/>

id:john a foaf:Person.

And the following shapes graph:

prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix sh: <http://www.w3.org/ns/shacl#>

[] sh:targetClass foaf:Agent;
   sh:property
     [ sh:minCount 1;
       sh:path foaf:name ].

Adding the following ontology graph to the data graph is crucial. Without it, id:john is not recognized as an instance of foaf:Agent, so the shape never targets him and the missing foaf:name statement goes unreported:

prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

foaf:Person rdfs:subClassOf foaf:Agent.
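
The effect can be reproduced with a minimal sketch in plain Python. It is not a SHACL engine: triples are modeled as tuples of prefixed names, and the helper names (`subclasses_of`, `class_targets`, `violations`) are hypothetical. It only mimics the rule that class targets and the subclass hierarchy are both read from the data graph.

```python
# Triples are plain (subject, predicate, object) tuples of prefixed names.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

data = {("id:john", RDF_TYPE, "foaf:Person")}
ontology = {("foaf:Person", SUBCLASS_OF, "foaf:Agent")}


def subclasses_of(cls, graph):
    """Return cls plus every class reachable via rdfs:subClassOf in graph."""
    found, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for s, p, o in graph:
            if p == SUBCLASS_OF and o == current and s not in found:
                found.add(s)
                todo.append(s)
    return found


def class_targets(target_class, graph):
    """Nodes typed as target_class or one of its subclasses, all read from graph."""
    classes = subclasses_of(target_class, graph)
    return {s for s, p, o in graph if p == RDF_TYPE and o in classes}


def violations(target_class, path, graph):
    """Targets with no value for path, i.e. nodes failing sh:minCount 1."""
    return {
        node
        for node in class_targets(target_class, graph)
        if not any(s == node and p == path for s, p, o in graph)
    }


without_ontology = violations("foaf:Agent", "foaf:name", data)
with_ontology = violations("foaf:Agent", "foaf:name", data | ontology)
print(without_ontology)  # set()
print(with_ontology)     # {'id:john'}
```

Only when the subclass triple is merged into the data graph does id:john become a target of the shape.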

Use case

I have a specific use case where this comes up: in TriplyETL we stream through the instance data. The stream passes along millions of small data graphs. For each of these data graphs, we have to add the ontology before it can be validated in-stream. In this use case, it makes more sense to add the ontology to the shapes graph once, and use that same shapes graph to validate all data graphs that pass by.
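
A rough sketch of what "add the ontology once" could look like, in plain Python with triples as tuples. This is not TriplyETL's API and it deliberately departs from SHACL 1.0, which reads rdfs:subClassOf from the data graph; `make_validator` and `subclass_closure` are hypothetical names used for illustration only.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def subclass_closure(target_class, ontology):
    """Return target_class plus all its transitive subclasses in ontology."""
    found, todo = {target_class}, [target_class]
    while todo:
        current = todo.pop()
        for s, p, o in ontology:
            if p == SUBCLASS_OF and o == current and s not in found:
                found.add(s)
                todo.append(s)
    return found


def make_validator(target_class, path, ontology):
    """Resolve the class hierarchy once and return a per-graph check."""
    classes = frozenset(subclass_closure(target_class, ontology))

    def validate(data_graph):
        # The ontology is never merged into data_graph; only the
        # precomputed class set is consulted.
        targets = {s for s, p, o in data_graph if p == RDF_TYPE and o in classes}
        with_value = {s for s, p, o in data_graph if p == path}
        return sorted(targets - with_value)

    return validate


ontology = {("foaf:Person", SUBCLASS_OF, "foaf:Agent")}
validate = make_validator("foaf:Agent", "foaf:name", ontology)

# Two small data graphs standing in for the stream.
stream = [
    {("id:john", RDF_TYPE, "foaf:Person")},
    {("id:mary", RDF_TYPE, "foaf:Person"), ("id:mary", "foaf:name", "Mary")},
]
results = [validate(graph) for graph in stream]
print(results)  # [['id:john'], []]
```

The hierarchy is walked once, however many data graphs pass by, which is the saving the use case is after.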

Expected

I expect either of the following:

HolgerKnublauch commented 4 months ago

I had similar concerns for a while, and many people stumble over the semantics of sh:class. In 99% of our scenarios, data graph == shapes graph, but I can understand that there are plenty of other scenarios, as you describe.

Maybe for 1.2 we could say that rdfs:subClassOf triples may be in either data graph OR shapes graph? What do others think?

simonstey commented 4 months ago

"may be in either data graph OR shapes graph?"

or that the validation itself works on dg ∪ sg so it doesn't really matter?

fwiw pyshacl also mixes an optional ontology graph into the data graph -> https://github.com/RDFLib/pySHACL/blob/master/pyshacl/validate.py#L211-L222

HolgerKnublauch commented 4 months ago

If validation gets executed against some remote SPARQL endpoint, it is difficult/impossible to require a union of shapes and data, as the shapes graph may not even exist as a named graph on that database. Also, if a shapes graph is large, you don't want all nodes from there to be validated each time.

But for selected operations that require the subclass hierarchy, such as computing target nodes and sh:class, having more flexibility should work. For example, an engine that produces SPARQL queries can look at the class definitions in the shapes graph to produce a VALUES clause etc. In my own implementation, I also use some helper data structures for classes to make the computation of class membership more efficient.
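
The VALUES idea can be sketched roughly as follows. This is an illustration in plain Python, not code from any existing engine: `target_query` and `subclasses_of` are hypothetical helpers, the shapes graph is modeled as a set of tuples, and the foaf prefix is hard-coded for the example.

```python
SUBCLASS_OF = "rdfs:subClassOf"


def subclasses_of(cls, graph):
    """Return cls plus every class reachable via rdfs:subClassOf in graph."""
    found, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for s, p, o in graph:
            if p == SUBCLASS_OF and o == current and s not in found:
                found.add(s)
                todo.append(s)
    return found


def target_query(target_class, shapes_graph):
    """Build a SPARQL query that selects the target nodes of target_class.

    The subclass hierarchy is resolved locally from shapes_graph, so the
    remote endpoint needs neither the ontology nor the shapes.
    """
    classes = " ".join(sorted(subclasses_of(target_class, shapes_graph)))
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT DISTINCT ?node WHERE {\n"
        f"  VALUES ?class {{ {classes} }}\n"
        "  ?node a ?class .\n"
        "}"
    )


shapes_graph = {("foaf:Person", SUBCLASS_OF, "foaf:Agent")}
query = target_query("foaf:Agent", shapes_graph)
print(query)
```

Because the class list is expanded before the query is sent, the endpoint only has to match rdf:type triples, and no union of shapes and data is needed on the database side.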

HolgerKnublauch commented 4 months ago

BTW this is the wrong place to report suggestions for SHACL 1.2. Would someone please create a dedicated ticket on

https://github.com/w3c/shacl/issues

and then close the ticket here?

wouterbeek commented 4 months ago

Opened over at https://github.com/w3c/shacl/issues/34