w3c / shacl

SHACL Community Group (Post-REC activities)

Where should my ontology go? Data graph versus shapes graph #34

Open wouterbeek opened 4 months ago

wouterbeek commented 4 months ago

Originally posed over at https://github.com/w3c/data-shapes/issues/155; also see the comments by others over there.

Observation

According to the SHACL standard, two graphs are relevant for validation: the data graph and the shapes graph. The ontology should be part of the data graph:

The data graph is expected to include all the ontology axioms related to the data and especially all the rdfs:subClassOf triples in order for SHACL to correctly identify class targets and validate Core SHACL constraints.

This seems counter-intuitive to me, since I associate the ontology more with the shapes graph. For example, a shapes graph can import an ontology with owl:imports.

Example

To illustrate my unease, let's take the following data graph:

prefix id: <https://example.com/>
prefix foaf: <http://xmlns.com/foaf/0.1/>

id:john a foaf:Person.

And the following shapes graph:

prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix sh: <http://www.w3.org/ns/shacl#>

[] sh:targetClass foaf:Agent;
   sh:property
     [ sh:minCount 1;
       sh:path foaf:name ].

Adding the following ontology graph to the data graph is crucial. Without it, validation does not report the data graph as invalid, even though id:john is missing a foaf:name statement:

prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

foaf:Person rdfs:subClassOf foaf:Agent.
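
The effect of that one triple can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not a SHACL engine: graphs are modelled as sets of (subject, predicate, object) string tuples, the shape is passed in as a target class plus a path, and the helper names (`superclasses`, `focus_nodes`, `min_count_violations`) are made up for this example.

```python
# Illustration only: triples are plain tuples, prefixed names are plain strings.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

def superclasses(cls, graph):
    """Return cls plus every class reachable from it via rdfs:subClassOf in graph."""
    seen, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for s, p, o in graph:
            if s == current and p == SUBCLASS_OF and o not in seen:
                seen.add(o)
                todo.append(o)
    return seen

def focus_nodes(target_class, data_graph):
    """Nodes typed as target_class or one of its subclasses, judged from data_graph only."""
    return {
        s for s, p, o in data_graph
        if p == RDF_TYPE and target_class in superclasses(o, data_graph)
    }

def min_count_violations(target_class, path, data_graph):
    """Focus nodes without any value for path (the sh:minCount 1 case)."""
    return {
        node for node in focus_nodes(target_class, data_graph)
        if not any(s == node and p == path for s, p, _ in data_graph)
    }

data = {("id:john", RDF_TYPE, "foaf:Person")}
ontology = {("foaf:Person", SUBCLASS_OF, "foaf:Agent")}

# Without the ontology, id:john is never selected as a focus node.
without_ontology = min_count_violations("foaf:Agent", "foaf:name", data)
# With the ontology merged into the data graph, the missing name is reported.
with_ontology = min_count_violations("foaf:Agent", "foaf:name", data | ontology)
```

In this sketch `without_ontology` is empty, while `with_ontology` contains `id:john`.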

Use case

I have a specific use case where this comes up: in TriplyETL we stream through the instance data. The stream passes along millions of small data graphs. For each of these data graphs, we have to add the ontology before the data graph can be validated in-stream. In this use case, it makes more sense to add the ontology to the shapes graph once, and use that same shapes graph to validate all data graphs that pass by.
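
A rough sketch of what this could look like, assuming the class hierarchy is read once and then reused for every chunk of the stream. This is not how TriplyETL is implemented. The function names and the `id:mary` record are hypothetical, graphs are sets of string tuples, and the sketch deliberately ignores any rdfs:subClassOf triples inside the chunks themselves, which differs from what the standard currently prescribes.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

def subclass_closure(ontology):
    """Map each subclass to the set containing itself and all transitive superclasses."""
    direct = {}
    for s, p, o in ontology:
        if p == SUBCLASS_OF:
            direct.setdefault(s, set()).add(o)
    closure = {}
    for cls in direct:
        seen, todo = {cls}, [cls]
        while todo:
            for parent in direct.get(todo.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    todo.append(parent)
        closure[cls] = seen
    return closure

def validate_stream(chunks, closure, target_class, path):
    """Yield (chunk_index, violating_nodes) for a sh:minCount 1 check on each chunk."""
    for index, chunk in enumerate(chunks):
        targets = {
            s for s, p, o in chunk
            if p == RDF_TYPE and target_class in closure.get(o, {o})
        }
        missing = {
            node for node in targets
            if not any(s == node and p == path for s, p, _ in chunk)
        }
        yield index, missing

ontology = {("foaf:Person", SUBCLASS_OF, "foaf:Agent")}
closure = subclass_closure(ontology)  # computed once, reused for every chunk

chunks = [
    {("id:john", RDF_TYPE, "foaf:Person")},
    {("id:mary", RDF_TYPE, "foaf:Person"), ("id:mary", "foaf:name", "Mary")},
]
results = list(validate_stream(chunks, closure, "foaf:Agent", "foaf:name"))
```

The per-chunk work then no longer involves copying or merging the ontology. Only a dictionary lookup per rdf:type triple remains.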

Expected

I expect either of the following:

HolgerKnublauch commented 4 months ago

I guess rdfs:subClassOf triples are what matters here. They impact sh:class and class-based targets.

I believe we could change SHACL Core so that these triples will be considered from the union of data and shapes graphs.

Would this address your concern or are there other triples in the data graph that should also be in the shapes graph and vice versa?

bergos commented 4 months ago

Can't we define that all rdfs:subClassOf reasoning must happen in the shapes graph?

A union graph could make some edge cases, like validating constraints on a SHACL shape, difficult to process. A flag to enable that feature could solve it, but we should only consider it if there are use cases other than the rdfs:subClassOf reasoning.

HolgerKnublauch commented 4 months ago

If we were to ignore rdfs:subClassOf triples from the data graph then we would introduce a breaking change to SHACL, which is something we definitely want to avoid for this (incremental) release. Adding the shapes graph as an extra graph to process is less likely to break existing use cases. But even that is potentially breaking.
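
The compatibility argument can be made concrete with a small sketch that compares where rdfs:subClassOf triples are looked up. The three policy labels ("data-only", "union", "shapes-only") and all function names are invented for this illustration, and graphs are again sets of string tuples rather than real RDF graphs.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"
POLICIES = ("data-only", "union", "shapes-only")

def visible_hierarchy(data_graph, shapes_graph, policy):
    """Return the (subclass, superclass) pairs visible under the given policy."""
    sources = {
        "data-only": data_graph,                # current behaviour
        "union": data_graph | shapes_graph,     # union of data and shapes graphs
        "shapes-only": shapes_graph,            # subclass reasoning in the shapes graph only
    }
    return {(s, o) for s, p, o in sources[policy] if p == SUBCLASS_OF}

def is_instance_of(cls, target, hierarchy):
    """True if target is cls itself or reachable from cls through the hierarchy."""
    seen, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        if current == target:
            return True
        for sub, sup in hierarchy:
            if sub == current and sup not in seen:
                seen.add(sup)
                todo.append(sup)
    return False

def class_targets(target_class, data_graph, shapes_graph, policy):
    """Focus nodes selected by a class target under the given policy."""
    hierarchy = visible_hierarchy(data_graph, shapes_graph, policy)
    return {
        s for s, p, o in data_graph
        if p == RDF_TYPE and is_instance_of(o, target_class, hierarchy)
    }

instance = ("id:john", RDF_TYPE, "foaf:Person")
axiom = ("foaf:Person", SUBCLASS_OF, "foaf:Agent")

# Case A: the axiom sits in the data graph, as SHACL currently expects.
axiom_in_data = {
    policy: class_targets("foaf:Agent", {instance, axiom}, set(), policy)
    for policy in POLICIES
}
# Case B: the axiom sits in the shapes graph instead.
axiom_in_shapes = {
    policy: class_targets("foaf:Agent", {instance}, {axiom}, policy)
    for policy in POLICIES
}
```

In case A, "shapes-only" loses the focus node that "data-only" finds, which is the breaking change. In case B, "union" selects a focus node that "data-only" does not, so existing inputs can also change their result under the union approach.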

wouterbeek commented 4 months ago

An alternative solution is to introduce a third graph:

  1. Data graph, containing instances of the classes defined in (2) and (3).
  2. Shapes graph, containing node shapes and property shapes for the instance data in (1). This is the closed half of the data model (SHACL).
  3. Ontology graph, optionally containing classes and properties for the instance data in (1). This is the open half of the data model (RDFS/OWL).

In SHACL 1.0 graph 3 is never given.

In SHACL 1.1, it becomes possible to optionally specify graph 3. If graph 3 is specified, then all class and property statements (RDFS/OWL, including rdfs:subClassOf) are assumed to be in that graph. A user can choose to specify the same graph for (2) and (3).
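
As a sketch of how such an optional third graph might surface in a validator's interface: the function below takes an `ontology_graph` argument that defaults to `None`. When it is omitted, the class hierarchy is read from the data graph as in SHACL 1.0. When it is given, the hierarchy is read from that graph alone. The function names and signature are hypothetical and not taken from any existing SHACL library, and the shape is reduced to a target class and a path.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

def has_superclass(cls, target, class_graph):
    """True if target is cls or reachable from cls via rdfs:subClassOf in class_graph."""
    seen, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        if current == target:
            return True
        for s, p, o in class_graph:
            if s == current and p == SUBCLASS_OF and o not in seen:
                seen.add(o)
                todo.append(o)
    return False

def validate_min_count(data_graph, target_class, path, ontology_graph=None):
    """Return focus nodes lacking a value for path.

    ontology_graph=None models SHACL 1.0: subclass triples come from the data graph.
    Otherwise subclass triples come from ontology_graph only (the proposed graph 3).
    """
    class_graph = data_graph if ontology_graph is None else ontology_graph
    focus = {
        s for s, p, o in data_graph
        if p == RDF_TYPE and has_superclass(o, target_class, class_graph)
    }
    return {
        node for node in focus
        if not any(s == node and p == path for s, p, _ in data_graph)
    }

data = {("id:john", RDF_TYPE, "foaf:Person")}
ontology = {("foaf:Person", SUBCLASS_OF, "foaf:Agent")}

legacy = validate_min_count(data, "foaf:Agent", "foaf:name")
with_graph_3 = validate_min_count(data, "foaf:Agent", "foaf:name", ontology_graph=ontology)
```

Because the third argument is optional, existing two-graph calls keep their current result, and the data graph can stay free of ontology triples when graph 3 is supplied.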