w3c / data-shapes

RDF Data Shapes WG repo

Request: Enable SHACL Shapes to be defined as part of data #139

Open mgberg opened 2 years ago

mgberg commented 2 years ago

Motivation

I have run into several situations where I would find it helpful to have shapes be defined as part of data instead of at the schema level and have data effectively refer to the shapes that it should conform to.

For example, consider this function ontology. If you look at the descriptions for fno:Parameter and fno:Output they look very similar to sh:PropertyShape (or perhaps sh:Parameter) in spirit, and the class fno:Function is therefore like sh:NodeShape (or perhaps sh:Function). If these classes were additionally modeled as subclasses of the corresponding SHACL class, then the focus nodes of an instance of fno:Function (and sh:NodeShape) would be the instances of fno:Execution connected to it via fno:executes.
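To make that concrete, here is a rough sketch of what such data might look like. The subclass axioms are the assumption described above and are not part of the function ontology today; the ex: names (ex:sum, ex:sumLeftOperand, ex:left, ex:exec1) are hypothetical, and describing parameters via sh:path is just one possible modeling.

```turtle
@prefix ex:   <http://example.org/> .
@prefix fno:  <https://w3id.org/function/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Assumed modeling: not part of the function ontology as published.
fno:Function  rdfs:subClassOf sh:NodeShape .
fno:Parameter rdfs:subClassOf sh:PropertyShape .

# A function that doubles as the shape of its executions (hypothetical).
ex:sum
  a fno:Function ;
  sh:property ex:sumLeftOperand ;
.

ex:sumLeftOperand
  a fno:Parameter ;
  sh:path ex:left ;
  sh:datatype xsd:integer ;
  sh:minCount 1 ;
  sh:maxCount 1 ;
.

# The intended focus node of ex:sum, reached via fno:executes.
ex:exec1
  a fno:Execution ;
  fno:executes ex:sum ;
  ex:left 2 ;
.
```

Nothing in SHACL today tells a validator that ex:exec1 should be checked against ex:sum; that link is exactly what this request is about.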

Another potential use case for this would be some future state of the W3C Data Cube ontology. Data Structure Definitions could be changed to be represented as Node Shapes, and the focus nodes of one of these would be the Observations that are part of the DataSet(s) that have that Data Structure Definition. This would allow an arbitrary Data Structure Definition to be used to validate the DataSets that have that structure in addition to describing the structure. This would require the use of a path instead of a single predicate, but it's still a shape defined as part of the "data".

I've come up with a couple of possible ways this feature (or at least the beginnings of it) could be implemented. I haven't had the time to dig into SHACL validator implementation details, so I'm not sure how feasible either of these options is to implement in a current SHACL validator. I'm curious to know what others think of this idea and of either approach.

Potential Implementation 1: Constraint Component

One possible implementation would be to create a new Constraint Component that functions somewhat like sh:node, using a property perhaps called sh:nodePath. However, instead of specifying the URI of a Node Shape that focus nodes must also conform to, it specifies a SHACL path pointing to Node Shape(s) that focus nodes must also conform to.

This would enable the following additions for the function ontology:

fno:Execution
  sh:nodePath fno:executes ;
.

Or these additions for the Data Cube ontology:

qb:Observation
  sh:nodePath (
    qb:dataSet
    qb:structure
  ) ;
.

This has the benefit of applying to all instances of qb:Observation regardless of the specific Data Structure Definition relevant to a given Observation. Note that this doesn't apply a targeting rule based on a path independently; if there were multiple classes/shapes that would also conform to a shape at the same path, that would have to be expressed multiple times. That's not a dealbreaker though, just a comment.
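As a sketch of the intended semantics for the function ontology case, assuming fno:Execution is declared as both a class and a node shape so that it targets its own instances: sh:nodePath is the proposed property and does not exist in SHACL, and ex:sum, ex:exec1 and ex:Exec1Check are hypothetical names.

```turtle
@prefix ex:   <http://example.org/> .
@prefix fno:  <https://w3id.org/function/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .

# Proposed: every instance of fno:Execution must conform to the
# shape(s) found by following fno:executes from that instance.
fno:Execution
  a rdfs:Class, sh:NodeShape ;   # implicit class target
  sh:nodePath fno:executes ;     # proposed property, not in SHACL
.

# Hypothetical data: ex:sum is a function that is also a node shape.
ex:exec1
  a fno:Execution ;
  fno:executes ex:sum ;
.

# For ex:exec1, the proposal would behave roughly like this shape,
# which can already be written in SHACL today, one per focus node:
ex:Exec1Check
  a sh:NodeShape ;
  sh:targetNode ex:exec1 ;
  sh:node ex:sum ;
.
```

The point of sh:nodePath is that the last shape never has to be written out, because the shape to check is looked up from the data.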

My main reservation with this approach is how sh:node failures are reported: the error message generally states only that validation failed, not why it failed. For example, the original SHACL Playground example reports "Value does not have shape schema:AddressShape" instead of the underlying error "Value is not >= 10000". I know sh:detail exists, and I hope it could be used here to provide useful messaging and that more validators will take advantage of it in the future.
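For reference, this is roughly the kind of report I have in mind, where sh:detail carries the nested result. It is only a sketch based on the Playground example: the ex: node and shape names and the postal code value are made up, and whether a given validator emits sh:detail at all is implementation-dependent.

```turtle
@prefix ex:     <http://example.org/> .
@prefix schema: <http://schema.org/> .
@prefix sh:     <http://www.w3.org/ns/shacl#> .

[] a sh:ValidationReport ;
  sh:conforms false ;
  sh:result [
    a sh:ValidationResult ;
    sh:resultSeverity sh:Violation ;
    sh:focusNode ex:Bob ;                      # hypothetical node
    sh:resultPath schema:address ;
    sh:value ex:BobsAddress ;                  # hypothetical node
    sh:sourceShape ex:PersonAddressShape ;     # hypothetical shape name
    sh:sourceConstraintComponent sh:NodeConstraintComponent ;
    sh:resultMessage "Value does not have shape schema:AddressShape" ;
    # The nested result explains why the sh:node check failed.
    sh:detail [
      a sh:ValidationResult ;
      sh:resultSeverity sh:Violation ;
      sh:focusNode ex:BobsAddress ;
      sh:resultPath schema:postalCode ;
      sh:value 9404 ;                          # made-up value
      sh:sourceShape ex:PostalCodeShape ;      # hypothetical shape name
      sh:sourceConstraintComponent sh:MinInclusiveConstraintComponent ;
      sh:resultMessage "Value is not >= 10000" ;
    ] ;
  ] ;
.
```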

Potential Implementation 2: Target Type

I realize that this option really pushes the limits of what targets are supposed to do, which is part of the reason why I think the first approach is cleaner and more practical, but I thought I might as well include it anyway. This involves the creation of a new flavor of Target Type. Here's a (non-functional) prototype:

sh:PathFromShapeTarget
  a sh:TargetType ;
  rdfs:subClassOf sh:Target ;
  sh:parameter [
    sh:path sh:path ;
    sh:description "The path connecting one or more shapes to their focus nodes" ;
  ] ;
  sh:select """
    SELECT ?this ?currentShape
    WHERE {
      ?currentShape $PATH ?this .
      FILTER EXISTS {?currentShape a/rdfs:subClassOf* sh:NodeShape}  # This may or may not be necessary
    }
    """ ;
.

A target of this type then takes a SHACL path and injects it into the SPARQL query, much as the path is substituted in a sh:SPARQLSelectValidator (assuming a SPARQL-based implementation). So for the example function ontology:

fno:FunctionTarget
  a sh:PathFromShapeTarget ;
  sh:path [
    sh:inversePath fno:executes ;
  ] ;
.

Or for the Data Cube example:

qb:DataStructureDefinitionTarget
  a sh:PathFromShapeTarget ;
  sh:path [
    sh:inversePath (
      qb:dataSet
      qb:structure
    ) ;
  ] ;
.

Now, I realize that under the current behavior of Node Shapes and Targets, each and every fno:Function or qb:DataStructureDefinition would need an extra triple connecting it to the example Target above via sh:target. While that could work, the Target itself already captures the idea that the specified path connects shapes to focus nodes, independently of any specific shape, so requiring that extra triple every time feels redundant.
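To illustrate the redundancy, this is a sketch of what each function would need under the current sh:target behavior. ex:sum and ex:product are hypothetical functions, and fno:FunctionTarget is the example target defined above.

```turtle
@prefix ex:  <http://example.org/> .
@prefix fno: <https://w3id.org/function/ontology#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .

# Every function repeats the same sh:target triple, even though the
# target definition is identical for all of them.
ex:sum
  a fno:Function ;
  sh:target fno:FunctionTarget ;
.

ex:product
  a fno:Function ;
  sh:target fno:FunctionTarget ;
.
```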

Therefore, it would be nice if that extra triple could be made unnecessary, which is why I added the ?currentShape variable to the query: either a shape could be bound to ?currentShape to get its focus nodes, or a (potential) focus node could be bound to ?this to obtain the shapes that apply to that node via the path. This would have the benefit of applying the targeting rule to all occurrences of the property or property path independently, unlike the first approach, where the property or path would need to be specified on each shape it applies to.

mgberg commented 1 year ago

I noticed that the function tosh:hasShape currently exists in TopBraid EDG. This function, wrapped as a custom constraint component, could serve as a basic version of Potential Implementation 1 described above. A start for an implementation of this is the following:

ex:BasicNodePathConstraintComponent
  a sh:ConstraintComponent ;
  sh:message "Value {$value} does not conform to a shape at path {$nodePath}" ;
  sh:parameter [
      a sh:Parameter ;
      sh:path sh:nodePath ;
    ] ;
  sh:validator [
      a sh:SPARQLAskValidator ;
      sh:ask """
ASK {
    $value $nodePath $shape .
    FILTER (tosh:hasShape($value, $shape))
}""" ;
      sh:prefixes <http://topbraid.org/tosh> ;
    ] ;
.

This example implementation is equivalent to Potential Implementation 1 with (at least) two limitations:

  1. This would only support the URI of a predicate as a parameter, not an arbitrary valid SHACL path.
  2. Error messaging returned from this would be the limited sh:message above, not the messages generated from the violated shapes as if it were an actual target.

While this could be a workaround for this use case, its usefulness is limited by the points above and by the fact that it would only work in TopBraid.