mff-uk / dataspecer

https://dataspecer.com
MIT License

SHACL targeting support #353

Open LadyMalande opened 6 months ago

LadyMalande commented 6 months ago

Analysis of possible SHACL targeting

To validate data, a SHACL shape needs a target that selects the nodes in the data (the focus nodes) against which the shape is checked. Starting from a targeted node, the shape describes the structure the data must conform to. The target does not have to be the root of the data structure, because SHACL can also follow properties that point into the targeted node, rather than only describing the structure downwards from it.

Main cases for SHACL targeting

This part describes the main targeting cases, in which the standard SHACL constructs can be used directly.

1) The root of the structure ALWAYS has a unique type

If the rdf:type of the root is unique within the structure, the root can be targeted by sh:targetClass. This works only when "Explicitní určení typu instancí" (explicit instance typing) is set to vyžadováno (required), because if the type statement is missing in the data, the data won't be checked against the generated shape. Each value of sh:targetClass in a shape is an IRI.

2) The root of the structure contains a UNIQUE predicate leading from it

If the root has a unique predicate among its attributes with cardinality [1..x], that predicate can be used for targeting by sh:targetSubjectsOf. If the cardinality is [0..x], the predicate may be absent from the data, and the shape then finds no focus node. The values of sh:targetSubjectsOf in a shape are IRIs.
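Cases 1) and 2) both place the target directly on the root shape. A minimal sketch of what the generated Turtle could look like, assuming a hypothetical helper `renderDirectTarget` (this is not the generator's actual API) and made-up example IRIs; the `sh:` prefix declaration is omitted for brevity:

```typescript
// Hypothetical helper for illustration; not part of the dataspecer code base.
// It renders a SHACL node shape with a direct target as Turtle text.
type DirectTarget =
  | { kind: "class"; classIri: string } // case 1: sh:targetClass
  | { kind: "subjectsOf"; predicateIri: string }; // case 2: sh:targetSubjectsOf

function renderDirectTarget(shapeIri: string, target: DirectTarget): string {
  const targetLine =
    target.kind === "class"
      ? `sh:targetClass <${target.classIri}>`
      : `sh:targetSubjectsOf <${target.predicateIri}>`;
  return `<${shapeIri}> a sh:NodeShape ;\n    ${targetLine} .`;
}

// Example IRIs are made up.
const rootByClass = renderDirectTarget("https://example.org/shapes/Root", {
  kind: "class",
  classIri: "https://example.org/vocab/Catalog",
});
const rootByPredicate = renderDirectTarget("https://example.org/shapes/Root", {
  kind: "subjectsOf",
  predicateIri: "https://example.org/vocab/hasDataset",
});
```

In both variants the target value is written as an IRI, matching the note above that the values of sh:targetClass and sh:targetSubjectsOf are IRIs.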

3) The root has an attribute which ALWAYS has a UNIQUE type

If the rdf:type of the attribute's class is unique, that class can be targeted by sh:targetClass. This works only when "Explicitní určení typu instancí" (explicit instance typing) is set to vyžadováno (required) for the targeted attribute class, because if the type statement is missing in the data, the data won't be checked against the generated shape. The only difference from 1) is that the structure above the targeted class has to be reached using inverse paths (sh:path [ sh:inversePath ex:parent ]). Each value of sh:targetClass in a shape is an IRI.

4) The root has an attribute that contains a UNIQUE predicate leading from it

If the targeted attribute has a unique predicate among its attributes with cardinality [1..x], that predicate can be used for targeting by sh:targetSubjectsOf. The only difference from 2) is that the structure above the targeted node has to be reached using inverse paths (sh:path [ sh:inversePath ex:parent ]). The values of sh:targetSubjectsOf in a shape are IRIs.
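For cases 3) and 4), the target sits on a child of the root and the root is reached backwards. The sketch below shows one possible way to express this, assuming a hypothetical helper `renderChildTargetedShape` and made-up IRIs; linking the root through `sh:node` is an assumption of this example, not necessarily what the generator emits:

```typescript
// Hypothetical helper for illustration; not part of the dataspecer code base.
// The child shape carries the target; the root is reached by walking the
// connecting predicate backwards with sh:inversePath and is then checked
// against the root shape via sh:node.
function renderChildTargetedShape(
  childShapeIri: string,
  targetLine: string, // "sh:targetClass <...>" or "sh:targetSubjectsOf <...>"
  rootToChildPredicateIri: string,
  rootShapeIri: string
): string {
  return [
    `<${childShapeIri}> a sh:NodeShape ;`,
    `    ${targetLine} ;`,
    `    sh:property [`,
    `        sh:path [ sh:inversePath <${rootToChildPredicateIri}> ] ;`,
    `        sh:node <${rootShapeIri}>`,
    `    ] .`,
  ].join("\n");
}

// Example IRIs are made up.
const childTargetedShape = renderChildTargetedShape(
  "https://example.org/shapes/Address",
  "sh:targetClass <https://example.org/vocab/Address>",
  "https://example.org/vocab/hasAddress",
  "https://example.org/shapes/Person"
);
```

The rest of the root's structure can then stay in the root shape unchanged; only the entry point moves to the child.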

Outliers for SHACL targetting

The section below discusses cases where the standard SHACL targeting constructs don't work.

1) There are NO unique types

No class in the whole structure has a unique type. In that case, the same rdf:type appears on different nodes at different levels of the structure, and all of them will be targeted by sh:targetClass. The aim, however, is to check the structure starting from only one of them, so the other focus nodes will produce false validation failures. In that case, let's try the second option and look for unique predicates.
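The effect can be illustrated with a small simulation of how sh:targetClass picks focus nodes. This is a simplified sketch with made-up data and a hypothetical function name; real SHACL processors also select instances of subclasses, which is ignored here:

```typescript
// Simplified illustration; names and data are made up.
type Triple = { s: string; p: string; o: string };
const RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

// Returns every subject that is explicitly typed with the given class.
function focusNodesForTargetClass(triples: Triple[], classIri: string): string[] {
  const subjects = triples
    .filter((t) => t.p === RDF_TYPE && t.o === classIri)
    .map((t) => t.s);
  // Drop duplicates while keeping the original order.
  return subjects.filter((s, i) => subjects.indexOf(s) === i);
}

// The root and a nested node share the same type.
const sampleTriples: Triple[] = [
  { s: "ex:rootPerson", p: RDF_TYPE, o: "ex:Person" },
  { s: "ex:rootPerson", p: "ex:knows", o: "ex:nestedPerson" },
  { s: "ex:nestedPerson", p: RDF_TYPE, o: "ex:Person" },
];

// Both nodes become focus nodes, although only ex:rootPerson is the root.
const personFocusNodes = focusNodesForTargetClass(sampleTriples, "ex:Person");
```

Here the nested node would be validated against the root shape as well, which is the source of the false failures described above.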

2) There are also NO unique predicates

In the whole structure there is no unique predicate with cardinality [1..x] that would be guaranteed to always occur in the data.

3) There are unique types but the explicit typing of instances is not mandatory

If the unique class has "Explicitní určení typu instancí" (explicit instance typing) set to anything OTHER than vyžadováno (required), there is no guarantee that the data will have the rdf:type specified. In that case there is no node in the data that can be targeted with certainty.

4) There are unique predicates but they are OPTIONAL ... [0..x]

In case there are unique predicates but they are only optional with cardinality [0..x], there is no guarantee they will be present in the to-be-validated data.

If all outliers 1)-4) apply to the structure, basic SHACL offers no means of targeting a focus node from which to validate the data.

Solution to outliers

1) Tell the user to reconsider

If all techniques in the first section of this discussion fail, it might be good to simply TELL the user about the issue in an informative manner and ask them to SET the conditions in whichever way they feel most comfortable with, keeping the usage of their data structure in mind. Then the user may fix one of the problems at hand by:

2) Doing one of the suggested helping solutions automatically for the user

3) Having typed instances mandatory for the root

Typing could be made mandatory for the root of the data structure. If the user still changes this behavior, there could be a warning that the SHACL can't be generated unless the user fixes one of the problems. If the user doesn't want to change the data structure to allow for SHACL targeting, the final solution could be

Conclusion

This is a space for further discussion of what other custom final solution there could be to the targeting problem.

jakubklimek commented 6 months ago

This is a nice overview of the options. I think in many of the cases, a simple, but informative error while generating SHACL will be an OK solution at this stage. Of course, this analysis needs to be present in the user documentation of the SHACL generator.

I think the most common case will be non-problematic, i.e. root has a unique class and we can take it from there. The second most common case will be that there is the root class repeated somewhere in the data. In that case, the user can actually solve it e.g. by prohibiting typing in the non-root classes, making the root one unique.

I see the main cases 2, 3 and 4 as maybe a bit unintuitive for the user. If implemented, there should be some kind of a note in the generated shapes explaining what is happening and why, e.g. why a certain shape is targeting subjectsOf some, at first glance, "random" predicate to target the root.

The rest of the cases I would detect and write an (informative) error during generation for the time being. Later, if time permits, the individual cases can be improved.

LadyMalande commented 5 months ago

Implementation plan:

1) If the cimIRI of the root class is unique and typing is set to vyžadováno (required) in the structure, use sh:targetClass.
2) If the root has a unique predicate among its attributes with cardinality [1..x], use sh:targetSubjectsOf.
3) If an always-present [1..x] child of the root has a unique class and typing set to vyžadováno (required), use sh:targetClass on it and reference the root with the help of inversePath.
4) If an always-present [1..x] child of the root has a unique predicate leading from it, use sh:targetSubjectsOf and reference the root with the help of inversePath.
5) If none of the above holds, write an informative note telling the user what is missing from the structure for successful targeting, and display the note instead of the generated shape.
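The plan above amounts to an ordered fallback. A minimal sketch of that decision order follows, assuming a hypothetical, heavily simplified model of the structure (`StructureClass`, `RequiredChild`, `chooseTarget` and all field names are invented for this example and do not reflect the actual dataspecer data model):

```typescript
// Hypothetical model; all names are assumptions made for this sketch.
interface StructureClass {
  classIri: string;
  classIsUnique: boolean; // class IRI is not repeated elsewhere in the structure
  typingRequired: boolean; // explicit instance typing set to required
  uniqueRequiredPredicates: string[]; // unique predicates with cardinality [1..x]
}
interface RequiredChild {
  viaPredicate: string; // predicate from the root to the child, cardinality [1..x]
  node: StructureClass;
}
type TargetKeyword = "sh:targetClass" | "sh:targetSubjectsOf";
type TargetChoice =
  | { on: "root"; keyword: TargetKeyword; iri: string }
  | { on: "child"; keyword: TargetKeyword; iri: string; inversePath: string };

function chooseTarget(root: StructureClass, children: RequiredChild[]): TargetChoice {
  // 1) Unique root class with mandatory typing.
  if (root.classIsUnique && root.typingRequired) {
    return { on: "root", keyword: "sh:targetClass", iri: root.classIri };
  }
  // 2) Unique, always-present predicate on the root.
  const rootPredicate = root.uniqueRequiredPredicates[0];
  if (rootPredicate !== undefined) {
    return { on: "root", keyword: "sh:targetSubjectsOf", iri: rootPredicate };
  }
  // 3) Always-present child with a unique class and mandatory typing.
  for (const child of children) {
    if (child.node.classIsUnique && child.node.typingRequired) {
      return { on: "child", keyword: "sh:targetClass", iri: child.node.classIri, inversePath: child.viaPredicate };
    }
  }
  // 4) Always-present child with a unique, always-present predicate.
  for (const child of children) {
    const childPredicate = child.node.uniqueRequiredPredicates[0];
    if (childPredicate !== undefined) {
      return { on: "child", keyword: "sh:targetSubjectsOf", iri: childPredicate, inversePath: child.viaPredicate };
    }
  }
  // 5) Nothing usable: report what is missing instead of generating a shape.
  throw new Error(
    "Cannot generate SHACL: no unique class with required typing and no unique [1..x] predicate was found on the root or its required children."
  );
}

// Example: the root class is repeated, but a required child has a unique, typed class.
const repeatedRoot: StructureClass = {
  classIri: "ex:Person", classIsUnique: false, typingRequired: true, uniqueRequiredPredicates: [],
};
const typedChild: RequiredChild = {
  viaPredicate: "ex:hasAddress",
  node: { classIri: "ex:Address", classIsUnique: true, typingRequired: true, uniqueRequiredPredicates: [] },
};
const chosenTarget = chooseTarget(repeatedRoot, [typedChild]);
```

Checking all children for case 3 before trying case 4 keeps the order of the plan; an implementation could equally decide per child, which this sketch does not attempt to settle.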

jakubklimek commented 5 months ago

Yes, this seems like a sound plan. Regarding 5. - there is a mechanism in place for failing artifact generation so that it is clear that there is a problem instead of just generating a syntactically invalid note.

LadyMalande commented 4 months ago

The targeting for SHACL has been implemented. Cases 3 and 4 are due for revision, because in some situations the algorithm has not been implemented correctly. Case 5 throws an error [image] and gives more details in the developer console. Not closing the issue yet, as the implementation of cases 3 and 4 has to be revised.

jakubklimek commented 4 months ago

@LadyMalande Wouldn't it be better to at least hint at what went wrong with the generation? E.g. "Nepovedlo se vygenerovat SHACL pravidla kvůli nejasnosti v datové struktuře" ("Failed to generate SHACL rules because of an ambiguity in the data structure").