cmungall opened this issue 2 years ago
APPROACH:
Given the above, what is our approach?
First, there should be only one 'place' to do validation. We should either:

1. Fold linkml-validator into the main linkml repo
2. Keep linkml-validator as a separate package downstream of linkml (the status quo)
3. Split the validator framework into its own standalone repo
Approach 1 is probably quite easy for users, but has the downside that linkml-validator is no longer visible as a distinct product in itself. There is some value in having it as a distinct product, with its own documentation, credit, evolution, ...
Approach 2 may be quite confusing for users. The dependency chain is currently:
linkml-validator --[depends on]--> linkml --[depends on]--> linkml_runtime
Currently we have a large number of repos that have been set up with standard templates that make linkml a developer dependency and linkml-runtime a runtime dependency. If these repos want to take advantage of validation, they need to add an extra dependency.
Approach 3 may be desirable, but I am not sure it is possible: validation inherently takes advantage of generators, which live in linkml. I suppose the core validator framework could be in its own repo, with plugins depending on linkml.
Overall I am leaning towards 1, but we would want to make sure that @deepakunni3 is OK with this and that credits/history are sufficiently incorporated.
Ticket plan looks great.
This is a great issue to get up to speed on what has happened so far in terms of validation in linkml 🎉
@cmungall I agree with your comments above and have a few thoughts I wanted to highlight, as I try to recall the points that led me to write linkml-validator.
The `ValidationReport` and `ValidationResult` objects in linkml-validator were inspired by the validation vocabulary from linkml, but that vocabulary wasn't used as-is because it is catered to RDF and SHACL and thus uses keys/slots that are RDF-specific.
For example, `ValidationResult` has `subject`, `predicate`, and `object` slots, which are not relevant when you are validating with JSONSchema. One could always force the JSONSchema reports into this pattern, but it may be confusing to the typical developer who may not see data as a graph of S-P-O triples.

In linkml-validator, validation instead produces `ValidationResult` objects, where each `ValidationResult` has one or more `ValidationMessage`s. Each `ValidationMessage` is a unit of error specific to an object. In this way, a `ValidationResult` is a collection of validation messages from a validation operation (JSONSchema validation, SHACL validation, custom hand-coded validation). This may not be perfect, but it allows for supporting an ever-growing list of use-cases that may not necessarily be in the scope of linkml.

Recommendation: It would be great to explore whether we can merge the validation schema in linkml and linkml-validator so that we accommodate a wide range of use-cases.
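To make the contrast concrete, here is a minimal sketch in plain dataclasses. These are not the actual schemas; slot names are approximated from the descriptions above, and validation.yaml and linkml-validator remain the authoritative definitions.

```python
# Minimal sketch only: approximates the two validation result shapes
# discussed above; consult validation.yaml and linkml-validator for
# the authoritative definitions.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TripleStyleResult:
    """linkml validation datamodel: SHACL-oriented, S-P-O centric."""
    type: str                        # e.g. a sh: constraint component CURIE
    severity: str                    # e.g. ERROR / WARNING / INFO
    subject: str                     # the node being validated
    predicate: Optional[str] = None
    object: Optional[str] = None
    info: Optional[str] = None


@dataclass
class ValidationMessage:
    """linkml-validator: a unit of error specific to an object."""
    severity: str
    message: str
    field_name: Optional[str] = None


@dataclass
class ValidationResult:
    """linkml-validator: all messages from one validation operation."""
    plugin_name: str
    valid: bool
    validation_messages: List[ValidationMessage] = field(default_factory=list)


@dataclass
class ValidationReport:
    """linkml-validator: one report per validated object."""
    object_index: int
    valid: bool
    validation_results: List[ValidationResult] = field(default_factory=list)
```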
On the CLI side there is some duplication: the `linkml-validate` entry point maps to `linkml.validators.jsonschemavalidator:cli`; and there are also `linkml-jsonschema-validate` and `linkml-sparql-validate`.

Recommendation: It would be great if we can explore how to refactor the `linkml.validators` submodule. Ideally, linkml-validator should be able to fit in this submodule.
Happy to take a stab at this in the coming weeks :)
This is an "epic" ticket that attempts to unify a number of other separate issues and come up with a coherent solution
This first comment is about the background the second comment immediiately succeeding this is the approach
BACKGROUND:
There are currently a number of different validation strategies, all of which are implemented:
- `linkml-validate`, which is part of the main linkml distribution. This uses the jsonschema validator behind the scenes.

There are a few issues with the existing validator (see #892, #891).
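For concreteness, here is roughly what "jsonschema behind the scenes" amounts to. This is a sketch, not the CLI's actual code; `personinfo.yaml` is a placeholder schema path.

```python
# Sketch of JSONSchema-based validation of LinkML data; the real CLI
# wraps this with data loaders and report formatting.
import json

from jsonschema import Draft7Validator
from linkml.generators.jsonschemagen import JsonSchemaGenerator

# Derive JSON Schema from a LinkML schema (placeholder path)
json_schema = json.loads(JsonSchemaGenerator("personinfo.yaml").serialize())

data = {"persons": [{"id": "P:1", "name": 42}]}  # name should be a string

for error in Draft7Validator(json_schema).iter_errors(data):
    print(f"{list(error.absolute_path)}: {error.message}")
```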
Additionally, when using the loader framework to instantiate dataclasses, validation is also performed at the Python level. Complicating the picture, the current dataclass generation framework creates classes that perform silent coercion to correct data. While this follows Postel's law, it can be confusing, as it means that any validator that uses Python objects as an intermediate will not detect the classes of error that are silently repaired/coerced/fixed.
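A self-contained illustration of the kind of silent coercion meant here (mimicking, not reproducing, the generated dataclasses):

```python
# A scalar assigned to a multivalued slot is quietly wrapped in a
# list, so a downstream validator never sees the original error.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    id: str
    aliases: List[str] = field(default_factory=list)  # multivalued slot

    def __post_init__(self):
        # Postel-style repair: accept a bare scalar and coerce it
        if not isinstance(self.aliases, list):
            self.aliases = [self.aliases]
        self.aliases = [str(a) for a in self.aliases]


p = Person(id="P:1", aliases="Bob")   # wrong type, silently repaired
print(p.aliases)                      # ['Bob'] -- no error was reported
```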
@deepakunni3 wrote design principles for validation here https://linkml.io/linkml-validator/design/
These are great; I will regurgitate them verbatim here:
- The validator works through plugins, where each plugin performs one type of validation. For example, `JsonSchemaValidationPlugin` performs JSONSchema validation on one or more objects.
- There are two types of plugins that are supported: built-in plugins and external plugins. External plugins must extend the `BasePlugin` abstract class. This is to ensure that the external plugin class is compatible and plays well with the validator and other plugins.
- For each object validated, the validator produces a `ValidationReport` for that object.
- Each `ValidationReport` has one (or more) `ValidationResult`, where each `ValidationResult` is from one (or more) plugin.
- Each `ValidationResult` has one (or more) `ValidationMessage` (a structured message that describes the validation error).
- The validator should produce parseable validation messages.

I agree with all of these.
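As a rough illustration of the plugin contract described above (hypothetical signatures; the real API is in linkml-validator), reusing the `ValidationResult`/`ValidationMessage` dataclasses sketched earlier:

```python
# External plugins extend an abstract BasePlugin so the validator can
# treat built-in and custom plugins uniformly. Names and signatures
# here are illustrative, not the actual linkml-validator API.
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BasePlugin(ABC):
    """Contract every plugin must satisfy."""

    def __init__(self, schema: str) -> None:
        self.schema = schema

    @abstractmethod
    def process(self, obj: Dict[str, Any]) -> "ValidationResult":
        """Validate one object and return a ValidationResult."""


class RequiredIdPlugin(BasePlugin):
    """Toy custom plugin: every object must carry an 'id'."""

    def process(self, obj: Dict[str, Any]) -> "ValidationResult":
        messages: List[ValidationMessage] = []
        if "id" not in obj:
            messages.append(ValidationMessage(severity="ERROR",
                                              message="missing 'id'"))
        return ValidationResult(plugin_name="RequiredIdPlugin",
                                valid=not messages,
                                validation_messages=messages)
```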
Rather than "parseable validation messages" I would instead say that the validator should produce ValidationResult objects that conform to the standard validation data model. See validation.yaml. See also #368
The validation data model is not very well exposed at the moment, so I will describe it briefly here. It follows closely the SHACL validation shape model, and it uses the same vocabulary, e.g. http://www.w3.org/ns/shacl#MinCountConstraintComponent (see https://www.w3.org/TR/shacl/#core-components-shape).
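For example, a missing required slot might be reported along these lines (a plain-dict approximation; slot names should be checked against validation.yaml):

```python
# Approximate instance of the validation datamodel's ValidationResult,
# typed with a SHACL constraint component
min_count_violation = {
    "type": "sh:MinCountConstraintComponent",
    "severity": "ERROR",
    "subject": "Person:001",
    "predicate": "name",
    "info": "slot 'name' is required but no value was supplied",
}
```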
OAK uses an extension of this data model that we should fold back into LinkML.
Although the OAK implementation is hardwired for sqlite and OMO, I think it is instructive to look at it and combine it with @deepakunni3's design. Of particular note: the validator produces ValidationResult objects, which are typed with both severity and constraint violation type, and it is up to the wrapper tool/CLI to decide things such as how to serialize (csv vs yaml vs json vs rdf vs insert into a SQL database...) and which error or severity types to filter.
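A sketch of that division of labour, reusing the `TripleStyleResult` dataclass from the earlier sketch (the function name and options are illustrative): the validator yields typed results, and the wrapper decides filtering and serialization.

```python
# The wrapper, not the validator, chooses output format and filtering
import csv
import json
import sys
from typing import Iterable


def emit(results: Iterable[TripleStyleResult], fmt: str = "csv",
         severities: tuple = ("ERROR",)) -> None:
    rows = [r for r in results if r.severity in severities]
    if fmt == "json":
        json.dump([vars(r) for r in rows], sys.stdout, indent=2)
    else:
        writer = csv.writer(sys.stdout)
        for r in rows:
            writer.writerow([r.severity, r.type, r.subject, r.info])
```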
@matentzn is about to start using this in anger with Anita and their experience will help feed into our design here.
Of note, the OAK implementation will also perform "repair" coercion operations. See RepairOperation. Not all constraint violations can be repaired, and it is not always desirable to do the repair: sometimes it is better not to follow Postel's law and instead "fail fast".
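A sketch of that idea, loosely modeled on the RepairOperation mentioned above (hypothetical shape; see the OAK data model for the real one), again reusing `TripleStyleResult`: each repair is recorded as its own object, and violations that cannot or should not be auto-repaired are left to fail fast.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepairOperation:
    repairs: TripleStyleResult   # the violation being addressed
    modified: bool               # was anything actually changed?
    successful: bool
    info: Optional[str] = None


def try_repair(result: TripleStyleResult, dry_run: bool = True) -> RepairOperation:
    # Only a whitelist of violation types is considered safely auto-repairable
    repairable = {"sh:PatternConstraintComponent"}
    if result.type not in repairable:
        return RepairOperation(result, modified=False, successful=False,
                               info="not auto-repairable; failing fast")
    return RepairOperation(result, modified=not dry_run, successful=True,
                           info="value normalized to match the pattern")
```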