linkml / linkml

Linked Open Data Modeling Language
https://linkml.io/linkml

Provide unified validation and repair/coerce behavior #911

Open cmungall opened 2 years ago

cmungall commented 2 years ago

This is an "epic" ticket that attempts to unify a number of separate issues and come up with a coherent solution.

This first comment covers the background; the second comment, immediately following this one, covers the approach.

BACKGROUND:

There are currently a number of different validation strategies, all of which are implemented:

  1. Use linkml-validate, which is part of the main linkml distribution. This uses a jsonschema validator behind the scenes
  2. Use https://github.com/linkml/linkml-validator, written by @deepakunni3
  3. Convert a schema to SPARQL queries and run these over an RDF representation of the data
  4. Convert a schema to an external representation for framework X, and use X tools to validate
    • Example: convert a schema to SHACL and use pyshacl
    • Example: convert a schema to ShEx and use pyshex

There are a few issues with the existing validator (see #892, #891).

Additionally, when using the loader framework to instantiate dataclasses, validation is performed at the Python level. Complicating the picture, the current dataclass generation framework creates classes that silently coerce incorrect data into correct data. While this follows Postel's law, it can be confusing, because any validator that uses Python objects as an intermediate will not detect the classes of error that are silently repaired/coerced/fixed.
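To make the coercion problem concrete, here is a minimal sketch using a plain Python dataclass. The `Person` class, its `__post_init__` coercion, and `validate_person` are hypothetical stand-ins for illustration; they are not the actual output of the LinkML dataclass generator.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """Hypothetical stand-in for a generated class; not real LinkML output."""

    name: str
    age: int

    def __post_init__(self):
        # Silent coercion: a string such as "33" is quietly converted to int.
        if not isinstance(self.age, int):
            self.age = int(self.age)


def validate_person(obj: Person) -> list:
    """Validate an already-instantiated object; return a list of messages."""
    errors = []
    if not isinstance(obj.age, int):
        errors.append("age must be an integer")
    return errors


raw = {"name": "Alice", "age": "33"}  # age has the wrong type in the source data
person = Person(**raw)

# The validator only sees the repaired object, so the type error goes unreported.
errors_after_loading = validate_person(person)

# Checking the raw data before instantiation does detect the problem.
errors_on_raw = [] if isinstance(raw["age"], int) else ["age must be an integer"]
```

Here `errors_after_loading` is empty while `errors_on_raw` is not, which is the gap described above: the error exists in the data but is invisible once the object has been built.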

@deepakunni3 wrote design principles for validation here https://linkml.io/linkml-validator/design/

These are great, I will regurgitate verbatim here:

I agree with all of these.

Rather than "parseable validation messages" I would instead say that the validator should produce ValidationResult objects that conform to the standard validation data model. See validation.yaml. See also #368

The validation data model is not very well exposed at the moment, so I will describe it briefly here. It follows closely the SHACL validation shape model, and it uses the same vocabulary, e.g. http://www.w3.org/ns/shacl#MinCountConstraintComponent (see https://www.w3.org/TR/shacl/#core-components-shape).

OAK uses an extension of this data model, which we should fold back into LinkML.

Although the OAK implementation is hardwired for sqlite and OMO, I think it is instructive to look at it and combine it with @deepakunni3's design. Of particular note, the validator produces ValidationResult objects, which are typed with both a severity and a constraint violation type. It is then up to the wrapper tool/CLI to decide things such as how to serialize (csv vs yaml vs json vs rdf vs insert into a SQL database...) and which error or severity types to filter.
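A rough sketch of that separation, assuming a simplified result object: the field names, the `Severity` enum, and the toy `validate` rules below are hypothetical and are not taken from validation.yaml or from OAK. Only the two SHACL constraint component IRIs are real vocabulary terms.

```python
import csv
import io
import json
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Iterator, List

SH = "http://www.w3.org/ns/shacl#"


class Severity(Enum):
    """Assumed severity levels, ordered from least to most serious."""

    INFO = 1
    WARNING = 2
    ERROR = 3


@dataclass
class ValidationResult:
    """Illustrative result object; field names are assumptions."""

    subject: str
    severity: Severity
    constraint: str  # IRI of the violated constraint component
    message: str


def validate(records: Iterable[dict]) -> Iterator[ValidationResult]:
    """Toy validator: 'name' is required and should be at most 20 characters."""
    for rec in records:
        if "name" not in rec:
            yield ValidationResult(rec["id"], Severity.ERROR,
                                   SH + "MinCountConstraintComponent",
                                   "name is required")
        elif len(rec["name"]) > 20:
            yield ValidationResult(rec["id"], Severity.WARNING,
                                   SH + "MaxLengthConstraintComponent",
                                   "name is longer than 20 characters")


def to_row(result: ValidationResult) -> dict:
    """Flatten a result into plain strings for any serializer."""
    return {"subject": result.subject, "severity": result.severity.name,
            "type": result.constraint, "message": result.message}


# Serialization and filtering live in the wrapper, not in the validator.
def as_json(results: List[ValidationResult]) -> str:
    return json.dumps([to_row(r) for r in results], indent=2)


def as_csv(results: List[ValidationResult]) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["subject", "severity", "type", "message"])
    writer.writeheader()
    writer.writerows(to_row(r) for r in results)
    return buffer.getvalue()


records = [{"id": "P:1", "name": "Alice"}, {"id": "P:2"}, {"id": "P:3", "name": "x" * 30}]
results = list(validate(records))
errors_only = [r for r in results if r.severity.value >= Severity.ERROR.value]
```

The validator never prints or formats anything; a CLI could call `as_json(errors_only)` or `as_csv(results)` depending on its options.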

@matentzn is about to start using this in anger with Anita and their experience will help feed into our design here.

Of note, the OAK implementation will also perform "repair" (coercion) operations. See RepairOperation. Not all constraint violations can be repaired, and it is not always desirable to do the repair: sometimes it is better not to follow Postel's law and instead "fail fast".
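One way to make the repair-versus-fail-fast choice explicit is an opt-in flag, sketched below. The `RepairOperation` dataclass and `check_integer_slot` function here are hypothetical and only share a name with the OAK class; they are not its API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RepairOperation:
    """Hypothetical record of an attempted repair; not the OAK class."""

    slot: str
    old_value: Any
    new_value: Any
    successful: bool


def check_integer_slot(record: dict, slot: str, repair: bool = False) -> Optional[RepairOperation]:
    """Check that record[slot] is an int; repair only when explicitly asked."""
    value = record[slot]
    if isinstance(value, int):
        return None  # nothing to report
    if not repair:
        # Fail fast: the caller did not ask for coercion.
        raise ValueError(f"{slot} must be an integer, got {value!r}")
    try:
        new_value = int(value)
    except (TypeError, ValueError):
        # Not every violation can be repaired; report the failed attempt.
        return RepairOperation(slot, value, value, successful=False)
    record[slot] = new_value
    return RepairOperation(slot, value, new_value, successful=True)


repairable = {"age": "33"}
op_ok = check_integer_slot(repairable, "age", repair=True)

unrepairable = {"age": "unknown"}
op_failed = check_integer_slot(unrepairable, "age", repair=True)

try:
    check_integer_slot({"age": "33"}, "age")  # default: no repair
    failed_fast = False
except ValueError:
    failed_fast = True
```

Because every repair is returned as an object rather than applied silently, the caller can log it, report it alongside validation results, or reject it.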

cmungall commented 2 years ago

APPROACH:

Given the above, what is our approach?

First, there should be only one 'place' to do validation. We should do one of the following:

  1. fold the linkml-validator repo into linkml, OR
  2. deprecate the existing linkml-validate command and have people install linkml-validator, OR
  3. refactor such that linkml-validator is a dependency of linkml, rather than vice versa

Approach 1 is probably quite easy for users, but has the downside that linkml-validator is not visible as a distinct product in itself. There is some value in having it as distinct with its own documentation, credit, evolution, ...

Approach 2 may be quite confusing for users. The dependency chain is currently:

linkml-validator --[depends on]--> linkml --[depends on]--> linkml_runtime

Currently we have a large number of repos that have been set up with standard templates that make linkml a developer dependency and runtime a runtime dependency. If these repos want to take advantage of validation then they need to add an extra dependency.

Approach 3 may be desirable, but I am not sure if it is possible. Validation inherently takes advantage of generators, which are in linkml. I suppose the core validator framework could be in its own repo, and plugins could depend on linkml.

Overall I am leaning towards 1, but we would need to make sure that @deepakunni3 is OK with this and that credits/history are sufficiently incorporated.
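As a rough illustration of the core-plus-plugins split mentioned for approach 3, the sketch below keeps the core free of any linkml import and pushes schema-specific logic into plugins. The names `ValidationPlugin`, `Validator`, and `RequiredSlotsPlugin` are hypothetical; this is not the existing linkml-validator API.

```python
from typing import Iterable, List, Protocol


class ValidationPlugin(Protocol):
    """Interface the core expects; plugins may live in other packages."""

    def process(self, instance: dict) -> Iterable[str]:
        ...


class Validator:
    """Core framework: runs plugins in order and knows nothing about generators."""

    def __init__(self, plugins: Iterable[ValidationPlugin]):
        self.plugins = list(plugins)

    def validate(self, instance: dict) -> List[str]:
        messages: List[str] = []
        for plugin in self.plugins:
            messages.extend(plugin.process(instance))
        return messages


class RequiredSlotsPlugin:
    """Example plugin. A real plugin could import linkml generators here,
    so only the plugin package, not the core, would depend on linkml."""

    def __init__(self, required: Iterable[str]):
        self.required = list(required)

    def process(self, instance: dict) -> Iterable[str]:
        return [f"missing required slot: {slot}"
                for slot in self.required if slot not in instance]


validator = Validator([RequiredSlotsPlugin(["id", "name"])])
messages = validator.validate({"id": "P:1"})
```

With this layout the dependency arrow from the core to linkml disappears, at the cost of a second package (the plugins) that users still have to install.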

matentzn commented 2 years ago

Ticket plan looks great.

deepakunni3 commented 2 years ago

This is a great issue to get up to speed on what has happened so far in terms of validation in linkml 🎉

@cmungall I agree with your comments above and have a few thoughts I wanted to highlight, as I try to recall the points that led me to write linkml-validator.

Validation Vocabulary

Recommendation: It would be great to explore and see if we can merge the validation schema in linkml and linkml-validator so that we accommodate a wide range of use-cases.

linkml-validate

Recommendation: It would be great if we can explore how to refactor the linkml.validators submodule. Ideally, the linkml-validator should be able to fit in this submodule.

Happy to take a stab at this in the coming weeks :)