Provide unified validation and repair/coerce behavior

cmungall commented 2 years ago

This is an "epic" ticket that attempts to unify a number of other separate issues and come up with a coherent solution

This first comment covers the background; the second comment, immediately following, covers the approach.

BACKGROUND:

Currently there are a number of different validation strategies. These are currently all implemented:

Use linkml-validate, which is a part of the main linkml distribution. This uses jsonschema validator behind the scenes
Use https://github.com/linkml/linkml-validator, written by @deepakunni3
Convert a schema to SPARQL queries and run these over an RDF representation of the data
Convert a schema to an external representation for framework X, and use X tools to validate
- Example: convert a schema to SHACL and use pyshacl
- Example: convert a schema to ShEx and use pyshex

There are a few issues with the existing validator (see #892, #891)

Additionally, when using the loader framework to instantiate dataclasses, validation is performed at the python level. Complicating the picture, the current dataclass generation framework creates classes that perform silent coercion to correct data. While this follows Postel's law it can be confusing, as it means that any validator that uses python objects as an intermediate will not detect the classes of error that are silently repaired/coerced/fixed.
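To illustrate the coercion problem, here is a minimal sketch. The `Person` class and `strict_type_errors` helper are hypothetical stand-ins written for this comment, not code emitted by the LinkML generators; the point is only that a check run on the instantiated object cannot see an error that was repaired during construction.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """Hypothetical stand-in for a generated dataclass (not real generator output)."""
    name: str
    age: int

    def __post_init__(self):
        # Silent coercion: a string such as "42" is quietly turned into an int.
        if not isinstance(self.age, int):
            self.age = int(self.age)


def strict_type_errors(data: dict) -> list:
    """Hypothetical strict check: report a type error instead of repairing it."""
    errors = []
    age = data.get("age")
    if not isinstance(age, int):
        errors.append(f"age: expected int, got {type(age).__name__}")
    return errors


raw = {"name": "Alice", "age": "42"}

# Checking the raw data catches the problem.
raw_errors = strict_type_errors(raw)

# Going through the python object first hides it: the value was already coerced.
person = Person(**raw)
object_errors = strict_type_errors({"name": person.name, "age": person.age})
```

Here `raw_errors` contains one message while `object_errors` is empty, which is the gap described above.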

@deepakunni3 wrote design principles for validation here https://linkml.io/linkml-validator/design/

These are great, I will regurgitate verbatim here:

Schema Agnostic: The validator must be schema agnostic and must not make any assumptions of the data outside of the scope of LinkML metamodel. i.e. The validator runs for any linkml schema
Plugin Architecture: Each type of validation is its own plugin. So any subsequent vendor-specific or technology-specific validation scenarios must be its own plugin. For example, JsonSchemaValidationPlugin performs JSONSchema validation on one or more objects. There are two types of plugins that are supported:
- in-house plugins: These are plugins that are defined in the linkml-validator repository
- external plugins: These are plugins that live outside of the linkml-validator repository. But one main key characteristic of these external plugins is that they implement the linkml-validator's BasePlugin abstract class. This is to ensure that the external plugin class is compatible and plays well with the validator and other plugins
Easy to Configure: Each plugin is instantiated with default arguments but these arguments can be overridden by providing the arguments at runtime
Flexible Generators: Plugins use default generators provided by LinkML. But plugins can also provide ways to accept custom generators. This ensures that if a project is using an extended version of a LinkML generator then it is possible with existing plugins
Parseable Validation Messages: The Validator returns easy to parse validation messages with a defined structure.
- Each time a validation is run on an object, the Validator returns a ValidationReport for that object
- Each ValidationReport has one (or more) ValidationResult, where each ValidationResult is from one (or more) plugin
- Each ValidationResult has one (or more) ValidationMessage (a structured message that describes the validation error)
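As a rough sketch of how these principles fit together: the class names `ValidationReport`, `ValidationResult`, `ValidationMessage` and `BasePlugin` come from the design document, but every field name, the `process` method, the `RequiredFieldsPlugin` example and the `validate` helper below are assumptions made for illustration and may not match the actual linkml-validator API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ValidationMessage:
    message: str            # structured description of one error
    field_name: str = ""    # assumed field; which part of the object is affected


@dataclass
class ValidationResult:
    plugin_name: str        # which plugin produced this result
    valid: bool
    validation_messages: List[ValidationMessage] = field(default_factory=list)


@dataclass
class ValidationReport:
    obj: Any                # the object that was validated
    validation_results: List[ValidationResult] = field(default_factory=list)

    @property
    def valid(self) -> bool:
        return all(r.valid for r in self.validation_results)


class BasePlugin(ABC):
    """Assumed shape of the abstract class that every plugin implements."""
    NAME = "BasePlugin"

    @abstractmethod
    def process(self, obj: dict) -> ValidationResult:
        ...


class RequiredFieldsPlugin(BasePlugin):
    """Hypothetical plugin: default arguments that can be overridden at runtime."""
    NAME = "RequiredFieldsPlugin"

    def __init__(self, required=("id",)):
        self.required = tuple(required)

    def process(self, obj: dict) -> ValidationResult:
        messages = [
            ValidationMessage(f"missing required field '{name}'", name)
            for name in self.required
            if name not in obj
        ]
        return ValidationResult(self.NAME, valid=not messages, validation_messages=messages)


def validate(obj: dict, plugins: List[BasePlugin]) -> ValidationReport:
    # One report per object, one result per plugin.
    return ValidationReport(obj, [plugin.process(obj) for plugin in plugins])


report = validate(
    {"name": "x"},
    [RequiredFieldsPlugin(), RequiredFieldsPlugin(required=("name",))],
)
```

An external plugin would subclass the same base class, so the runner never needs to know which backend (JSON Schema, SHACL, hand-written checks) produced a given result.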

I agree with all of these.

schema agnostic: definitely
plugin architecture: The plugin architecture is particularly important to avoid dependency bloat. It is impossible for there to be any single validator implementation as different usage contexts will drive different approaches. Perhaps my data is in a giant bigtable database. It will not be performant to export this to json or to convert into a giant in-memory object
- For the specifics of how the plugin arch would work, I think the existing linkml-validator repo is great. We are also wrestling with this in OAK: https://github.com/INCATools/ontology-access-kit/issues/171

Rather than "parseable validation messages" I would instead say that the validator should produce ValidationResult objects that conform to the standard validation data model. See validation.yaml. See also #368

The validation data model is not very well exposed at the moment, so I will describe it briefly here. It follows closely the SHACL validation shape model, and it uses the same vocabulary, e.g. http://www.w3.org/ns/shacl#MinCountConstraintComponent (see https://www.w3.org/TR/shacl/#core-components-shape).

OAK uses an extension of this datamodel that we should fold back into LinkML.

Although the OAK implementation is hardwired for sqlite and OMO, I think it is instructive to look at it and combine it with @deepakunni3's design. Of particular note is that the validator produces ValidationResult objects, which are typed with both severity and constraint violation type, and it is up to the wrapper tool/CLI to decide things such as how to serialize (csv vs yaml vs json vs rdf vs insert into a SQL database...) and which error or severity types to filter.
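A minimal sketch of that separation of concerns, assuming a simplified `ValidationResult` with `subject`, `type`, `severity` and `info` fields. The class, the `Severity` enum and the helper functions are illustrative and are not the OAK or LinkML datamodel; only the SHACL constraint component names are real vocabulary terms.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    ERROR = 3


@dataclass
class ValidationResult:
    subject: str        # identifier of the object that violated the constraint
    type: str           # constraint component, using the SHACL vocabulary
    severity: Severity
    info: str = ""


def filter_results(results, min_severity=Severity.WARNING, exclude_types=()):
    """The wrapper/CLI, not the validator, decides what gets reported."""
    return [
        r for r in results
        if r.severity >= min_severity and r.type not in exclude_types
    ]


def to_rows(results):
    # Replace the enum with its name so every serializer sees plain strings.
    return [{**asdict(r), "severity": r.severity.name} for r in results]


def to_json(results) -> str:
    return json.dumps(to_rows(results), indent=2)


def to_csv(results) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["subject", "type", "severity", "info"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(to_rows(results))
    return buffer.getvalue()


results = [
    ValidationResult("ex:p1", "sh:MinCountConstraintComponent", Severity.ERROR, "missing name"),
    ValidationResult("ex:p2", "sh:DatatypeConstraintComponent", Severity.WARNING, "age is a string"),
    ValidationResult("ex:p3", "sh:MaxCountConstraintComponent", Severity.INFO, "extra alias"),
]

errors_only = filter_results(results, min_severity=Severity.ERROR)
csv_text = to_csv(filter_results(results))
```

The validator only ever yields typed result objects; choosing csv over json, or dropping everything below ERROR, is a decision made one layer up.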

@matentzn is about to start using this in anger with Anita and their experience will help feed into our design here.

Of note, the OAK implementation will also perform "repair" coercion operations. See RepairOperation. Not all constraint violations can be repaired, and it is not always desirable to do the repair; sometimes it is better not to follow Postel's law and instead "fail fast".
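The trade-off between repairing and failing fast could look roughly like the sketch below. It does not reproduce OAK's RepairOperation; the `Violation` class, the `check` rules and the `repair` flag are all hypothetical and exist only to show that some violations carry a fix, others do not, and the caller chooses the mode.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


class ValidationError(Exception):
    """Raised in fail-fast mode as soon as any violation is found."""


@dataclass
class Violation:
    slot: str
    value: Any
    message: str
    # None means the violation cannot be repaired automatically.
    repair: Optional[Callable[[Any], Any]] = None


def check(obj: dict) -> List[Violation]:
    """Hypothetical rules: 'age' must be an int and 'id' must be present."""
    violations = []
    age = obj.get("age")
    if isinstance(age, str) and age.strip().isdigit():
        # Repairable: the string is an unambiguous integer.
        violations.append(Violation("age", age, "expected integer", repair=int))
    elif not isinstance(age, int):
        violations.append(Violation("age", age, "expected integer"))
    if "id" not in obj:
        # Not repairable: an identifier cannot be invented.
        violations.append(Violation("id", None, "missing required identifier"))
    return violations


def validate(obj: dict, repair: bool = False) -> Tuple[dict, List[Violation]]:
    violations = check(obj)
    if not repair:
        # Fail fast: do not follow Postel's law, report the first problem.
        if violations:
            first = violations[0]
            raise ValidationError(f"{first.slot}: {first.message}")
        return dict(obj), []
    fixed = dict(obj)
    unrepaired = []
    for violation in violations:
        if violation.repair is None:
            unrepaired.append(violation)
        else:
            fixed[violation.slot] = violation.repair(violation.value)
    return fixed, unrepaired


fixed, unrepaired = validate({"age": "42"}, repair=True)
```

With `repair=True` the age is coerced and the missing identifier is still reported; with the default, the same input raises `ValidationError` immediately.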

cmungall commented 2 years ago

APPROACH:

Given the above, what is our approach?

First, there should be only one 'place' to do validation. We should do one of the following:

1. fold the linkml-validator repo into linkml, OR
2. deprecate the existing linkml-validate command and have people install linkml-validator, OR
3. refactor such that linkml-validator is a dependency of linkml, rather than vice versa

Approach 1 is probably quite easy for users, but has the downside that linkml-validator is not visible as a distinct product in itself. There is some value in having it as distinct with its own documentation, credit, evolution, ...

Approach 2 may be quite confusing for users. The dependency chain is currently

linkml-validator --[depends on]--> linkml --[depends on]--> linkml_runtime

Currently we have a large number of repos that have been set up with standard templates that make linkml a developer dependency and runtime a runtime dependency. If these repos want to take advantage of validation then they need to add an extra dependency.

Approach 3 may be desirable but I am not sure if it's possible. Validation inherently takes advantage of generators which are in linkml. I suppose the core validator framework could be in its own repo, and plugins may depend on linkml.

Overall I am leaning towards 1, but we would need to make sure that @deepakunni3 is OK with this and that credits/history are sufficiently incorporated.

matentzn commented 2 years ago

Ticket plan looks great.

deepakunni3 commented 2 years ago

This is a great issue to get up to speed on what has happened so far in terms of validation in linkml 🎉

@cmungall I agree with your comments above and have a few thoughts I wanted to highlight, as I try to recall the points that led me to write linkml-validator.

Validation Vocabulary

The ValidationReport and ValidationResult objects in linkml-validator were inspired by the validation vocabulary from linkml. But it wasn't used as-is because it is catered to RDF and SHACL and thus uses keys/slots that are RDF specific.
- ValidationResult has subject, predicate, object slots which are not relevant when you are validating with JSONSchema. One could always force the JSONSchema reports into this pattern, but it may be confusing to the typical developer who may not see data as a graph with S-P-O triples
The validation model is not flexible for generic messages that may not fit the given model
- In linkml-validator we have ValidationResult, where each ValidationResult has one or more ValidationMessage. Each ValidationMessage is a unit of error specific to an object. In this way, a ValidationResult is a collection of validation messages from a validation operation (JSONSchema validation, SHACL validation, custom hand-coded validation). This may not be perfect, but it allows for supporting an ever-growing list of use-cases, including ones that may not necessarily be in the scope of linkml.
I also see that there is an OAK extension to the Linkml validation vocabulary
- We should collect the requirements and see how all three can be merged/folded into a generic and permissive validation model

Recommendation: It would be great to explore and see if we can merge the validation schema in linkml and linkml-validator so that we accommodate a wide range of use-cases.

linkml-validate

The linkml-validate cli utility points to linkml.validators.jsonschemavalidator:cli; there are also linkml-jsonschema-validate and linkml-sparql-validate
linkml-validate is limited when it comes to validating large amounts of JSON data
The results are not parseable since it's just providing the exact errors that pyjsonschema is returning

Recommendation: It would be great if we can explore how to refactor the linkml.validators submodule. Ideally, the linkml-validator should be able to fit in this submodule.

Happy to take a stab at this in the coming weeks :)

linkml / linkml

Provide unified validation and repair/coerce behavior #911