linkml / linkml

Linked Open Data Modeling Language
https://linkml.io/linkml

incorrect linkml-validation error message #2146

Open aclum opened 3 months ago

aclum commented 3 months ago

Describe the bug
The error message reports a problem with the wrong slot.

To reproduce
Steps to reproduce the behavior:
poetry run linkml-validate -s ../../../project/nmdc_materialized_patterns.yaml Database-Extraction-extraction_method-slot-retired.yaml

where the materialized pattern version of the schema comes from https://github.com/microbiomedata/berkeley-schema-fy24/tree/2046-Database-slot-updates and Database-Extraction-extraction_method-slot-retired.yaml is

material_processing_set:
  - id: nmdc:extrp-99-abcdef
    type: nmdc:Extraction
    name: DNA extraction of NEON sample WREF_072-O-20190618-COMP
    description: DNA extraction of NEON sample WREF_072-O-20190618-COMP using SOP BMI_dnaExtractionSOP_v7
    has_input:
      - nmdc:bsm-11-24vb2d
    has_output:
      - nmdc:procsm-11-sdt3
    processing_institution: Battelle
    protocol_link:
      type: nmdc:Protocol
      name: BMI_dnaExtractionSOP_v7
      url: https://data.neonscience.org/documents/10179/2431540/BMI_dnaExtractionSOP_v7/61204962-bb01-a0b9-3354-ccdaab5132c3
    start_date: "2019-11-08"
    end_date: "2019-11-08"
    qc_status: pass
    extraction_method: phenol/chloroform extraction # not allowed anymore
    extraction_target: DNA

The test fails, but for the wrong reason: it complains about the length of has_input.

Expected behavior
The error should say that 'extraction_method' doesn't exist, or give a generic error, instead of saying a valid slot is invalid.

Additional context
cc @turbomam

pkalita-lbl commented 3 months ago

This is a bit of a brain dump of what's going on here.

  1. In the schema the material_processing_set slot has a range of MaterialProcessing, and MaterialProcessing is an abstract class with a number of subclasses (9 of them by my count) including ones called Pooling and Extraction.
  2. At the JSON Schema level, this translates into a subschema that looks like this:
    "material_processing_set": {
        "description": "This property links a database object to the set of material processing within it.",
        "items": {
            "anyOf": [
                {
                    "$ref": "#/$defs/Pooling"
                },
                {
                    "$ref": "#/$defs/Extraction"
                },
                {
                    "$ref": "#/$defs/LibraryPreparation"
                },
                {
                    "$ref": "#/$defs/SubSamplingProcess"
                },
                {
                    "$ref": "#/$defs/MixingProcess"
                },
                {
                    "$ref": "#/$defs/FiltrationProcess"
                },
                {
                    "$ref": "#/$defs/ChromatographicSeparationProcess"
                },
                {
                    "$ref": "#/$defs/DissolvingProcess"
                },
                {
                    "$ref": "#/$defs/ChemicalConversionProcess"
                }
            ]
        },
        "type": "array"
    },
  3. Because the data instance provided is (by design) invalid, the JSON Schema implementation we use under the hood needs to iterate through each of the subschemas in anyOf and verify that the instance is not valid under each of them. That is indeed what happens. We end up with a whole pile of errors indicating why the instance isn't valid under each anyOf subschema. For example, it rules out the third subschema ("$ref": "#/$defs/LibraryPreparation") because of the id slot (String 'nmdc:extrp-99-abcdef' does not match regex pattern '^(nmdc):libprp-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'.). Here is the whole pile of reasons why the instance isn't valid:
    
Message: String 'nmdc:extrp-99-abcdef' does not match regex pattern '^(nmdc):chcpr-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. Schema path: https://w3id.org/nmdc/nmdc#/$defs/ChemicalConversionProcess/properties/id/pattern

Message: String 'nmdc:extrp-99-abcdef' does not match regex pattern '^(nmdc):dispro-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. Schema path: https://w3id.org/nmdc/nmdc#/$defs/DissolvingProcess/properties/id/pattern

Message: String 'nmdc:extrp-99-abcdef' does not match regex pattern '^(nmdc):cspro-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. Schema path: https://w3id.org/nmdc/nmdc#/$defs/ChromatographicSeparationProcess/properties/id/pattern

Message: String 'nmdc:extrp-99-abcdef' does not match regex pattern '^(nmdc):filtpr-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. Schema path: https://w3id.org/nmdc/nmdc#/$defs/FiltrationProcess/properties/id/pattern

Message: String 'nmdc:extrp-99-abcdef' does not match regex pattern '^(nmdc):mixpro-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. Schema path: https://w3id.org/nmdc/nmdc#/$defs/MixingProcess/properties/id/pattern

Message: String 'nmdc:extrp-99-abcdef' does not match regex pattern '^(nmdc):subspr-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. Schema path: https://w3id.org/nmdc/nmdc#/$defs/SubSamplingProcess/properties/id/pattern

Message: String 'nmdc:extrp-99-abcdef' does not match regex pattern '^(nmdc):libprp-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. Schema path: https://w3id.org/nmdc/nmdc#/$defs/LibraryPreparation/properties/id/pattern

Message: String 'nmdc:extrp-99-abcdef' does not match regex pattern '^(nmdc):poolp-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. Schema path: https://w3id.org/nmdc/nmdc#/$defs/Pooling/properties/id/pattern

Message: Value "nmdc:Extraction" is not defined in enum. Schema path: https://w3id.org/nmdc/nmdc#/$defs/ChemicalConversionProcess/properties/type/enum

Message: Value "nmdc:Extraction" is not defined in enum. Schema path: https://w3id.org/nmdc/nmdc#/$defs/DissolvingProcess/properties/type/enum

Message: Value "nmdc:Extraction" is not defined in enum. Schema path: https://w3id.org/nmdc/nmdc#/$defs/ChromatographicSeparationProcess/properties/type/enum

Message: Value "nmdc:Extraction" is not defined in enum. Schema path: https://w3id.org/nmdc/nmdc#/$defs/FiltrationProcess/properties/type/enum

Message: Value "nmdc:Extraction" is not defined in enum. Schema path: https://w3id.org/nmdc/nmdc#/$defs/MixingProcess/properties/type/enum

Message: Value "nmdc:Extraction" is not defined in enum. Schema path: https://w3id.org/nmdc/nmdc#/$defs/SubSamplingProcess/properties/type/enum

Message: Value "nmdc:Extraction" is not defined in enum. Schema path: https://w3id.org/nmdc/nmdc#/$defs/LibraryPreparation/properties/type/enum

Message: Value "nmdc:Extraction" is not defined in enum. Schema path: https://w3id.org/nmdc/nmdc#/$defs/Pooling/properties/type/enum

Message: Array item count 1 is less than minimum count of 2. Schema path: https://w3id.org/nmdc/nmdc#/$defs/Pooling/properties/has_input/minItems

Message: Property 'extraction_method' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/ChemicalConversionProcess/additionalProperties

Message: Property 'extraction_method' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/DissolvingProcess/additionalProperties

Message: Property 'extraction_method' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/ChromatographicSeparationProcess/additionalProperties

Message: Property 'extraction_method' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/FiltrationProcess/additionalProperties

Message: Property 'extraction_method' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/MixingProcess/additionalProperties

Message: Property 'extraction_method' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/SubSamplingProcess/additionalProperties

Message: Property 'extraction_method' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/LibraryPreparation/additionalProperties

Message: Property 'extraction_method' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/Extraction/additionalProperties

Message: Property 'extraction_method' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/Pooling/additionalProperties

Message: Property 'extraction_target' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/ChemicalConversionProcess/additionalProperties

Message: Property 'extraction_target' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/DissolvingProcess/additionalProperties

Message: Property 'extraction_target' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/ChromatographicSeparationProcess/additionalProperties

Message: Property 'extraction_target' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/FiltrationProcess/additionalProperties

Message: Property 'extraction_target' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/MixingProcess/additionalProperties

Message: Property 'extraction_target' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/SubSamplingProcess/additionalProperties

Message: Property 'extraction_target' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/LibraryPreparation/additionalProperties

Message: Property 'extraction_target' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/Extraction/additionalProperties

Message: Property 'extraction_target' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/Pooling/additionalProperties

4. So what's a poor JSON Schema validator to do? Show all those messages to the user and let them sort it out? That's a bit cruel to the user. So we [call](https://github.com/linkml/linkml/blob/901cbf845b725388a53bbef0f465e7ae0bbd0f52/linkml/validator/plugins/jsonschema_validation_plugin.py#L50) the [utility function](https://python-jsonschema.readthedocs.io/en/stable/api/jsonschema/exceptions/#jsonschema.exceptions.best_match) provided by the JSON Schema implementation to isolate _what it considers_ to be the most relevant error, based on its heuristics. And in this case, it deems that this one is the most specific:

Message: Array item count 1 is less than minimum count of 2. Schema path: https://w3id.org/nmdc/nmdc#/$defs/Pooling/properties/has_input/minItems

5. You can see the error that @aclum was looking for in the pile, but unfortunately it wasn't deemed to be the most specific one:

Message: Property 'extraction_method' has not been defined and the schema does not allow additional properties. Schema path: https://w3id.org/nmdc/nmdc#/$defs/Extraction/additionalProperties
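One way to make sense of the pile is to group it by anyOf branch. The following is a minimal sketch in plain Python, not part of linkml or the JSON Schema library. It rebuilds only the schema paths listed above and counts how many errors each branch produced. The helper names `build_pile`, `branch_of`, and `group_by_branch` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

BASE = "https://w3id.org/nmdc/nmdc#/$defs/"

# The eight anyOf branches other than Extraction, as listed in the subschema above.
OTHER_CLASSES = [
    "Pooling",
    "LibraryPreparation",
    "SubSamplingProcess",
    "MixingProcess",
    "FiltrationProcess",
    "ChromatographicSeparationProcess",
    "DissolvingProcess",
    "ChemicalConversionProcess",
]


def build_pile():
    """Rebuild the schema paths of the errors quoted in this thread."""
    paths = []
    for cls in OTHER_CLASSES:
        paths.append(f"{BASE}{cls}/properties/id/pattern")  # id prefix mismatch
        paths.append(f"{BASE}{cls}/properties/type/enum")   # type is nmdc:Extraction
    paths.append(f"{BASE}Pooling/properties/has_input/minItems")
    for cls in OTHER_CLASSES + ["Extraction"]:
        paths.append(f"{BASE}{cls}/additionalProperties")   # extraction_method
        paths.append(f"{BASE}{cls}/additionalProperties")   # extraction_target
    return paths


def branch_of(schema_path):
    """Return the class name that follows '/$defs/' in a schema path."""
    return schema_path.split("/$defs/", 1)[1].split("/", 1)[0]


def group_by_branch(paths):
    """Map each anyOf branch (class name) to the schema paths of its errors."""
    groups = defaultdict(list)
    for path in paths:
        groups[branch_of(path)].append(path)
    return dict(groups)


counts = {branch: len(paths) for branch, paths in group_by_branch(build_pile()).items()}
for branch, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{branch}: {n} error(s)")
```

In this pile the Extraction branch has the fewest errors (2, versus 4 or 5 for the other branches). Whether "fewest errors per branch" would be a reliable ranking heuristic in general is an open question.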



So on one hand we have our current approach of "attempt to sift out the best error message and present that to the user". On the other hand you could imagine an option that's like "show me the full pile of errors and I'll sort it out" -- could be useful for debugging. I don't know if there's any clever middle ground between those two. I'll have to think about it more, but it's hard to imagine how we could have surfaced the _one_ error message that @aclum wanted to see in this case.

aclum commented 3 months ago

Is there a way to tell linkml-validate to use the value of the type slot within an individual record to pick the most relevant error if the Class is Database?

pkalita-lbl commented 3 months ago

One of the things I was thinking about when I wrote "clever middle ground" in my last message was whether we could use values from a slot with designates_type: true to narrow down the list of relevant error messages. I think that's something like what you're suggesting. But I haven't really dug into the code enough to see how feasible that is.
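A rough sketch of that idea, in plain Python. It assumes that the type designator slot is called `type`, that its value is a CURIE whose local part equals the class name under `$defs`, and that the errors are available as plain dicts with a `schema_path` key. The function `narrow_by_type` is hypothetical and is not an existing linkml or jsonschema API.

```python
def narrow_by_type(instance, errors, type_slot="type"):
    """Keep only the errors reported under the branch named by the type slot.

    Falls back to the full list when the instance has no usable type value
    or when no error belongs to that branch.
    """
    type_value = instance.get(type_slot)
    if not isinstance(type_value, str):
        return list(errors)
    # Assumption: "nmdc:Extraction" corresponds to "#/$defs/Extraction".
    class_name = type_value.split(":", 1)[-1]
    marker = f"/$defs/{class_name}/"
    narrowed = [e for e in errors if marker in e["schema_path"]]
    return narrowed or list(errors)


# Three of the errors quoted earlier in this thread.
errors = [
    {
        "message": "Array item count 1 is less than minimum count of 2.",
        "schema_path": "https://w3id.org/nmdc/nmdc#/$defs/Pooling/properties/has_input/minItems",
    },
    {
        "message": "Property 'extraction_method' has not been defined and the schema does not allow additional properties.",
        "schema_path": "https://w3id.org/nmdc/nmdc#/$defs/Extraction/additionalProperties",
    },
    {
        "message": "Value \"nmdc:Extraction\" is not defined in enum.",
        "schema_path": "https://w3id.org/nmdc/nmdc#/$defs/LibraryPreparation/properties/type/enum",
    },
]

record = {"id": "nmdc:extrp-99-abcdef", "type": "nmdc:Extraction"}
for error in narrow_by_type(record, errors):
    print(error["message"])
```

With these inputs only the Extraction error about 'extraction_method' survives the filter. A real implementation would need to read the designates_type slot from the schema rather than hard-coding the slot name.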