shexSpec / shex

ShEx language issues, including new features for e.g. ShEx2.1

How to process/generate ShEx reports #115

Open andrawaag opened 3 years ago

andrawaag commented 3 years ago

I am applying ShEx validation at scale to Wikidata items, but I am struggling to aggregate the results into a sensible report, and I am looking for best practices here. This is the approach I have followed so far, which works for this specific use case.

Take for example the following use case:

I have developed a script using WikidataIntegrator and PyShEx to do the validation. These are the steps:

The current aggregated report is sufficient for its task, i.e. it shows where the issues are. But getting there requires some suboptimal string parsing of the output and some arbitrary clustering of error types.

```
{'No matching triples found for predicate p:P2888': 6608,
 'No matching triples found for predicate ps:P279': 6217,
 '2 triples exceeds max {1,1}': 3666,
 'No matching triples found for predicate prov:wasDerivedFrom': 2632,
 '{"values": ["http://www.wikidata.org/entity/Q5282129"], "typ...': 534,
 '{"values": ["http://www.wikidata.org/entity/Q27468140"], "ty...': 1304,
 'No matching triples found for predicate pr:P699': 772,
 '3 triples exceeds max {1,1}': 9,
 'No matching triples found for predicate pr:P5270': 1}
```
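For what it's worth, the clustering step can be made less arbitrary by normalizing the variable parts of each message before counting. This is a minimal sketch, not part of any validator: the `normalize` function and its regexes are hypothetical, and only the sample messages are taken from the report above.

```python
import re
from collections import Counter

def normalize(msg: str) -> str:
    """Collapse messages that differ only in predicate name or triple count."""
    msg = re.sub(r"predicate \S+", "predicate <pred>", msg)
    msg = re.sub(r"^\d+ triples", "N triples", msg)
    return msg

messages = [
    "No matching triples found for predicate p:P2888",
    "No matching triples found for predicate ps:P279",
    "2 triples exceeds max {1,1}",
    "3 triples exceeds max {1,1}",
]
report = Counter(normalize(m) for m in messages)
print(dict(report))
# {'No matching triples found for predicate <pred>': 2, 'N triples exceeds max {1,1}': 2}
```

With this kind of normalization the report keys become error *types* rather than raw strings, which is closer to the aggregation I am after.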

I am looking for:

  1. suggestions to improve the pipeline, or alternatives to it
  2. a standard output format for ShEx validation pipelines from which reports can be generated. For example, could there be a finite set of error types, e.g. "no matching triples", "cardinality issue", etc.?
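To illustrate point 2, here is a hypothetical taxonomy: the message patterns are taken from the report above, but the category names and the classifier itself are only an illustration of what a finite error-type set could look like, not anything a current validator emits.

```python
from enum import Enum, auto

class ErrorType(Enum):
    MISSING_PREDICATE = auto()  # "No matching triples found for predicate ..."
    CARDINALITY = auto()        # "... exceeds max {m,n}"
    VALUE_SET = auto()          # '{"values": [...], ...}' style failures
    OTHER = auto()              # anything the taxonomy does not yet cover

def classify(msg: str) -> ErrorType:
    """Map a raw validator message onto the closed set of error types."""
    if msg.startswith("No matching triples found"):
        return ErrorType.MISSING_PREDICATE
    if "exceeds max" in msg:
        return ErrorType.CARDINALITY
    if msg.startswith('{"values"'):
        return ErrorType.VALUE_SET
    return ErrorType.OTHER
```

If implementations agreed on even a small closed set like this, report generation would no longer depend on parsing implementation-specific strings.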
andrawaag commented 3 years ago

@hsolbrig @ericprud @labra If I am not mistaken, you have each implemented outputs detailing the possible errors in a ShEx validation. Is there a finite number of possible errors? Would you mind listing the types of errors reported by your implementations?

goodb commented 3 years ago

Hi @andrawaag, when faced with a similar situation for the Gene Ontology folks, I opened up the Java ShEx implementation so I could access failure information directly rather than parsing output files. You might have more luck that way, as the information you need is by definition in there somewhere.

For that project, I returned (1) the node(s) that failed and (2) the properties that disagreed with the schema applied. I think this was enough to make a start on human-readable error reporting, but maybe touch base with team GO to see how that is going now. The hard part we haven't really tackled yet is the error cascade, e.g. when one node fails and that failure causes another node to fail, and so on. In those models, which are pretty small, a human can usually find the root of a problem, but this will be an issue for larger models.
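The error-cascade problem can at least be sketched abstractly. Assuming a (purely hypothetical) map from each failed node to the failed neighbours its failure depends on, the root causes are the failures not explained by any other failure:

```python
def root_failures(caused_by: dict[str, set[str]]) -> set[str]:
    """Return the failed nodes whose failure is not caused by another failed node."""
    failed = set(caused_by)
    return {node for node, deps in caused_by.items() if not (deps & failed)}

# Illustrative cascade: nodeA fails on its own, dragging nodeB and nodeC down.
cascade = {
    "nodeA": set(),
    "nodeB": {"nodeA"},
    "nodeC": {"nodeB"},
}
print(root_failures(cascade))  # {'nodeA'}
```

The hard part in practice is of course building that dependency map from the validator's failure output in the first place.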

Work was related to this issue: https://github.com/geneontology/minerva/issues/212