ontodev / robot

ROBOT is an OBO Tool
http://robot.obolibrary.org
BSD 3-Clause "New" or "Revised" License

New Operation: report #205

Closed by cmungall 6 years ago

cmungall commented 6 years ago

ROBOT performs logical checks using reason and can run a custom suite of SPARQL queries using verify. Some repos successfully use this in combination with Makefiles and a custom suite of queries to implement robust checks in their pipeline (e.g. GO; and also OSK).

We should make it easier for groups to do this out of the box. While customization is good, ROBOT should still provide an OBO-ish core set of queries and reports that could be used by OBO central for evaluation purposes. The groups could still customize and say for example "allow some classes to lack definitions".

In GO we are still very reliant on this script, it depends on obo format but it is a good model for a set of potential standard checks: https://github.com/geneontology/go-ontology/blob/master/src/util/check-obo-for-standard-release.pl

This could be implemented by ROBOT distributing some SPARQL queries in its resources folder (or a standard OBO place to fetch them) and/or procedural code (sometimes easier than SPARQL).

cmungall commented 6 years ago

Proposed name for this: report

the output product could be called a "report card"

cmungall commented 6 years ago

Should this command also produce standardized exports of things like tables of all classes and their labels, or is this overloading? Would that belong in convert?

jamesaoverton commented 6 years ago

I've used SPARQL for this. It's important to distinguish internal from external (imported) terms.

I'm fine with including this in report.

cmungall commented 6 years ago

Some suggestions on potential output. For checks, these could be treated as informative for now. Working with obo-ops and the OBO community, we could come up with varying levels of stringency. For some we can imagine different profiles, with the profile (e.g. obo-basic) declared in the header.

general reporting

checks

https://github.com/geneontology/go-ontology/blob/master/src/util/check-obo-for-standard-release.pl

cmungall commented 6 years ago

Currently the GO reports give a lot of reference violations deep in the import chain. We know this is the case, so we would rather have an option to suppress reporting reference violations. @dougli1sqrd

cmungall commented 6 years ago

An example of how this is working (all on #231 at the moment).

Run report on an ontology:

$ robot report -i go-edit.obo -o report.yaml
ERROR REPORT FAILED! Violations: 18016
ERROR Severity 1 violations: 18006
ERROR Severity 2 violations: 10
ERROR Severity 3 violations: 0
ERROR Severity 4 violations: 0
ERROR Severity 5 violations: 0

@jamesaoverton - so far -o always means output ontology. Should we use a different arg, or stick with unix convention?

Here is a sample of the report. We use the JSON subset of YAML, so JSON is trivial too if you want the verbosity. It should also be easy to make nice HTML or Markdown from this, but that is a separate concern.

The report is in sections; first, reference issues:

problemsReport:
  invalidReferenceViolations:
  - axiom: "SubClassOf(<http://purl.obolibrary.org/obo/UBERON_0001343> ObjectSomeValuesFrom(<http://purl.obolibrary.org/obo/BFO_0000067>\
      \ <http://purl.obolibrary.org/obo/GO_0007126>))"
    referencedObject: "<http://purl.obolibrary.org/obo/GO_0007126>"
    category: "DEPRECATED"
    type: "reference violation"
    description: "SubClassOf(<http://purl.obolibrary.org/obo/UBERON_0001343> ObjectSomeValuesFrom(<http://purl.obolibrary.org/obo/BFO_0000067>\
      \ <http://purl.obolibrary.org/obo/GO_0007126>))"
    severity: 2
  - axiom: "DisjointUnion(Annotation(rdfs:comment \"Every biological process is either\
      \ a single-organism process or a multi-organism, but never both (GOC:mtg_obo_owl_Jan2013)\"\
      ) <http://purl.obolibrary.org/obo/GO_0008150> <http://purl.obolibrary.org/obo/GO_0044699>\
      \ <http://purl.obolibrary.org/obo/GO_0051704> )"
    referencedObject: "<http://purl.obolibrary.org/obo/GO_0044699>"
    category: "DEPRECATED"
    type: "reference violation"
    description: "DisjointUnion(Annotation(rdfs:comment \"Every biological process\
      \ is either a single-organism process or a multi-organism, but never both (GOC:mtg_obo_owl_Jan2013)\"\
      ) <http://purl.obolibrary.org/obo/GO_0008150> <http://purl.obolibrary.org/obo/GO_0044699>\
      \ <http://purl.obolibrary.org/obo/GO_0051704> )"
    severity: 2

Next are ontology header issues. Bad GO! Lots of stuff is missing from the header. Note that ontology header checks are nice to do in SPARQL, since we can run the same query across all ontologies in the triplestore, but a bit of redundancy is OK here for the user's convenience:

  ontologyMetadataViolations:
  - description: "cardinality of dc:creator"
    severity: 1
    type: "ontology metadata violation"
  - description: "cardinality of dc:title"
    severity: 1
    type: "ontology metadata violation"
  - description: "cardinality of dc:description"
    severity: 1
    type: "ontology metadata violation"

then class metadata:

  classMetadataViolations:
  - description: "|<http://purl.obolibrary.org/obo/IAO_0000117>|=0 LESS_THAN 1"
    severity: 1
    cls: "<http://purl.obolibrary.org/obo/CHEBI_60941>"
    type: "class metadata violation"
...
  - description: "|rdfs:label|=0 LESS_THAN 1"
    severity: 1
    cls: "<http://purl.obolibrary.org/obo/GO_0005860>"
    type: "class metadata violation"
...

I think these are false positives; these should instead be reported as dangling classes or deprecated classes.

The YAML directly mirrors the simple POJO we have for the ReportCard; see the Java for details.
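Since the report card is emitted in the JSON subset of YAML, downstream tooling can consume it with a plain JSON parser. A minimal sketch of summarizing violations by severity — the section and field names follow the sample above, but the summarizing function itself is hypothetical, not part of ROBOT:

```python
import json
from collections import Counter

# A tiny report card in the JSON subset of YAML, shaped like the sample above.
report_json = """
{"problemsReport": {
   "invalidReferenceViolations": [
     {"referencedObject": "<http://purl.obolibrary.org/obo/GO_0007126>",
      "category": "DEPRECATED", "type": "reference violation", "severity": 2}],
   "ontologyMetadataViolations": [
     {"description": "cardinality of dc:title",
      "type": "ontology metadata violation", "severity": 1}]}}
"""

def count_by_severity(report):
    """Count violations per severity level across all report sections."""
    counts = Counter()
    for violations in report["problemsReport"].values():
        for violation in violations:
            counts[violation["severity"]] += 1
    return dict(counts)

report = json.loads(report_json)
print(count_by_severity(report))  # {2: 1, 1: 1}
```

The same loop would drive an HTML or Markdown renderer, since every section is just a list of violation records.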

jamesaoverton commented 6 years ago

Regarding -o: We're using -O for --output-dir in query.

The key question is chaining, I think. Do you (the user) want to run a chain of commands ending with report, so that report also saves the ontology? If we reuse -o then that won't work.

I think an -r/--report option might be better, but I don't feel strongly about it. We already have larger inconsistencies than this.

jamesaoverton commented 6 years ago

I tried this on OBI Core. It ran quickly and seemed to work correctly. Some thoughts and comments:

Having a machine-readable report is good. Maybe we can have a simple "ignore" sheet.
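A sketch of how such an "ignore" sheet might work, assuming a simple TSV of (rule, subject) pairs — the rule name, column layout, and filtering code here are invented for illustration, not part of ROBOT:

```python
import csv
import io

# Hypothetical ignore sheet: one (rule, subject IRI) pair per row, as TSV.
ignore_tsv = "missing-definition\t<http://purl.obolibrary.org/obo/BFO_0000001>\n"
ignored = {tuple(row) for row in csv.reader(io.StringIO(ignore_tsv), delimiter="\t")}

# Violations as the reporter might emit them (rule names are invented).
violations = [
    {"rule": "missing-definition", "cls": "<http://purl.obolibrary.org/obo/BFO_0000001>"},
    {"rule": "missing-definition", "cls": "<http://purl.obolibrary.org/obo/OBI_0000011>"},
]

# Drop any violation explicitly listed on the ignore sheet.
kept = [v for v in violations if (v["rule"], v["cls"]) not in ignored]
print(len(kept))  # 1
```

Keeping the sheet as a flat text file means it can live in the ontology repo and be reviewed in PRs like any other artifact.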

If we use IRIs everywhere, then only a few of us will be able to just read the reports without another layer.

I got items like this:

- description: "|<http://purl.obolibrary.org/obo/IAO_0000115>|=0 LESS_THAN 1"
  severity: 1
  cls: "<http://purl.obolibrary.org/obo/BFO_0000001>"
  type: "class metadata violation"

It's true that BFO entity does not have a definition, but that's not my fault -- it only has an elucidation.

I think that each rule should have a URL and documentation that explains what the problem is and how to fix it.

jamesaoverton commented 6 years ago

At the top of the issue Chris presented two alternatives: declarative and procedural. PR #231 takes a procedural approach, with the rules written in Java. I'd prefer a declarative approach using SPARQL.

Things like InvalidReferenceChecker require Java, and that's fine, but the metadata and cardinality checks in #231 should work fine with SPARQL. When SPARQL gets ugly, we can consider small extensions: https://jena.apache.org/documentation/query/extension.html

I have a bunch of reasons for preferring a declarative approach, but perhaps the most persuasive is that we want a low barrier of entry for reading and writing rules, so that we can get as many developers as possible using and contributing to the rule set. I think that more of our developers can work with SPARQL than Java. Another consideration is that SPARQL is much easier to re-use in other contexts than the Java in #231.

My biggest worry is that SPARQL performance will be much worse than Java performance.

@cmungall: Is it OK for @rctauber and me to build a SPARQL-based alternative to #231 so that we can compare the two approaches? I think we can do it this week.

cmungall commented 6 years ago

Simple cardinality checks should work fine in SPARQL. Some of the logic may depend on configuration, which is awkward. E.g. if the ontology is in the GO lineage, the definition axioms should be annotated; if in the OBI lineage, then a definition_editor field should be present. I suppose we can encourage this configuration to go as metadata in the ontology header so it is introspectable in the query, but I'd like this to work out of the box for all ontologies.

@dougli1sqrd and I started out going a pure SPARQL route for everything, but things like the obsolete reference checks are very difficult. Using extensions seems to defeat the purpose. SHACL/Shex may be promising for some kinds of checks but this doesn't seem advanced enough at this stage.

I agree that in principle SPARQL is more declarative; I'm not totally sure that we'd attract more developers. I know @drseb and @pnrobinson, for example, have already written lots of Java code for procedural checks of HPO and are comfortable with this.

Anyway I think it's OK to have an alternative for a subset of checks

cmungall commented 6 years ago

> I tried this on OBI Core.

IRIs: yes, I'm being lazy and using generic Jackson pass-throughs to turn the Java objects into strings for now. None of my users would use this without labels.

We also need to make sure that MIREOTed entities are not checked; currently it assumes a particular import strategy. The logic for what to check and when may get quite complex. As non-declarative as Java is, having the ability to define reusable methods/subroutines will help a lot here...

cmungall commented 6 years ago

We should also look at OOPS

http://oops.linkeddata.es/

But many of these patterns are under- or over-specified for us. E.g.:

  • Most OBOs don't include instances; these are assigned separately
  • We have a principled check for this
  • We should explicitly check for OBO naming conventions
  • Oops, every OBO fails this!
  • We agree with this one!

cmungall commented 6 years ago

@jamesaoverton @rctauber @dougli1sqrd @kltm in terms of documenting the checks what about

option A

  1. one yaml file per check
  2. yaml lives in robot-core resources (alternate: obo repo)
  3. yaml includes
    • default severity (but this may vary by profile)
    • a SPARQL query (optional: some best done in code)
    • mandatory detailed description of check
  4. numeric ID for each check as in OOPS

option B

  1. one yamldown file per check
  2. lives in obo repo (rendered automatically by obo jekyll framework)
  3. embedded yaml block with metadata about check
  4. markdown formatted description of check
  5. numeric ID

The advantage of B is that we can treat this more as documentation, include sections with headers, examples, etc. This would be very much like the gorules system https://github.com/geneontology/go-site/tree/master/metadata/rules

In either case we could have PURLs for the rules so they could be embedded in RDF metadata descriptions of results, SPARQL constructs etc

kltm commented 6 years ago

As I was tagged: I have a great preference for "A", as it is an easier-to-parse and more standard format, and document extraction could still trivially be done on a single field. I am not a fan of the GO rules setup.

jamesaoverton commented 6 years ago

I agree that we need all that metadata, that it should be structured, and that rules should have PURLs.

I'm not sure about the file format. In a previous project I used plain SPARQL (and SQL) files for validation rules, and I put the structured metadata (which might as well be YAML) and a longer description as a comment at the top. So the focus was on having a file that you could just run as a query. Call that option C. @rctauber is working on some code to compare with #231, as we discussed above, and it will work this way.
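Under option C, a check file is a runnable .rq query whose leading comment lines carry the structured metadata, so only a few lines of code are needed to recover it. A sketch of such a parser — the key names are illustrative, and the function is hypothetical rather than the actual implementation:

```python
def parse_rq_metadata(text):
    """Collect 'key: value' pairs from the leading # comment lines of a SPARQL file."""
    meta = {}
    for line in text.splitlines():
        if not line.startswith("#"):
            break  # the metadata header ends at the first non-comment line
        body = line.lstrip("#").strip()
        if ":" in body:
            key, _, value = body.partition(":")
            meta[key.strip()] = value.strip()
    return meta

# A check file that is also a runnable query (key names are illustrative).
rq = """# title: definition cardinality
# severity: 1
SELECT ?entity WHERE { ?entity ?p ?o }
"""
print(parse_rq_metadata(rq))  # {'title': 'definition cardinality', 'severity': '1'}
```

The appeal of this layout is that the file needs no unwrapping before being handed to a SPARQL engine; the metadata is invisible to the query parser.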

I don't have a strong preference among A,B,C at the moment. It will be easy enough to convert any of them to nice HTML pages.

cmungall commented 6 years ago

The decision between A/B/C in part depends on what the goal is here. Is it providing a specification or an implementation+metadata? What I think we need here is a specification. Not everything can be implemented in SPARQL, and we may want different implementations for different contexts. Even with trivial SPARQL, the query may have to be modified e.g. to run on a multi-ontology triplestore.

For a specification, things like formatting and ease of editing/PRs by a variety of stakeholders of varying degrees of technical expertise are of utmost importance. Requiring a microlibrary for parsing yamldown seems like a minimal imposition on developers. However, editing SPARQL inside any kind of embedding is super annoying. The same is true, to some extent, for any complex nested YAML, but I think this would be fairly minimal, flat YAML.

Another option would be decoupling: manage each check as a folder/dir, each minimally containing a markdown description file and a metadata file:

checks/
  001/
    index.md
    meta.yaml
    default.rq
    alt.rq

This could be seamlessly dropped into the obo jekyll framework but comes with its own annoyances.

jamesaoverton commented 6 years ago

@cmungall I definitely had implementation+metadata in mind, not just a specification. That's part of the reason I've been pushing for SPARQL over Java, because SPARQL can be used in more contexts. I could be convinced that we need a layer of specifications over a layer of implementations, but I worry that the specification will be vague and that its implementations will drift apart.

@rctauber pushed some work here: https://github.com/rctauber/robot/commit/e2633061bf57c1d776d9e83fd17cf6849b2e52ad

The ReportOperation mostly just loads SPARQL queries, executes them, and reports problems. Each query has "title" and "severity" as structured metadata in comments. This is just a first draft, for discussion. The structured output in #231 is a good idea, but this draft doesn't implement that.

Since my main worry is performance, I'd like to compare this to #231. What would be a good test case?

cmungall commented 6 years ago

Cool. Can we test it on a SPARQL query for checking for references to dangling or obsolete classes?

jamesaoverton commented 6 years ago

@rctauber has run some initial performance comparisons. As expected, SPARQL is slower and takes more memory, but it's not as bad as I feared.

That's all with Jena/Fuseki. We're going to try Blazegraph now.

cmungall commented 6 years ago

You can ping @yy20716 if you want help optimizing queries

On 26 Jan 2018, at 6:17, James A. Overton wrote:

> @rctauber has run some initial performance comparisons. As expected, SPARQL is slower and takes more memory, but it's not as bad as I feared.
>
>   • For OBI, ECO, and GO, the SPARQL implementation takes about 2-3 times as long to run as the Java.
>   • With both OWLAPI and SPARQL representations in memory, we need about 100% more memory for OBI and ECO, but just 25% more memory for GO.
>
> That's all with Jena/Fuseki. We're going to try Blazegraph now.


jamesaoverton commented 6 years ago

After a couple of hours trying things out, Blazegraph was consistently slower and used more memory than Jena/Fuseki for this task. That's a bit of a surprise for me, since I usually find Blazegraph performs better. But in this case we're doing a lot of writes and not so many reads, so maybe that explains the difference. @rctauber will push her Blazegraph version to her repo, so you can see if we made any obvious mistakes.

Based on these tests, I think that the performance hit for SPARQL is acceptable. What does everyone else think?

cmungall commented 6 years ago

Do you have rough absolute times? I forget how long the Java takes.

beckyjackson commented 6 years ago

The times (in seconds) are an average of 3 executions via the command line. I killed the GO execution with Blazegraph after ~180 seconds.

            ECO    OBI     GO
Java        1.44   2.54    20.43
SPARQL      3.19   5.91    51.80
Blazegraph  5.32   11.57   ???

The most recent change to Blazegraph is here (using Futures): rctauber/robot@70ba125. I forgot to push the dependency, so I added that here: rctauber/robot@49b28eb

cmungall commented 6 years ago

@yy20716 will take a look at this to see if there are obvious ways to improve the blazegraph performance, and if there are obvious ways to optimize in general

yy20716 commented 6 years ago

@rctauber, if you don't mind, could you please let us know how we can reproduce your results in the table? I cloned your "reporting" branch but wonder how you switched among Java, SPARQL, and Blazegraph for testing this reporting feature. If you have a set of commands for testing these options, could you please let us know? Thank you.

beckyjackson commented 6 years ago

Hi @yy20716 - I built 3 separate JARs to test each of the methods:

If you'd like me to send you the JARs, I'd be happy to!

yy20716 commented 6 years ago

Hello @rctauber, it seems that the use of newCachedThreadPool() in getViolation leads to suboptimal performance in the case of Blazegraph. On my machine, the Blazegraph implementation later failed due to an OOM (after eating up 16 GB of memory). I also tried newFixedThreadPool(8), but it was faster when I limited the number of threads to 1, i.e., newFixedThreadPool(1).

I also observed that the use of nested FILTERs with NOT EXISTS causes issues. Here is one of the queries that takes several minutes in Blazegraph (even if we execute this query alone, without any threading). From my understanding, this query checks whether punctuation exists at the end of the object value described with IAO_0000115.

SELECT DISTINCT ?entity ?property ?value WHERE
  {VALUES ?property {obo:IAO_0000115}
   ?entity ?property ?value .
   FILTER NOT EXISTS {FILTER STRENDS(str(?value), ".")}}

The problem is that the input graph pattern is not explicitly given for the NOT EXISTS clause, so most query parsers interpret the clause as an operation that applies the filter to all triples, which is inefficient. These filtered triples that do not end with dots are then joined with triples whose property is obo:IAO_0000115; thus it is likely to take a significant amount of processing time. It seems that Jena somehow optimizes these operations, but Blazegraph processes the query based on the initial interpretation. If my interpretation is correct, this query can be simplified as follows.

SELECT DISTINCT ?entity ?property ?value WHERE
  {VALUES ?property {obo:IAO_0000115}
   ?entity ?property ?value .
   FILTER (!STRENDS(str(?value), "."))}
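To see why the two formulations keep the same rows, here is a toy model of the filter semantics in plain Python — strings stand in for ?value bindings, and this is only an illustration of the semantics, not of Jena or Blazegraph internals:

```python
# Sample ?value bindings: one definition with a final period, one without.
values = ["A membrane-bounded organelle.", "A definition missing its period"]

def inner_filter_solutions(value):
    # FILTER STRENDS(str(?value), ".") keeps the row's solution or drops it.
    return [value] if value.endswith(".") else []

# FILTER NOT EXISTS { ... } keeps a row exactly when the inner pattern
# produces no solutions for it...
not_exists = [v for v in values if not inner_filter_solutions(v)]

# ...which selects the same rows as the direct FILTER(!STRENDS(str(?value), ".")).
negated = [v for v in values if not v.endswith(".")]

assert not_exists == negated
print(not_exists)  # ['A definition missing its period']
```

The difference is purely in how much work a naive engine does to get there: the NOT EXISTS form invites a scan-and-join, while the direct negated filter is a single pass over the already-matched rows.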

There were other similar queries, so I modified them as well. This modification reduced the execution time to approx. 2 mins for processing all queries over go-edit.obo. I also modified other queries, such as reducing the number of UNION branches as much as I could. I forgot to rewrite the query formatted-annotation-violation.rq, but it can be further optimized by merging UNIONs and replacing REGEX with CONTAINS plus OR. All these modifications will probably also speed up Jena's execution to some degree, but I haven't compared the performance yet.

I made a pull request against your GitHub repository and would appreciate it if you could check whether the modifications make sense to you. Thank you.

beckyjackson commented 6 years ago

Thanks so much @yy20716 - I merged that PR and built with the changes. The modifications do make sense, and I appreciate the explanation.

Here are the new (your changes) vs. the old Blazegraph times (in seconds):

      Old     New
ECO   5.32    5.01
OBI   11.57   8.32
GO    ???     92.39

GO actually completed, which is a huge improvement. But, Jena still processes in ~40% less time.

yy20716 commented 6 years ago

I am really glad that it worked. Yes, it is well known that Jena is generally faster than other approaches, including Blazegraph, when the input data is small, i.e. up to 100-200 MB. Most ontologies are not that large (i.e. smaller than 200 MB), but Jena's performance quickly becomes worse once the data grows beyond that. I am not sure, but it may be safe to use Blazegraph if the processing time is still acceptable to you and the other members of this project; if not, Jena would be a reasonable choice. Thank you.

jamesaoverton commented 6 years ago

@cmungall Do you want to move forward with the Java approach (#231) or the SPARQL approach (https://github.com/rctauber/robot/tree/reporting)?

cmungall commented 6 years ago

SPARQL

cmungall commented 6 years ago

Nothing makes me happier than killing verbose Java code I have written. Closed the PR. Declarative all the way.

May want to copy this bit of docs back: https://github.com/ontodev/robot/pull/231/files#diff-20d33daf2a298665aebca3d66cf9771aR103

At the moment, master uses some procedural checks to warn about dangling axioms that could lead to reasoner incompleteness. This could probably all be moved over to the SPARQL tooling later.

beckyjackson commented 6 years ago

Do we want to leave this open to continue discussing the checks (since this feature is implemented now), or create a new discussion issue (or mailing list convo?)?

If we moved discussion, it could subsume #217 as well.

jamesaoverton commented 6 years ago

We should have more focused discussions on particular rules from now on.