cmungall closed this issue 6 years ago.
Proposed name for this: `report`. The output product could be called a "report card".
Should this command also produce standardized exports of things like tables of all classes and their labels, or is this overloading? Would that belong in `convert`?
I've used SPARQL for this. It's important to distinguish internal from external (imported) terms.
I'm fine with including this in `report`.
Some suggestions on potential output. For checks, these could be treated as informative for now. Working with obo-ops and the OBO community, we could come up with varying levels of stringency. For some, we can imagine different profiles, with the profile (e.g. `obo-basic`) declared in the header.
- general reporting
- checks
- repair
(e.g. all definitions end in "." [OR: none end in "."])
https://github.com/geneontology/go-ontology/blob/master/src/util/check-obo-for-standard-release.pl
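Checks of this kind (e.g. "every definition ends in a period") are easy to prototype outside SPARQL as well; here is a minimal, hypothetical Python sketch, where the term IDs and definitions are illustrative only:

```python
# Hypothetical sketch of a "definitions end in '.'" check.
# The term IDs and definition texts below are illustrative, not real data.
def check_definition_punctuation(definitions):
    """Return IDs of terms whose definition does not end in '.'."""
    return [term_id for term_id, text in definitions.items()
            if not text.rstrip().endswith(".")]

definitions = {
    "GO:0000001": "The distribution of mitochondria into daughter cells.",
    "GO:0000002": "The maintenance of the structure of the mitochondrial genome",
}
violations = check_definition_punctuation(definitions)
```

The inverse profile ("none end in '.'") would just flip the `endswith` test.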
Currently the GO reports give a lot of reference violations deep in the import chain. We know this is the case, so we would rather have an option to suppress reporting reference violations. @dougli1sqrd
An example of how this is working (all on #231 at the moment).
Run report on an ontology:
```
$ robot report -i go-edit.obo -o report.yaml
ERROR REPORT FAILED! Violations: 18016
ERROR Severity 1 violations: 18006
ERROR Severity 2 violations: 10
ERROR Severity 3 violations: 0
ERROR Severity 4 violations: 0
ERROR Severity 5 violations: 0
```
@jamesaoverton - so far `-o` has always meant output ontology. Should we use a different arg, or stick with Unix convention?
Here is a sample of the report. We use the JSON subset of YAML, so JSON is trivial too if you want the verbosity. It should also be easy to make nice HTML or Markdown from this, but that is a separate concern.
The report is in sections; first, reference issues:
```yaml
problemsReport:
  invalidReferenceViolations:
  - axiom: "SubClassOf(<http://purl.obolibrary.org/obo/UBERON_0001343> ObjectSomeValuesFrom(<http://purl.obolibrary.org/obo/BFO_0000067>\
      \ <http://purl.obolibrary.org/obo/GO_0007126>))"
    referencedObject: "<http://purl.obolibrary.org/obo/GO_0007126>"
    category: "DEPRECATED"
    type: "reference violation"
    description: "SubClassOf(<http://purl.obolibrary.org/obo/UBERON_0001343> ObjectSomeValuesFrom(<http://purl.obolibrary.org/obo/BFO_0000067>\
      \ <http://purl.obolibrary.org/obo/GO_0007126>))"
    severity: 2
  - axiom: "DisjointUnion(Annotation(rdfs:comment \"Every biological process is either\
      \ a single-organism process or a multi-organism, but never both (GOC:mtg_obo_owl_Jan2013)\"\
      ) <http://purl.obolibrary.org/obo/GO_0008150> <http://purl.obolibrary.org/obo/GO_0044699>\
      \ <http://purl.obolibrary.org/obo/GO_0051704> )"
    referencedObject: "<http://purl.obolibrary.org/obo/GO_0044699>"
    category: "DEPRECATED"
    type: "reference violation"
    description: "DisjointUnion(Annotation(rdfs:comment \"Every biological process\
      \ is either a single-organism process or a multi-organism, but never both (GOC:mtg_obo_owl_Jan2013)\"\
      ) <http://purl.obolibrary.org/obo/GO_0008150> <http://purl.obolibrary.org/obo/GO_0044699>\
      \ <http://purl.obolibrary.org/obo/GO_0051704> )"
    severity: 2
```
Next is ontology header issues. Bad GO! Lots of stuff is missing from the header. Note that ontology header checks are nice to do in SPARQL, since we can run the same query across all ontologies in the triplestore, but a bit of redundancy is OK for the convenience of the user here:
```yaml
ontologyMetadataViolations:
- description: "cardinality of dc:creator"
  severity: 1
  type: "ontology metadata violation"
- description: "cardinality of dc:title"
  severity: 1
  type: "ontology metadata violation"
- description: "cardinality of dc:description"
  severity: 1
  type: "ontology metadata violation"
```
Then class metadata:
```yaml
classMetadataViolations:
- description: "|<http://purl.obolibrary.org/obo/IAO_0000117>|=0 LESS_THAN 1"
  severity: 1
  cls: "<http://purl.obolibrary.org/obo/CHEBI_60941>"
  type: "class metadata violation"
...
- description: "|rdfs:label|=0 LESS_THAN 1"
  severity: 1
  cls: "<http://purl.obolibrary.org/obo/GO_0005860>"
  type: "class metadata violation"
...
```
I think these are false positives; it should report dangling classes or deprecated classes.
The YAML directly mirrors the simple POJO we have for the ReportCard; see the Java for details.
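Since the report uses the JSON subset of YAML, it can be post-processed with a plain JSON parser; here is a minimal Python sketch that tallies violations by severity, reproducing the kind of summary `robot report` prints (the inlined report fragment just mirrors the samples above, trimmed for brevity):

```python
import json
from collections import Counter

# Minimal sketch: tally violations by severity from a report emitted as JSON.
# The report fragment mirrors the YAML samples in this thread, trimmed down.
report_json = """
{"problemsReport": {"invalidReferenceViolations": [
  {"referencedObject": "<http://purl.obolibrary.org/obo/GO_0007126>",
   "category": "DEPRECATED", "type": "reference violation", "severity": 2},
  {"referencedObject": "<http://purl.obolibrary.org/obo/GO_0044699>",
   "category": "DEPRECATED", "type": "reference violation", "severity": 2}
]}}
"""

report = json.loads(report_json)
counts = Counter(
    violation["severity"]
    for section in report["problemsReport"].values()
    for violation in section
)
for severity in sorted(counts):
    print(f"Severity {severity} violations: {counts[severity]}")
```

A pass/fail decision (like the `REPORT FAILED!` line above) could then be a simple threshold on these counts.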
Regarding `-o`: we're using `-O` for `--output-dir` in `query`.
The key question is chaining, I think. Do you (the user) want to run a chain of commands ending with `report`, so that `report` also saves the ontology? If we reuse `-o`, then that won't work.
I think an `-r`/`--report` option might be better, but I don't feel strongly about it. We already have larger inconsistencies than this.
I tried this on OBI Core. It ran quickly and seemed to work correctly. Some thoughts and comments:
Having a machine-readable report is good. Maybe we can have a simple "ignore" sheet.
If we use IRIs everywhere, then only a few of us can just read the reports without another layer.
I got items like this:
```yaml
- description: "|<http://purl.obolibrary.org/obo/IAO_0000115>|=0 LESS_THAN 1"
  severity: 1
  cls: "<http://purl.obolibrary.org/obo/BFO_0000001>"
  type: "class metadata violation"
```
It's true that BFO entity does not have a definition, but that's not my fault -- it only has an elucidation.
I think that each rule should have a URL and documentation that explains what the problem is and how to fix it.
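The "ignore" sheet idea could be as simple as a list of (rule, subject) pairs that are filtered out before the summary is computed. A hypothetical Python sketch, where the rule IDs and record shape are invented for illustration:

```python
# Hypothetical sketch of an "ignore sheet": known, accepted violations are
# filtered out before reporting. Rule IDs and record fields are illustrative.
violations = [
    {"rule": "missing-definition", "cls": "obo:BFO_0000001", "severity": 1},
    {"rule": "missing-label", "cls": "obo:GO_0005860", "severity": 1},
]

ignore_sheet = {
    # BFO 'entity' has an elucidation rather than a definition; accepted.
    ("missing-definition", "obo:BFO_0000001"),
}

remaining = [v for v in violations
             if (v["rule"], v["cls"]) not in ignore_sheet]
```

The sheet itself could live as a TSV or YAML file next to the ontology, reviewed in version control like any other edit.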
At the top of the issue Chris presented two alternatives: declarative and procedural. PR #231 takes a procedural approach, with the rules written in Java. I'd prefer a declarative approach using SPARQL.
Things like `InvalidReferenceChecker` require Java, and that's fine, but the metadata and cardinality checks in #231 should work fine with SPARQL. When SPARQL gets ugly, we can consider small extensions: https://jena.apache.org/documentation/query/extension.html
I have a bunch of reasons for preferring a declarative approach, but perhaps the most persuasive is that we want a low barrier of entry for reading and writing rules, so that we can get as many developers as possible using and contributing to the rule set. I think that more of our developers can work with SPARQL than Java. Another consideration is that SPARQL is much easier to re-use in other contexts than the Java in #231.
My biggest worry is that SPARQL performance will be much worse than Java performance.
@cmungall: Is it OK for @rctauber and me to build a SPARQL-based alternative to #231 so that we can compare the two approaches? I think we can do it this week.
Simple cardinality checks should work fine in SPARQL. Some of the logic may depend on configuration, which is awkward. E.g. if the ontology is in the GO lineage, definition axioms should be annotated; if in the OBI lineage, then a definition_editor field should be present. I suppose we can encourage this configuration to go as metadata in the ontology header so it is introspectable in the query, but I'd like this to work out of the box for all ontologies.
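The profile idea could amount to a mapping from profile name to the set of active checks, keyed off a declaration in the ontology header. A hypothetical Python sketch (the profile names, check names, and header key are all invented for illustration):

```python
# Hypothetical sketch: select which checks run based on a profile declared
# in the ontology header. Profile and check names are invented.
PROFILES = {
    "obo-basic": {"definition-cardinality", "label-cardinality"},
    "go-lineage": {"definition-cardinality", "label-cardinality",
                   "definition-axiom-annotated"},
}

def active_checks(header_annotations, default="obo-basic"):
    """Pick the check set from a (hypothetical) profile header annotation,
    falling back to a default so every ontology works out of the box."""
    profile = header_annotations.get("profile", default)
    return PROFILES.get(profile, PROFILES[default])

checks = active_checks({"profile": "go-lineage"})
```

The fallback default is what makes this "work out of the box": an ontology with no profile declaration still gets the basic checks.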
@dougli1sqrd and I started out going a pure SPARQL route for everything, but things like the obsolete reference checks are very difficult. Using extensions seems to defeat the purpose. SHACL/ShEx may be promising for some kinds of checks, but they don't seem mature enough at this stage.
I agree in principle that SPARQL is more declarative; I'm not totally sure that we'd attract more developers. I know @drseb and @pnrobinson, for example, have already written lots of Java code for procedural checks of HPO and are comfortable with this.
Anyway, I think it's OK to have an alternative for a subset of checks.
> I tried this on OBI Core...
IRIs - yes, I'm being lazy and using generic Jackson pass-throughs to turn the Java objects into strings for now; none of my users would use this without the labels.
We also need to make sure that MIREOTed entities are not checked; currently it assumes an import strategy. The logic here for what to check and when may get quite complex. As non-declarative as Java is, having the ability to define reusable methods/subroutines will help a lot here...
We should also look at OOPS! (the OntOlogy Pitfall Scanner). But many of these patterns are under- or over-specified for us. E.g.:
- Most OBOs don't include instances; these are assigned separately
- We have a principled check for this
- We should explicitly check for OBO naming conventions
- Oops, every OBO fails this!
- We agree with this one!
@jamesaoverton @rctauber @dougli1sqrd @kltm - in terms of documenting the checks, what about:
- option A
- option B
The advantage of B is that we can treat this more as documentation, include sections with headers, examples, etc. This would be very much like the gorules system https://github.com/geneontology/go-site/tree/master/metadata/rules
In either case we could have PURLs for the rules, so they could be embedded in RDF metadata descriptions of results, SPARQL CONSTRUCTs, etc.
As I was tagged: I have a strong preference for "A", as it is an easier-to-parse and more standard format, and document extraction could still trivially be done on a single field. I am not a fan of the GO rules setup.
I agree that we need all that metadata, that it should be structured, and that rules should have PURLs.
I'm not sure about the file format. In a previous project I used plain SPARQL (and SQL) files for validation rules, and I put the structured metadata (which might as well be YAML) and a longer description as a comment at the top. So the focus was on having a file that you could just run as a query. Call that option C. @rctauber is working on some code to compare with #231, as we discussed above, and it will work this way.
I don't have a strong preference among A,B,C at the moment. It will be easy enough to convert any of them to nice HTML pages.
The decision between A/B/C in part depends on what the goal is here. Is it providing a specification or an implementation+metadata? What I think we need here is a specification. Not everything can be implemented in SPARQL, and we may want different implementations for different contexts. Even with trivial SPARQL, the query may have to be modified e.g. to run on a multi-ontology triplestore.
For a specification, things like formatting and ease of editing/PRs by a variety of stakeholders with varying degrees of technical expertise are of utmost importance. Requiring a microlibrary for parsing yamldown seems like a minimal imposition on developers. However, editing SPARQL inside any kind of embedding is super annoying. The same is true to some extent for any complex nested YAML, but I think this would be fairly minimal flat YAML.
Another option would be decoupling: manage each check as a folder/dir, each minimally containing a Markdown description file and a metadata file:
```
checks/
  001/
    index.md
    meta.yaml
    default.rq
    alt.rq
```
This could be seamlessly dropped into the obo jekyll framework but comes with its own annoyances.
@cmungall I definitely had implementation+metadata in mind, not just a specification. That's part of the reason I've been pushing for SPARQL over Java, because SPARQL can be used in more contexts. I could be convinced that we need a layer of specifications over a layer of implementations, but I worry that the specification will be vague and that its implementations will drift apart.
@rctauber pushed some work here: https://github.com/rctauber/robot/commit/e2633061bf57c1d776d9e83fd17cf6849b2e52ad
The ReportOperation mostly just loads SPARQL queries, executes them, and reports problems. Each query has "title" and "severity" as structured metadata in comments. This is just a first draft, for discussion. The structured output in #231 is a good idea, but this draft doesn't implement that.
Since my main worry is performance, I'd like to compare this to #231. What would be a good test case?
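Pulling "title" and "severity" out of comment headers is straightforward to implement; here is a sketch of one way to do it in Python, assuming a simple `# key: value` comment convention (the exact format used in the draft may differ):

```python
# Sketch: extract "key: value" metadata from the leading comment block of a
# SPARQL query file. The "# key: value" convention is an assumption; the
# actual format in the draft linked above may differ.
def parse_query_metadata(query_text):
    meta = {}
    for line in query_text.splitlines():
        if not line.startswith("#"):
            break  # metadata only appears in the leading comment block
        body = line.lstrip("#").strip()
        if ":" in body:
            key, _, value = body.partition(":")
            meta[key.strip()] = value.strip()
    return meta

query = """\
# title: definition cardinality
# severity: 1
SELECT ?entity WHERE { ?entity ?p ?o }
"""
meta = parse_query_metadata(query)
```

Keeping the metadata in comments means the file remains a plain `.rq` query that can be run directly, which is the appeal of "option C" above.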
Cool. Can we test it on a SPARQL query for checking for references to dangling or obsolete classes?
@rctauber has run some initial performance comparisons. As expected, SPARQL is slower and takes more memory, but it's not as bad as I feared.
That's all with Jena/Fuseki. We're going to try Blazegraph now.
You can ping @yy20716 if you want help optimizing queries
On 26 Jan 2018, at 6:17, James A. Overton wrote:

> @rctauber has run some initial performance comparisons. As expected, SPARQL is slower and takes more memory, but it's not as bad as I feared.
>
> - For OBI, ECO, and GO, the SPARQL implementation takes about 2-3 times as long to run as the Java.
> - With both OWLAPI and SPARQL representations in memory, we need about 100% more memory for OBI and ECO, but just 25% more memory for GO.
>
> That's all with Jena/Fuseki. We're going to try Blazegraph now.
After a couple of hours trying things out, Blazegraph was consistently slower and used more memory than Jena/Fuseki for this task. That's a bit of a surprise for me, since I usually find Blazegraph performs better. But in this case we're doing a lot of writes and not so many reads, so maybe that explains the difference. @rctauber will push her Blazegraph version to her repo, so you can see if we made any obvious mistakes.
Based on these tests, I think that the performance hit for SPARQL is acceptable. What does everyone else think?
Do you have rough absolute times? I forget how long the Java takes.
The times, in seconds, are an average of 3 executions via the command line. I killed the GO execution with Blazegraph after ~180 seconds.
| | ECO | OBI | GO |
|---|---|---|---|
| Java | 1.44 | 2.54 | 20.43 |
| SPARQL | 3.19 | 5.91 | 51.80 |
| Blazegraph | 5.32 | 11.57 | ??? |
The most recent change to Blazegraph is here (using Futures): rctauber/robot@70ba125. I forgot to push the dependency, so I added that here: rctauber/robot@49b28eb
@yy20716 will take a look at this to see if there are obvious ways to improve the Blazegraph performance, and if there are obvious ways to optimize in general.
@rctauber, if you don't mind, could you please let us know how we can reproduce your results in the table? I cloned your "reporting" branch but wonder how you switched among Java, SPARQL, and Blazegraph for testing this reporting feature. If you have a set of commands for testing these options, could you please let us know? Thank you.
Hi @yy20716 - I built 3 separate JARs to test each of the methods:
If you'd like me to send you the JARs, I'd be happy to!
Hello @rctauber, it seems that the use of `newCachedThreadPool()` in `getViolation` leads to the suboptimal performance in the case of Blazegraph. On my machine, the Blazegraph implementation later failed with an OOM (after eating up 16GB of memory). I also tried `newFixedThreadPool(8)`, but it was actually faster when I limited the number of threads to 1, i.e., `newFixedThreadPool(1)`.
I also observed that the use of nested FILTERs with NOT EXISTS causes issues. Here is one of the queries that takes several minutes in Blazegraph (even when we execute only this query, without any threading). From my understanding, this query checks whether punctuation exists at the end of the object value described with IAO_0000115:
```sparql
# (prefix declaration added here for completeness; the original assumed it)
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT DISTINCT ?entity ?property ?value WHERE {
  VALUES ?property { obo:IAO_0000115 }
  ?entity ?property ?value .
  FILTER NOT EXISTS { FILTER STRENDS(str(?value), ".") }
}
```
The problem is that no input graph pattern is explicitly given for the NOT EXISTS clause, so most query parsers interpret it as applying the filter to all triples, which is inefficient. These filtered triples that do not end in dots are then joined with the triples whose property is obo:IAO_0000115; thus it is likely to take a significant amount of processing time. It seems that Jena somehow optimizes these operations, but Blazegraph processes the query based on the initial interpretation. If my interpretation is correct, this query can be simplified as follows:
```sparql
# (prefix declaration added here for completeness; the original assumed it)
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT DISTINCT ?entity ?property ?value WHERE {
  VALUES ?property { obo:IAO_0000115 }
  ?entity ?property ?value .
  FILTER (!STRENDS(str(?value), "."))
}
```
There were other similar queries, so I modified them as well. This modification reduced the execution time to approximately 2 minutes for processing all queries over go-edit.obo. I also modified other queries, e.g. reducing the number of UNION branches as much as I could. I forgot to rewrite the query `formatted-annotation-violation.rq`, but it can be further optimized by merging UNIONs and replacing REGEX with CONTAINS combined with OR. All these modifications will probably also optimize Jena's execution to some degree, but I haven't compared the performance yet.
I made a pull request at your Github repository and would appreciate it if you could check whether the modification makes sense to you. Thank you.
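The REGEX-to-CONTAINS rewrite works because an unanchored alternation of fixed literals matches exactly the strings containing one of those literals, so the regex engine can be replaced by plain substring tests. A quick Python demonstration of the equivalence (the property labels are illustrative):

```python
import re

# An unanchored alternation of literal strings ("part|regulates") matches
# exactly the strings containing one of the literals; the CONTAINS-with-OR
# rewrite exploits this, replacing the regex engine with substring search.
values = ["has part", "part of", "regulates", "capable of"]

regex_hits = [v for v in values if re.search("part|regulates", v)]
contains_hits = [v for v in values
                 if ("part" in v) or ("regulates" in v)]
```

The equivalence only holds when every alternative is a fixed string with no anchors, classes, or quantifiers, which is the case for the label checks discussed here.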
Thanks so much @yy20716 - I merged that PR and built with the changes. The modifications do make sense, and I appreciate the explanation.
Here are the new (with your changes) vs. the old Blazegraph times:
| | Old | New |
|---|---|---|
| ECO | 5.32 | 5.01 |
| OBI | 11.57 | 8.32 |
| GO | ??? | 92.39 |
GO actually completed, which is a huge improvement. But Jena still processes it in ~40% less time.
I am really glad that it worked. Yes, it is well known that Jena is generally faster than other approaches, including Blazegraph, when the input data is small, i.e. up to 100-200MB. Most ontologies are not that large (i.e. smaller than 200MB), but Jena's performance quickly degrades once the data gets bigger. I am not sure, but maybe it would be safe to use Blazegraph if the processing time is still acceptable to you and the other members of this project. If not, Jena would be a reasonable choice. Thank you.
@cmungall Do you want to move forward with the Java approach (#231) or the SPARQL approach (https://github.com/rctauber/robot/tree/reporting)?
SPARQL
Nothing makes me happier than killing verbose java code I have written. Closed the PR. Declarative all the way
May want to copy this bit of docs back: https://github.com/ontodev/robot/pull/231/files#diff-20d33daf2a298665aebca3d66cf9771aR103
At the moment, master uses some procedural checks to warn about dangling axioms that could lead to reasoner incompleteness. This could probably all be moved over to the SPARQL tool later.
Do we want to leave this open to continue discussing the checks (since this feature is implemented now), or create a new discussion issue (or mailing list convo?)?
If we moved discussion, it could subsume #217 as well.
We should have more focused discussions on particular rules from now on.
ROBOT performs logical checks using `reason` and can run a custom suite of SPARQL queries using `verify`. Some repos successfully use this in combination with Makefiles and a custom suite of queries to implement robust checks in their pipeline (e.g. GO, and also OSK). We should make it easier for groups to do this out of the box. While customization is good, ROBOT should still provide an OBO-ish core set of queries and reports that could be used by OBO central for evaluation purposes. Groups could still customize and say, for example, "allow some classes to lack definitions".
In GO we are still very reliant on this script; it depends on OBO format, but it is a good model for a set of potential standard checks: https://github.com/geneontology/go-ontology/blob/master/src/util/check-obo-for-standard-release.pl
This could be implemented by ROBOT distributing some SPARQL in its resources folder (or a standard OBO place to fetch these) and/or procedural code (which is sometimes easier than SPARQL).