owlcs / owlapi

OWL API main repository
828 stars 315 forks source link

Sort annotation assertions on output #332

Closed sesuncedu closed 9 years ago

sesuncedu commented 9 years ago

I can't remember if this is fixed in 4.X, but in 3.5.1, annotation assertions are not sorted deterministically on output.

This leads to spuriously large diffs for things like GO; e.g. one day's worth of changes might yield 8,378 additions/deletions (FSS); sorting shows only 348 lines of diff.

sesuncedu commented 9 years ago

Examples:

***************
*** 43584,43591 ****
  AnnotationAssertion(oboInOwl:hasOBONamespace obo:GO_0000137 "cellular_component"^^xsd:string)
  SubClassOf(obo:GO_0000137 obo:GO_0031985)
  AnnotationAssertion(oboInOwl:id obo:GO_0000138 "GO:0000138"^^xsd:string)
- AnnotationAssertion(Annotation(oboInOwl:hasDbXref "GOC:mah"^^xsd:string) oboInOwl:hasRelatedSynonym obo:GO_0000138 "late Golgi"^^xsd:string)
  AnnotationAssertion(Annotation(oboInOwl:hasDbXref "ISBN:0815316194"^^xsd:string) obo:IAO_0000115 obo:GO_0000138 "The Golgi cisterna farthest from the endoplasmic reticulum; the final processing compartment through which proteins pass before exiting the Golgi apparatus; the compartment in which N-linked protein glycosylation is completed."^^xsd:string)
  AnnotationAssertion(oboInOwl:hasOBONamespace obo:GO_0000138 "cellular_component"^^xsd:string)
  AnnotationAssertion(rdfs:label obo:GO_0000138 "Golgi trans cisterna"^^xsd:string)
  SubClassOf(obo:GO_0000138 obo:GO_0031985)
--- 43593,43600 ----
  AnnotationAssertion(oboInOwl:hasOBONamespace obo:GO_0000137 "cellular_component"^^xsd:string)
  SubClassOf(obo:GO_0000137 obo:GO_0031985)
  AnnotationAssertion(oboInOwl:id obo:GO_0000138 "GO:0000138"^^xsd:string)
  AnnotationAssertion(Annotation(oboInOwl:hasDbXref "ISBN:0815316194"^^xsd:string) obo:IAO_0000115 obo:GO_0000138 "The Golgi cisterna farthest from the endoplasmic reticulum; the final processing compartment through which proteins pass before exiting the Golgi apparatus; the compartment in which N-linked protein glycosylation is completed."^^xsd:string)
+ AnnotationAssertion(Annotation(oboInOwl:hasDbXref "GOC:mah"^^xsd:string) oboInOwl:hasRelatedSynonym obo:GO_0000138 "late Golgi"^^xsd:string)
  AnnotationAssertion(oboInOwl:hasOBONamespace obo:GO_0000138 "cellular_component"^^xsd:string)
  AnnotationAssertion(rdfs:label obo:GO_0000138 "Golgi trans cisterna"^^xsd:string)
  SubClassOf(obo:GO_0000138 obo:GO_0031985)
ignazio1977 commented 9 years ago

I've surely tried to address ordering of statements but that's one multiheaded beast. Examples welcome, cheers. I'll have a GO.

sesuncedu commented 9 years ago

Rdf/XML is worse (as required my law).

I don't know that the order has to be "right" - it just has to be consistent. There is a compression win for completely sorted axioms in FSS, which prioritizes annotations over annotation subjects, but that is likely to to be worse for VCS deltas.

SPO-A (where A is annotations in PO-A order) seems like a good choice.

ignazio1977 commented 9 years ago

I agree that consistency is important. However, it's harder to accomplish consistency with the existing files AND with the newly created files - we'd need somehow to preserve information on the original order, if that order does not match a known ordering. Which is evil, as ironically existing poorly ordered ontologies would generate large diffs until all their copies get roundtripped through the new ordering, which might take a while.

Anyway, rants aside, the intention has always been to achieve reliable, repeatable ordering, as economically as possible; with the option of preserving the exact input order. This is not guaranteed yet, though.

sesuncedu commented 9 years ago

Yeah - at the moment sorting of assertions seems to be by subject, but then non-deterministic (which would match what I remember of the implementation). I think I will try sorting the per-subject annotation assertions and see what the output looks like/diffs to (and will try and remember how much is special cased for rdf/xml).

Simon

matthewhorridge commented 9 years ago

There are some guidelines for how OBO files should be sorted (e.g. labels first). They seem quite sensible, so I suggest we follow these. I'll try and dig out the link to the document.

cmungall commented 9 years ago

http://oboformat.googlecode.com/svn/trunk/doc/GO.format.obo-1_2.html#S.3.5

Of course this is easier to do in obo format which has severe limitations on nesting

tgbugs commented 9 years ago

Hi all. We are in the process of developing a set of ontologies for the Human Brain Project that will be regularly updated. We would really appreciate this feature. Having human readable git diffs as a result of deterministic output (matching input is less of a concern) would be enormously useful for our project and others who would like to use vcs diffs as a quick mechanism to track and verify changes. Thanks! Tom

sesuncedu commented 9 years ago

Owl FSS and the RDF formats are in sorted in 4.0.2. Manchester is on the to-do list. Git deltas work much better with sorted output, though some of the web hosting sites can be reluctant to look at diffs for large files (even if the delta is just a few lines). (for example, Bitbucket requires an extra confirmation.) On May 4, 2015 9:09 PM, "Tom Gillespie" notifications@github.com wrote:

Hi all. We are in the process of developing a set of ontologies for the Human Brain Project that will be regularly updated. We would really appreciate this feature. Having human readable git diffs as a result of deterministic output (matching input is less of a concern) would be enormously useful for our project and others who would like to use vcs diffs as a quick mechanism to track and verify changes. Thanks! Tom

— Reply to this email directly or view it on GitHub https://github.com/owlcs/owlapi/issues/332#issuecomment-98897810.

cmungall commented 9 years ago

This is great, thanks Simon.

Is the sorting strategy documented?

I'd like to avoid the situation whereby the strategy drifts over time, and we get spurious diffs when people use different versions of the owlapi and/or Protege. Obviously there has to be at least one such switch point per concrete syntax but I'd rather avoid multiple changes & the coordination headache

tgbugs commented 9 years ago

Ah excellent, thank you! I guess this means I should stop by and heckle the protege devs to integrate 4.0.2 support (or try to build protege with 4.0.2 myself). Thanks again! Tom

ignazio1977 commented 9 years ago

The Protege people are well aware :-) Matthew needs another functionality integrated in the OWL API before he can finish the update. In the meantime, you can try my experimental build, it has 4.0.2 and the basic plugins. https://github.com/ignazio1977/protegetests/blob/master/protege-distribution-5.0.0-beta-18-SNAPSHOT-platform-independent.zip

sesuncedu commented 9 years ago

That reminds me- should the protégé version be bumped to 6, since the default OS*I version range from maven is [minor, next-major).

(BTW, OSGI and maven version ranges are differently broken wrt qualified versions etc. - OSGI considers 5.0.0-beta-SNAPSHOT >5.0.0, maven doesn't).

ignazio1977 commented 9 years ago

Manchester syntax now included. Sorting is based on the short form providers for IRIs, and on the compareTo() methods for all other OWLObjects. For any entities in the same namespaces, the two orderings are equivalent. For entities not in the same namespaces, results may vary if namespace and namespace abbreviations are not in the same relationships.

This fact probably needs a quick check, as different output syntaxes might rely on slightly different sortings. As long as prefixes are kept through roundtrips, though, the results are reliable in terms of diffs.

(Prefixes are kept by default now; entity substitution is still optional and needs to be turned on).

ignazio1977 commented 9 years ago

One notable exception to sorting is in SWRL rules heads and bodies; although they are ultimately made of OWLObjects (unless variables), they are not sorted and input order is preserved. See #140 for background.

sesuncedu commented 9 years ago

Also,  role chains...  I caught that one before the semicolon,  so technically it doesn't count.

I have been thinking about whether rendering with different options for ordering might be nice,  especially for the human friendly formats like OMN. The big downside is that this has the potential to generate very large diffs for VCS systems,  since a change in the asserted hierarchy causes a lot of transpositions.  Of course,  if the document is being formatted for human consumption, these are real changes, not random. (Not many VCS diff algorithms have transposition as an explicit edit.)

Some support code that would be useful would be more tunable walkers. Specifically:

  1. Improved control over the order in which axioms are visited. Sorting is part of this; it's also useful to be able to be notified about group boundaries.
  2. Improved control over the order in which multi-valued fields are visited.
  3. Improved control over which fields are visited. This is easy enough to do with method references in jdk8, but is a challenge for Hotspot.
  4. Improved support for common behavior based on supertypes. A lot of duplicated visitor code can be saved if code handling n-ary operands, annotations, etc could be dispatched to automatically. This can be handled in an adapter, but is conceptually tied in to walkers wrt the order in which fields are traversed.

It would be nice to be able to pass a second argument to visit & accept, but that is a separate issue.

It might also be nice to have a mapping walker, which would handle constructing new objects from the old. A destructive Walker could also be handy (remove axioms / ontologies etc after they are processed).


[I've been doing a lot of experiments on the effect of different ordering on compression rates for binary owl and variants. The answer is a lot :)  Just using the standard sort on axioms gets about a 20% win.

For GO, a sorted literal table compresses better than one ordered by IRIs associated subject or axiom in sort of hierarchical order,  but with delta coding the overall compressed size is far worse. Based on the average of the mean 4-gram overlap coefficient, looking at a sliding window of 100 strings centered on each string,   I'm currently closer to the sorted score than to a random shuffle,  so at least it's not pessimal.]

(Also, trying to generate a listing of the top 5 globally best matches for each string, using all cores, with the discrete GPU active, is not a good idea on a late 2011 MacBook Pro.
Fortunately Apple already lost/settled the class action, so the repair is free.

O(n^2) is in PTIME. It's still O(n^2).)

ignazio1977 commented 9 years ago

Lots of useful suggestions, do you mind if I put them in separate issues? Otherwise we risk them drowning under the weight of the MarkDown above it.

ignazio1977 commented 9 years ago

8e209d3f91da292c9f654f99ff1d6212f432352a is related to this.