owlcs / owlapi

OWL API main repository
Axiom ordering - sort or keep original order? #273

Closed ignazio1977 closed 8 years ago

ignazio1977 commented 10 years ago

Having chanced upon http://douroucouli.wordpress.com/2014/03/30/the-perils-of-managing-owl-in-a-version-control-system/ I wonder if the OWLAPI should alleviate the pain of diffs on Turtle/XML syntaxes.

The simplest solution I can think of is a counter on OWLObject that keeps track of the order in which the objects were created. Then, when sorting axioms, class expressions and what have you for output, use it together with the current criteria.

Example:

Ontology contains three equivalent axioms, one class assertion

During parsing, the axioms are numbered 1, 2, 3, 4

Add new equivalent axiom, numbered 5

Output order is 1, 2, 3, 5, 4

i.e., the new equivalent axiom is the last of the equivalent axioms list.
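
A minimal sketch of the example above, assuming hypothetical `Kind`, `Axiom` and `Factory` types (none of these are OWL API classes). It only illustrates the idea of sorting by the existing criterion first and using the creation counter as a tie-breaker:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class CreationOrderSketch {
    // Hypothetical categories, declared in the order they are rendered.
    enum Kind { EQUIVALENT_CLASSES, CLASS_ASSERTION }

    // Hypothetical stand-in for an axiom; not an OWL API type.
    static final class Axiom {
        final Kind kind;
        final long created; // creation sequence number

        Axiom(Kind kind, long created) {
            this.kind = kind;
            this.created = created;
        }
    }

    // Hands out creation numbers 1, 2, 3, ... as axioms are parsed or added.
    static final class Factory {
        private final AtomicLong counter = new AtomicLong();

        Axiom create(Kind kind) {
            return new Axiom(kind, counter.incrementAndGet());
        }
    }

    // Existing criterion (the axiom kind) first, creation order as tie-breaker.
    static final Comparator<Axiom> OUTPUT_ORDER =
        Comparator.comparing((Axiom a) -> a.kind).thenComparingLong(a -> a.created);

    static List<Long> outputOrder() {
        Factory factory = new Factory();
        List<Axiom> axioms = new ArrayList<>();
        axioms.add(factory.create(Kind.EQUIVALENT_CLASSES)); // 1 (parsed)
        axioms.add(factory.create(Kind.EQUIVALENT_CLASSES)); // 2 (parsed)
        axioms.add(factory.create(Kind.EQUIVALENT_CLASSES)); // 3 (parsed)
        axioms.add(factory.create(Kind.CLASS_ASSERTION));    // 4 (parsed)
        axioms.add(factory.create(Kind.EQUIVALENT_CLASSES)); // 5 (added later)
        axioms.sort(OUTPUT_ORDER);
        List<Long> ids = new ArrayList<>();
        for (Axiom axiom : axioms) {
            ids.add(axiom.created);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(outputOrder()); // prints [1, 2, 3, 5, 4]
    }
}
```

The counter only breaks ties within a kind, which is why axiom 5 lands after 3 and before the class assertion.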

What are your thoughts? @matthewhorridge @cmungall anyone else?

ansell commented 10 years ago

Given that OWLAPI Turtle and RDF/XML files are rendered based on categories (classes/individuals/etc.), the counter in that example may need to be localised to the category.

From a technical point of view, it shouldn't be difficult to implement a counter, as in practice all changes go through the OWLOntologyManager so there would just need to be an AtomicLong for each category, basically.

cmungall commented 10 years ago

Not totally sure I understand the numbering strategy. Would the ordering be lost if the ontology were roundtripped via a format that does not preserve the numbers?

My naive thoughts were that it would be possible to define a sort order on any set of constructs. For example:

ignazio1977 commented 10 years ago

Yes, defining an ordering is good, but it does not preserve the existing structure: if an ontology file with the "wrong" ordering is read, the output will not diff cleanly against the previous version. It will work well with successive versions, but there would need to be an 'update' step. It's basically the same problem you mention about the ordering being lost when roundtripping through another tool. I'm not sure there's a catch-all solution here.

cmungall commented 10 years ago

I guess I'm OK with that. But my bias is primarily to the VCS use case.

I can see how your proposal would be nice if people were hand-editing the files and there was some axiom ordering that was appealing to them & they wished to preserve it. But I think anyone hand-editing rdf/xml long term would be certifiable (we've all done it short term...)

matthewhorridge commented 10 years ago

Would these be implemented as separate comparators?

Perhaps ontologies, axioms, class expressions etc. and the objects that they contain should preserve the order that they are supplied with. Sorting on rendering, or whenever required, could just use the appropriate comparator.

I would actually like to have a well defined sort order for things like creating a digest of a set of axioms (unless there is a better way of doing this).
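
One way such a digest could work, as a rough sketch: sort a textual rendering of each axiom, then hash the sorted sequence. The `digest` helper and the idea of passing axioms as pre-rendered strings are assumptions made for illustration; this is not an OWL API facility.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

class AxiomDigestSketch {
    // Hypothetical helper: hashes a set of axioms given as rendered strings
    // (for example one functional-syntax line per axiom).
    static String digest(Collection<String> renderedAxioms) {
        // Sorting first makes the result independent of the input order.
        List<String> sorted = new ArrayList<>(renderedAxioms);
        Collections.sort(sorted);
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            for (String axiom : sorted) {
                sha.update(axiom.getBytes(StandardCharsets.UTF_8));
                // Separator so that adjacent axioms cannot run together;
                // assumes a rendering never contains a newline itself.
                sha.update((byte) '\n');
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String a = "SubClassOf(<http://example.org/A> <http://example.org/B>)";
        String b = "SubClassOf(<http://example.org/B> <http://example.org/C>)";
        // Same axioms in a different order produce the same digest.
        System.out.println(digest(Arrays.asList(a, b)).equals(digest(Arrays.asList(b, a))));
    }
}
```

Any total order would do here; the digest is only stable if every tool that computes it agrees on both the rendering and the comparator.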

sesuncedu commented 10 years ago

Sorting seems to give a big improvement in compression ratios (at least for FSS).

whitten commented 10 years ago

Since each axiom is true and they are effectively ANDed together, and since AND is commutative and associative, there should not be any "preferred" order for axioms. First Order Logic requires that they all be treated as if they have no particular order, so a topological sort should work fine.

ignazio1977 commented 10 years ago

Of course the semantics of the ontologies is unaffected by the order of axioms.

The point of this change is purely to minimise changes to the text output, for the greater good of text based version control systems and other non OWL aware tooling.

ignazio1977 commented 9 years ago

Simon sorted some of the syntaxes, save for Manchester (and the legacy ones, e.g., KRSS).

cmungall commented 9 years ago

@sesuncedu which version of the owlapi are these fixes in? Useful to know for ensuring everyone's Protege is in sync

cmungall commented 9 years ago

@ignazio1977 do you know?

ignazio1977 commented 9 years ago

Should be in all versions. I'll double check.

cmungall commented 9 years ago

What does all versions mean? I'm trying to figure out which versions of protege support this, and whether we need a new protege build

ignazio1977 commented 9 years ago

All of the most recent versions: 3.5.2, 4.0.2 and version 5 master. It's included in the 4.1.0 release candidate as well.

From past experience, Protege 4.3 and 5 can be adapted to use 3.5.2 by dropping the 3.5.2 osgidistribution jar in the protege plugins folder.

cmungall commented 9 years ago

I am using Protege 5beta18 snapshot, it saves with this version of the owlapi:

<!-- Generated by the OWL API (version 3.5.3.20150903-2211) http://owlapi.sourceforge.net -->

Yet we still get spurious diffs, e.g. https://github.com/oborel/obo-relations/commit/f9e17bf16fedaf316c9ebafe1e7f3d0ef5873ce7

This is in RDF/XML. I'm going to re-open as my understanding was that the intent was to implement deterministic ordering for non-legacy syntaxes (unless rdf/xml is considered legacy...)

Feel free to re-close but let me know where this is fully implemented

ignazio1977 commented 9 years ago

I seem to have missed a commit on 3.5.2 when I checked. My bad.

cmungall commented 9 years ago

OK, was that just for rdf/xml or does it affect all?

ignazio1977 commented 9 years ago

Not sure yet, looks like Turtle and RDF/XML

sesuncedu commented 9 years ago

I just finished slouching in to Bethlehem so I'm not really brain-enabled, but I think the relevant code is in one of the base RDF renderers. (I know that in version 4 it changed blank node ids for the Rio writers, since I had to adjust test cases.)

ignazio1977 commented 9 years ago

@cmungall I've fixed the issue, but one problem you'll see for that ontology is that the next save will still introduce random changes - the previous versions were not sorted. After that things should normalize.

I'll put a Protege build with the updated jar up for evaluation once I'm done.

ignazio1977 commented 9 years ago

@sesuncedu one thing I'm not clear about is the change to RDFXMLRenderer

private void writeCommentForEntity(String msg, OWLEntity entity) {
    checkNotNull(entity, msg);
    String iriString = entity.getIRI().toString();
    String labelString = labelMaker.getShortForm(entity);
    String commentString = null;
    if (!iriString.equals(labelString)) {
        commentString = labelString;
    } else {
        commentString = iriString;
    }
    writer.writeComment(XMLUtils.escapeXML(commentString));
}

If I interpret the results correctly, this will change the banner comment in XML files to use the label (or one of the labels) of the entity being written out. That sounds like a great idea to me, but it will also introduce a number of changes to existing ontologies. Was the intention to make this configurable?

ignazio1977 commented 9 years ago

Now fixed in the version3 branch, I've used the ontology linked by @cmungall to verify and cherry picked the manchester syntax sorting as well. The sorting test is now the same for version 3 and 4.

I've not enabled @sesuncedu's change to use a label in RDF/XML banner for entities, as this would introduce more changes in the output. I'm planning to add it and make it switchable.

ignazio1977 commented 9 years ago

Test build available here: https://github.com/ignazio1977/protegetests/blob/master/Protege-5.0.0-beta-18-SNAPSHOT_owlapi354-snapshot.zip

Public-Health-Bioinformatics commented 7 years ago

I couldn't quite tell from the thread, so could someone summarize the sort order employed now in OWLAPI/Protege >= 5.0.0-beta-18? Is it deterministic down to the triple? Does it parallel being able to sort an XML document by tag name, then by attribute name and value, then by content, or some such? It sounds like OWLAPI sorts before it writes out to various formats, which sounds great.

In other words, now we do have diff'able ontology output via OWLAPI and Protege, with no caveats?

I appreciate all the work done on this!

sesuncedu commented 7 years ago

It should now be deterministic.

Hedge case:

It is still possible for small ontology-level changes to generate disproportionately large textual changes. I believe that this should only occur if the output format explicitly renders all blank nodes (e.g. N-Triples), and the ontology-level change alters the number of blank nodes required to render some axioms that have other axioms rendered after them.

There's not much that can be done about this, as these blank nodes don't exist at the OWL level. Fortunately these blank nodes are not explicitly rendered in most formats.
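
A toy illustration of that effect, assuming a renderer that numbers blank nodes sequentially in output order. The `_:genid` labels and the `render` and `changedLines` helpers are hypothetical and only model the numbering, not any real OWL API writer:

```java
import java.util.ArrayList;
import java.util.List;

class BlankNodeShiftSketch {
    // Renders one line per axiom; blankNodesPerAxiom[i] is the number of
    // blank nodes that axiom i needs. Ids are assigned in output order.
    static List<String> render(int[] blankNodesPerAxiom) {
        List<String> lines = new ArrayList<>();
        int nextId = 1;
        for (int i = 0; i < blankNodesPerAxiom.length; i++) {
            StringBuilder line = new StringBuilder("axiom" + i + ":");
            for (int n = 0; n < blankNodesPerAxiom[i]; n++) {
                line.append(" _:genid").append(nextId++);
            }
            lines.add(line.toString());
        }
        return lines;
    }

    // Counts positions at which two renderings differ (a crude line diff).
    static int changedLines(List<String> before, List<String> after) {
        int shared = Math.min(before.size(), after.size());
        int changed = Math.abs(before.size() - after.size());
        for (int i = 0; i < shared; i++) {
            if (!before.get(i).equals(after.get(i))) {
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<String> original = render(new int[] {1, 1, 1, 1});
        // The first axiom now needs one more blank node: every later id shifts.
        System.out.println(changedLines(original, render(new int[] {2, 1, 1, 1}))); // prints 4
        // The same change on the last axiom touches only that line.
        System.out.println(changedLines(original, render(new int[] {1, 1, 1, 2}))); // prints 1
    }
}
```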

Some metrics using GO are in my 2015 owled paper - http://cgi.csc.liv.ac.uk/~valli/OWLED2015/OWLED_2015_paper_12.pdf

ignazio1977 commented 7 years ago

To the extent that it can be tested, it is deterministic and tested to stay so. As @sesuncedu said, this is not an absolute guarantee, mainly because of blank nodes. However, blank node ids are generated in sequence when parsing and are used in sorting blank nodes, so corner cases should be fairly uncommon.

Node identity comes after a number of other factors; ordering is implemented as follows:

Sequences of axioms or any other OWL objects are sorted by type first, then by values of contained properties/expressions, down to IRI (alphabetical) when necessary. Most of the time this is enough to have stable order.
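
The ordering described above can be sketched roughly as a comparator. The `Axiom` class, its numeric `typeIndex` and its flat list of IRIs are simplifying assumptions for illustration; the real OWL API objects are nested and compared recursively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class StableOrderSketch {
    // Hypothetical flattened axiom: a type index plus the IRIs it mentions.
    static final class Axiom {
        final int typeIndex;
        final List<String> iris;

        Axiom(int typeIndex, String... iris) {
            this.typeIndex = typeIndex;
            this.iris = Arrays.asList(iris);
        }

        @Override
        public String toString() {
            return typeIndex + " " + iris;
        }
    }

    // Type first, then contained values position by position, down to the
    // alphabetical order of IRIs; a shorter operand list sorts first on a tie.
    static final Comparator<Axiom> ORDER = (a, b) -> {
        int c = Integer.compare(a.typeIndex, b.typeIndex);
        if (c != 0) {
            return c;
        }
        int shared = Math.min(a.iris.size(), b.iris.size());
        for (int i = 0; i < shared; i++) {
            c = a.iris.get(i).compareTo(b.iris.get(i));
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.iris.size(), b.iris.size());
    };

    // Sorts a copy of the input and renders one string per axiom.
    static List<String> render(List<Axiom> axioms) {
        List<Axiom> sorted = new ArrayList<>(axioms);
        sorted.sort(ORDER);
        List<String> out = new ArrayList<>();
        for (Axiom axiom : sorted) {
            out.add(axiom.toString());
        }
        return out;
    }
}
```

Because the comparator never looks at insertion order, any permutation of the same axioms renders to the same sequence of lines.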

cmungall commented 7 years ago

Everything has been working perfectly for me for the last year or so.

Public-Health-Bioinformatics commented 7 years ago

Great, thanks for this feedback.