owlcs / owlapi

OWL API main repository

OWLAPI Writers add xsd:string randomly. #1004

Closed matentzn closed 2 years ago

matentzn commented 3 years ago

We have recorded the issue here: https://github.com/protegeproject/protege/issues/993#issuecomment-882637425

The same versions of OWL API and operating systems cause completely different patterns of change. You can see that there are 4 pull requests from the same Protege versions causing widely different serialisations of the edited file.

@ignazio1977 do you think this is fixed in a later version of owlapi 4?

matentzn commented 3 years ago

Edit: the same goes for RDFXML syntax as well.

ignazio1977 commented 3 years ago

I'll try to narrow down on which versions this happens. So far, it looks like the random nature depends on what's in the input. Declared types should be preserved, plain literals without types should stay plain.

matentzn commented 3 years ago

But the input is identical? I mean 4 people with the same OWL API version serialise the same ontology and get different serialisations?

One super odd thing we realised is that in all cases, only oio:hasDbXref axiom annotations on other axioms seem to have the problem.

AnnotationAssertion(Annotation(oboInOwl:hasDbXref "GOC:mtg_heart"^^xsd:string) obo:IAO_0000115 obo:CL_0010005 "A specialized cardiomyocyte that transmit signals from the AV node to the cardiac Purkinje fibers.")

At least those are all the cases we have observed so far. This seems... super odd.
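To check whether two serialisations of an axiom differ only in the explicit datatype, the lines can be compared after stripping the `^^xsd:string` suffix. Under RDF 1.1 a literal with no datatype and no language tag is treated as an xsd:string, so the two forms denote the same value. The following is a minimal Python sketch for functional syntax lines. It is not part of OWL API or ROBOT, the function names are made up for illustration, and it would also strip the suffix if it happened to appear inside a literal's text.

```python
import re

# Matches the explicit xsd:string datatype suffix in OWL functional syntax,
# in prefixed or full-IRI form. Only the suffix is removed, not the literal.
_XSD_STRING = re.compile(
    r'\^\^(?:xsd:string|<http://www\.w3\.org/2001/XMLSchema#string>)'
)


def strip_xsd_string(line: str) -> str:
    """Return the line with every explicit xsd:string suffix removed."""
    return _XSD_STRING.sub("", line)


def differs_only_by_xsd_string(a: str, b: str) -> bool:
    """True if the lines differ, but become equal once the suffix is stripped."""
    return a != b and strip_xsd_string(a) == strip_xsd_string(b)
```

Running both versions of a file through `strip_xsd_string` before diffing gives a rough idea of how much of a patch is datatype noise and how much is real change.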

ignazio1977 commented 3 years ago

My best guess at this moment is that their classpaths differed - possibly even to the extent of having clashing OWLAPI versions in there.

I'm going to write something to cycle through all OWLAPI 4 versions and see if that sheds any light.

matentzn commented 3 years ago

I am gathering that intel now for us.

matentzn commented 3 years ago

Ok the more I work on this, the more these things become obvious:

  1. All ontologies directly affected are managed in OBO format (uberon, fbbt), and converted to RDFXML or functional syntax with ROBOT. Indirectly affected ontologies depend on ontologies managed directly in OBO format (like CL)
  2. 99.5% of the diff is due to missing xsd:string, pertaining only to annotation properties in the oboInOwl syntax
  3. Using OWLAPI 4.5.19 alleviates some, but not all of the problems - there is still quite some diff, and the OBO format also has some mistakes now (mainly redundant quoting, see https://github.com/obophenotype/cell-ontology/issues/1175#issuecomment-885737083)

ignazio1977 commented 3 years ago

Ok, so, to work with examples:

I can't find an fbbt file but if there's one please point me at it @matentzn

ignazio1977 commented 3 years ago

On uberon_import, there's one roundtrip difference that's due to multiple qualifiers with the source name: when sorted, these values switch from the order in the input. One source is http://..., one is DOI..., but in the stored file the http is always first, and when ordering annotations it always comes last.

Bit of a problem, as I don't think we can avoid a large patch set just to ensure the order of these is consistent; I guess the original was created with an OWLAPI version that didn't sort annotations, or it was created with a different tool.

We can work around this with a different sort criterion on the Clause class, but we'd need to figure out what it would be. Something like "always sort source qualifier values with url values first"? Seems a bit too ad hoc to me but it depends on what other tooling/specs expect. @cmungall any hint or preferences?

(we can always just read and save, accept a large patch set, and carry on from there)
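For illustration only, the "url values first" idea can be expressed as a sort key. This is a Python sketch, not the Clause class in OWL API; the function name and the sample values are hypothetical.

```python
def source_sort_key(value: str):
    """Sort key that places http(s) URLs before every other source value.

    The first tuple element groups URLs ahead of non-URLs; the second keeps
    plain string ordering inside each group.
    """
    is_url = value.lower().startswith(("http://", "https://"))
    return (0 if is_url else 1, value)


# Hypothetical source qualifier values.
sources = ["DOI:10.1000/example", "http://example.org/source"]
url_first = sorted(sources, key=source_sort_key)
```

A plain string sort puts the DOI value first, because uppercase "D" sorts before lowercase "h"; with the key above the URL comes first, as in the stored file.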

matentzn commented 3 years ago

Thank you @ignazio1977 -

This is a good file to play with:

Other things I observed today:

ignazio1977 commented 3 years ago

Oh interesting: re 'unnecessary' quoting, that seems the result of a pull request a while ago: Update OBOFormatWriter.java to emit OBO 1.4 compliant text #833

Seems like in OBO 1.4 that's expected behaviour.

matentzn commented 3 years ago

@ignazio1977

Is there any chance at all to avoid:

[image]

I have now I think exhausted all possibilities for controlling the diffs, but this one I cannot see a way to solve it.

ignazio1977 commented 3 years ago

Those are blank node ids remapped on reading - usually those ids would not be written at all, unless there are ambiguities, such as use of the same node in multiple places because of annotations.

By default their values are remapped on parse, that's why they change over time; it is possible to switch off that remapping, but it comes with dangers - the remapping is there to avoid clashes between blank node ids in imported ontologies and when axioms are moved between ontologies (or RDF files are 'imported' and they end up included in the importing ontology - a funny spec feature that has caused many a surprise)

To disable, set ConfigurationOptions.REMAP_IDS to false (default is true). This can be done by setting an environment variable, amending a file, or changing the setup for OWLOntologyManager::getOntologyConfigurator. These all do the same job, but the environment variable applies to all ontology loading operations (depending on whether you set the variable programmatically in your main app or at OS level, the scope will be all managers in the current VM or all managers in ALL VMs on the same computer), while the ontology manager level setup applies to just that manager.

Tread carefully, as using this option might have unforeseen effects over time.
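If the goal is only to judge whether two serialisations differ in substance, a different approach that leaves REMAP_IDS alone is to relabel the generated ids in both files before diffing. The Python sketch below is not an OWL API feature. It assumes the ids look like `genid` followed by digits, and it only helps when the blank nodes appear in the same order in both files.

```python
import re

# Assumed shape of generated blank node ids, e.g. genid13573.
_GENID = re.compile(r'\bgenid\d+\b')


def relabel_genids(text: str) -> str:
    """Rename each distinct genid by order of first appearance (genid1, genid2, ...)."""
    mapping = {}

    def replace(match):
        label = match.group(0)
        if label not in mapping:
            mapping[label] = f"genid{len(mapping) + 1}"
        return mapping[label]

    return _GENID.sub(replace, text)
```

Two files that differ only in the numbers assigned to the same blank nodes compare equal after relabelling; any remaining diff is something other than id remapping.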

ignazio1977 commented 3 years ago

So far I've got to sort the literals reliably - seems to be the only actionable issue I can see, the rest is trouble with poor sorting in previous API versions.

matentzn commented 3 years ago

I think I would rather solve this issue by getting rid of these ambiguities.

So in protege, I see this as one example where a genid is introduced.

[image]

These are the relevant pieces in the RDFXML:

<owl:Class rdf:about="http://purl.obolibrary.org/obo/UBERON_0000003">
    <rdfs:subClassOf rdf:nodeID="genid13573"/>
</owl:Class>

.....

<owl:Restriction rdf:nodeID="genid13573">
    <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/BFO_0000050"/>
    <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/UBERON_0000033"/>
</owl:Restriction>
<owl:Axiom>
    <owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/UBERON_0000003"/>
    <owl:annotatedProperty rdf:resource="http://www.w3.org/2000/01/rdf-schema#subClassOf"/>
    <owl:annotatedTarget rdf:nodeID="genid13573"/>
    <oboInOwl:source rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ZFA</oboInOwl:source>
</owl:Axiom>

I don't quite understand how this causes an ambiguity - how would I remove the ambiguity at source (short of removing the axiom annotation altogether)? Is there another axiom somewhere that gets confused by this SubClassOf axiom?

matentzn commented 3 years ago

Or is it that RDFXML simply cannot have blank nodes as annotatedTarget or annotatedSource? ... Hmmm that must be it.

ignazio1977 commented 3 years ago

The problem there is that the same node is referenced in two places in the file. Yes, RDF/XML and axiom annotations :-( can't be avoided without losing the annotation. The only workaround I can suggest is to create a named class equivalent to the restriction and use that in its place, that means the blank node only occurs once.

ignazio1977 commented 3 years ago

(I assume using a different serialization language isn't a possibility? Functional syntax is way better on annotations.)

ignazio1977 commented 2 years ago

Another recent issue has provided an example that revealed the bug fix for #640 never made it to version 4. It's included in the version4 branch now, will be released with 4.5.22. This would have been the unnecessary xsd:string popping up.

matentzn commented 2 years ago

Awesome, we were just about to upgrade all our tools, so we will do that to 4.5.22. Thank you!

ignazio1977 commented 2 years ago

@matentzn 4.5.22 released

jclerman commented 1 year ago

Hi @ignazio1977, thanks for your work on this!

I just spent a while trying to track down which version(s) of OWLAPI have fixed the issue of xsd:string being added "randomly". Looks like the fix is in OWLAPI 5.1.2 and 4.5.22. The former (5.1.2) I was able to determine by looking at the changelog in the README, but it took longer to figure out that the fix is also in 4.5.22. Could the README be updated (maybe makes sense to update it in both the default version5 branch as well as the affected version4 branch?) to cover the changes in recent versions of OWLAPI 4.x? Looks like the last documented version is 4.5.20. Thanks!