Closed gouttegd closed 8 months ago
@gouttegd thank you so much for doing this analysis, and sorry for apparently not adequately testing this change before the release! It looks like you're already implementing some fixes. Let me know if I can help. I will be away at a project meeting most of this week, so I probably can't do much over the next few days.
Given the fact that the ID cache hides the problem away in most cases, I don’t think any amount of testing on small files would have caught the issue. We only found it when using the new ROBOT on a “real world” file (Uberon).
I believe I indeed have a fix ready, but I want to test it intensively before we go through the whole cycle of releasing a new OWL API, a new ROBOT, a new ODK, and a new Protégé.
@gouttegd thanks for the excellent analysis and the pull request. Keeping this open to update the other branches.
Since PR #1099 (present in OWL API 4.5.26), it is supposedly possible to use arbitrary annotation properties in OBO “qualifier tags”, e.g. one can do this:
to annotate the
SubClassOf
axiom with ahttp://purl.obolibrary.org/obo/PROP_456
property.However, this does not always work. For example, in Uberon, the
IAO:0000116
tag is most of the times translated correctly ashttp://purl.obolibrary.org/obo/IAO_0000116
, but sometimes translated (incorrectly) ashttp://www.geneontology.org/formats/oboInOwl#0000116
— which among other consequences makes the ontology impossible to serialise in RDF/XML, because the RDF/XML writer tries to write the annotation as a<oboInOwl:0000116>
XML tag (which it cannot do because0000116
is not a valid XML tag name).The root cause of the issue lies in the oboIdToIRI_load() method, which I believe does not behave as intended. When called with the
oboInOwlDefault
argument set totrue
, it completely ignores the existing prefix and forcibly translate the Curie into thehttp://www.geneontology.org/formats/oboInOwl#
namespace (lines 1729–1731).That is,
oboIdToIRI_load("IAO:0000116", true)
will returnhttp://www.geneontology.org/formats/oboInOwl#0000116
.I don’t see how this can be the desired behaviour. The
oboInOwlDefault
argument should only have an effect in cases where the tag is not qualified (e.g., to translate{source="lorem ipsum"}
intohttp://www.geneontology.org/formats/oboInOwl#source
). When the tag includes an explicit prefix (e.g.{IAO:0000116="lorem ipsum"}
), the prefix should always be taken into account regardless of theoboInOwlDefault
value.To make things worse, the issue is in many cases hidden by the use of the
idToIRICache
. When translating any OBO ID to an IRI, that cache is queried before any call tooboIdToIRI_load()
. But importantly, theidiToIRICache
does not grow indefinitely: when it reaches a certain size, older entries are removed from it.I believe what happens in the aforementioned Uberon issue is something along the lines of:
editor_note
type definition, which contains a cross-reference toIAO:0000116
. The ID is translated correctly (theoboIdToIRI_load
method is called withoboInOwlDefault
set tofalse
in this context) and put into the ID cache. This always happens first, because even though type definitions are physically located at the end of an OBO file, theTypedef
frames are always translated before theTerm
frames.Term
frame that contains aIAO:0000116="..."
qualifier, the cache is queried for theIAO:0000116
ID. At the beginning, the cache will still contain that ID and will therefore return the correct IRI.IAO:0000116
ID will be dropped from the cache. Then the next time the parser will have to translate aIAO:0000116="..."
qualifier, the cache will miss, andoboIdToIRI_load
will be called to perform the translation withoboInOwlDefault
set totrue
, thereby resulting in the ID being incorrectly translated into thehttp://www.geneontology.org/formats/oboInOwl#
namespace.This would explain why only some
IAO:0000116
annotations are incorrectly translated. This would also explain why the unit test included in PR #1099 didn’t catch the issue: when the parser needs to translate this bit:the
MYONT:20
ID is still in the ID cache (associated with the correct IRI) because it has been correctly translated when processing thesource
type definition:and the test file is not large enough for the ID to have been removed from the cache.