owlcs / owlapi

OWL API main repository
821 stars 315 forks source link

OBO parser: Treat value of `replaced_by` tag as an IRI #1072

Closed gouttegd closed 1 year ago

gouttegd commented 2 years ago

This PR is a proof-of-concept fix for the issue mentioned in https://github.com/INCATools/ontology-development-kit/issues/642, i.e. the fact that the value of a replaced_by tag in a OBO file is translated into a literal string, instead of an IRI as expected.

The fix has two parts:

balhoff commented 2 years ago

@gouttegd this is great! I suggest handling consider the same way as replaced_by.

cmungall commented 2 years ago

@gouttegd - does your PR fix idspace expansion for ALL CURIEs, or only CURIEs that are used in replaced_by?

gouttegd commented 2 years ago

@cmungall For now, only CURIEs that are used in replaced_by and consider clauses.

I have another branch (not pushed anywhere yet) in which I test replacing CURIEs also in xref clauses, but I found that to be somewhat trickier. I can push that to another PR if you and others want to have a look at it.

balhoff commented 2 years ago

I did some tests and found that this change also deals with term IDs using non-OBO namespaces. PRO (http://purl.obolibrary.org/obo/pr.obo) has statements like this:

idspace: HGNC http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=

And terms:

id: HGNC:100
name: ASIC1 (human)
namespace: gene
def: "A protein coding gene ASIC1 in human." [PRO:DNx]
comment: Category=external.
is_a: SO:0001217 ! protein_coding_gene
relationship: only_in_taxon NCBITaxon:9606 ! Homo sapiens

With the old code this is parsed into:

# Class: <http://purl.obolibrary.org/obo/HGNC_100> (ASIC1 (human))

AnnotationAssertion(Annotation(<http://www.geneontology.org/formats/oboInOwl#hasDbXref> "PRO:DNx") <http://purl.obolibrary.org/obo/IAO_0000115> <http://purl.obolibrary.org/obo/HGNC_100> "A protein coding gene ASIC1 in human.")
AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#hasOBONamespace> <http://purl.obolibrary.org/obo/HGNC_100> "gene")
AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HGNC_100> "HGNC:100")
AnnotationAssertion(<http://www.w3.org/2000/01/rdf-schema#comment> <http://purl.obolibrary.org/obo/HGNC_100> "Category=external.")
AnnotationAssertion(<http://www.w3.org/2000/01/rdf-schema#label> <http://purl.obolibrary.org/obo/HGNC_100> "ASIC1 (human)")
SubClassOf(<http://purl.obolibrary.org/obo/HGNC_100> <http://purl.obolibrary.org/obo/SO_0001217>)
SubClassOf(<http://purl.obolibrary.org/obo/HGNC_100> ObjectSomeValuesFrom(<http://purl.obolibrary.org/obo/RO_0002160> <http://purl.obolibrary.org/obo/NCBITaxon_9606>))

With this PR it becomes:

# Class: <http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=100> (ASIC1 (human))

AnnotationAssertion(Annotation(<http://www.geneontology.org/formats/oboInOwl#hasDbXref> "PRO:DNx") <http://purl.obolibrary.org/obo/IAO_0000115> <http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=100> "A protein coding gene ASIC1 in human.")
AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#hasOBONamespace> <http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=100> "gene")
AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=100> "HGNC:100")
AnnotationAssertion(<http://www.w3.org/2000/01/rdf-schema#comment> <http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=100> "Category=external.")
AnnotationAssertion(<http://www.w3.org/2000/01/rdf-schema#label> <http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=100> "ASIC1 (human)")
SubClassOf(<http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=100> <http://purl.obolibrary.org/obo/SO_0001217>)
SubClassOf(<http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=100> ObjectSomeValuesFrom(<http://purl.obolibrary.org/obo/RO_0002160> <http://purl.obolibrary.org/obo/NCBITaxon_9606>))

That's really nice. How much do we want to try to handle roundtrip with these changes? If you read in that OBO file, and then write it out as OBO, you get a huge owl-axioms header containing axioms like:

AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=100> \"HGNC:100\")

and the term stanzas look like:

[Term]
id: http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=100
name: ASIC1 (human)
namespace: gene
def: "A protein coding gene ASIC1 in human." [PRO:DNx]
comment: Category=external.
is_a: SO:0001217 ! protein_coding_gene
relationship: only_in_taxon NCBITaxon:9606 ! Homo sapiens
shawntanzk commented 2 years ago

Do we care about round-tripping (OBO -> OWL -> OBO results in difference) Currently if we don't care about that, it basically works, if not we need to figure out how to get that to work - this will take a lot more work, and we aren't sure if this is worth it? (we cannot guarantee that we will be able to regenerate an OBO file the same as the original at the OBO file) example of where this is an issue: If you use a obo edit file and have a full release in owl that gets converts back into obo file -> ugly URLs instead of nice names Note: this PR as it stands is opt-in Ideal situation: find solution for curie round-tripping @matentzn to give the few options

gouttegd commented 2 years ago

To clarify:

So in this case, “round-tripping” a OBO file (converting from OBO to any other format and then back to OBO again) can and most likely will produce a different OBO file, where IRIs are either condensed into different CURIEs than the ones in the original file, or not condensed at all.

This is the “opt-in” bit mentioned by @shawntanzk above: Non-OBO CURIEs are only expanded to non-OBO IRIs if the editors explicitly asked for it by using a idspace. If such an expansion is unwanted, don’t use idspace. This is a slight extension of the meaning of the idspace tag (from “this prefix is mapped to this IRI scheme” to “this prefix is mapped to this IRI scheme – and by saying so, I am hereby asking the parser to expand any instance of this prefix”), but I believe it makes sense.

So I guess the options are:

A) Statu quo. Don’t expand CURIEs in replaced_by and similar tags, and don’t do anything with idspace tags.

B) Expand CURIEs whenever possible, but without using idspace prefix maps — assume CURIEs are OBO-style CURIEs. This will cause all CURIEs to be expanded to OBO-style IRI (http://purl.obolibrary.org/obo/PREFIX_LOCALID), which may not always be desired — but on the other hand this will ensure the expanded IRIs can always be condensed back into their original CURIEs.

C) Expand CURIEs whenever possible, using prefix maps from idspace tags when present (basically, what this PR is doing). Do not care (at least for now) about what happens to non-OBO CURIEs when writing back to OBO, just tell editors to avoid using idspace if expansion into non-OBO IRIs is not desired.

D) Expand CURIEs whenever possible, using prefix maps from idspace tags when present, and making sure the corresponding expanded IRIs can always be condensed back to their original CURIE forms. This requires a way to convey the prefix maps from one format to another, so that we can provide those maps to the OBO writer.

gouttegd commented 2 years ago

I also want to add that if we go for option D, this should allow to also expand CURIEs in xref tags. The main reason I have not done so in this PR is because cross-references are much more likely to contain non-OBO CURIEs, and so we would be much more likely to run into problems like the one highlighted above.

If we have a way to always let the OBO writer know how to condense arbitrary IRIs into CURIEs, then this would not be a problem anymore.

matentzn commented 2 years ago

Let's talk a bit about option D before diving into xref, which is furiously discussed here.

  1. Trying to piggy-back on the RDF prefix mechanism which most of the serialiser use. This would involve a the assumption that rdf prefix declaration are the same as prefix maps
  2. Inject a SHACL-based prefix map into the ontology header (can OWLAPI even pass this through?) which is recognised by the OBO parsers (but only the OBO parser). SHACL has a standard vocabulary for prefix declaration, see here for an example
  3. Create a custom RDF extension for idspace maps as ontology annotations, much the same way as all other OBO<-> OWL mappings are implemented. Something like: `http://purl.obolibrary.org/obo/mondo.owl oio:idspace "HGNC: https://identifiers.org/hgnc:". This looks similar to ROBOTs --prefix declation and could be viable if 2 is not doable for OWL API reasons.
  4. Externalise the prefix mapping by requiring a context file as an additional input when interpreting prefix strings in an OBO file. We already have mechanisms to parse these. This may need a new process like "generate context for OBO file()".

I have a slide bias to 3 right now, but I am totally open to arguments in all directions.

cmungall commented 2 years ago

I am strongly in favor of @matentzn's option 2. If the shacl namespace vocabulary had existed when we wrote the spec we would have used it.

I tried robot convert -I http://obofoundry.org/registry/obo_prefixes.ttl and it works with all formats (note: for obo serialization it needs to be annotated with an ontology ID)

gouttegd commented 2 years ago

Proof-of-concept (mostly to check that I understand option 2 correctly, the code is ugly):

The following OBO file:

format-version: 1.2
ontology: test
idspace: myprefix http://example.org/myp

[Term]
id: test:001
name: test term
replaced_by: myprefix:002

gives the following OFN file, where the idspace tag has been used to (a) expand the myprefix:002 prefix into http://example.org/myp002 and (b) add SHACL annotations:

Ontology(<http://purl.obolibrary.org/obo/test.owl>
Annotation(<http://www.geneontology.org/formats/oboInOwl#hasOBOFormatVersion> "1.2")
Annotation(<http://www.w3.org/ns/shacl#declare> _:genid2147483648)

Declaration(Class(<http://purl.obolibrary.org/obo/test_001>))
Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/IAO_0100001>))
Declaration(AnnotationProperty(<http://www.geneontology.org/formats/oboInOwl#hasOBOFormatVersion>))
Declaration(AnnotationProperty(<http://www.geneontology.org/formats/oboInOwl#id>))
Declaration(AnnotationProperty(<http://www.w3.org/2000/01/rdf-schema#label>))
Declaration(AnnotationProperty(<http://www.w3.org/ns/shacl#declare>))
Declaration(AnnotationProperty(<http://www.w3.org/ns/shacl#namespace>))
Declaration(AnnotationProperty(<http://www.w3.org/ns/shacl#prefix>))
############################
#   Annotation Properties
############################

# Annotation Property: <http://purl.obolibrary.org/obo/IAO_0100001> (term replaced by)

AnnotationAssertion(<http://www.w3.org/2000/01/rdf-schema#label> <http://purl.obolibrary.org/obo/IAO_0100001> "term replaced by")

# Annotation Property: <http://www.geneontology.org/formats/oboInOwl#hasOBOFormatVersion> (has_obo_format_version)

AnnotationAssertion(<http://www.w3.org/2000/01/rdf-schema#label> <http://www.geneontology.org/formats/oboInOwl#hasOBOFormatVersion> "has_obo_format_version")

############################
#   Classes
############################

# Class: <http://purl.obolibrary.org/obo/test_001> (test term)

AnnotationAssertion(<http://purl.obolibrary.org/obo/IAO_0100001> <http://purl.obolibrary.org/obo/test_001> <http://example.org/myp002>)
AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/test_001> "test:001")
AnnotationAssertion(<http://www.w3.org/2000/01/rdf-schema#label> <http://purl.obolibrary.org/obo/test_001> "test term")

AnnotationAssertion(<http://www.w3.org/ns/shacl#namespace> _:genid2147483648 "http://example.org/myp")
AnnotationAssertion(<http://www.w3.org/ns/shacl#prefix> _:genid2147483648 "myprefix")
)

When converting back to OBO, the SHACL annotations are used to (a) condense back the expanded IRI into its original CURIE form, and (b) restore the original idspace tag:

format-version: 1.2
idspace: myprefix http://example.org/myp 
ontology: test

[Term]
id: test:001
name: test term
replaced_by: myprefix:002

Is that what you both had in mind?

cmungall commented 2 years ago

perfecto!!!!

gouttegd commented 2 years ago

Great. In that case I don’t see any obstacle against option 2.

The first part (idspace to SHACL annotations, when parsing a OBO file) is trivial to do.

The second part (SHACL to idspace tags and IRIs to CURIEs, when writing a OBO file) requires more work to be done properly (for now I had to do some violence to the Owl2Obo code to get it to work), but is definitely doable.

Of course, if it can be done with SHACL annotations, then logically it could also be done with any other annotation intended to store prefix mappings, so option 3 remains a possibility. But I don’t see the point of creating another annotation, when we already have the SHACL ones.

matentzn commented 2 years ago

Out of curiosity, how does:

AnnotationAssertion(<http://www.w3.org/ns/shacl#namespace> _:genid2147483648 "http://example.org/myp")
AnnotationAssertion(<http://www.w3.org/ns/shacl#prefix> _:genid2147483648 "myprefix")

work with owlapi? what is the type of _:genid2147483648?

gouttegd commented 2 years ago

It’s a OWLAnonymousIndividual. The ID is automatically generated by OWLDataFactory.getAnonymousIndividual().

It would be possible to use an explicit IRI instead (the SHACL specification explicitly allows to use either a IRI or a blank node), but since its only purpose is to “link” the prefix annotation with the namespace annotation, I don’t think there would be anything to gain in doing that?

matentzn commented 2 years ago

Would be interesting to experiment how this would affect module extraction for example. Most likely it would not - since they are ontology annotations, you will have to copy them forward if you want to extract -> convert. Also, a default behaviour for namespace clashes is necessary when multiple ontologies with prefix maps are merged.

gouttegd commented 2 years ago

About conflicting prefix maps: ontologies are merged after they have been parsed, so by the time two (or more) OBO-format ontologies with conflicting maps are merged, the CURIEs they contain will have already been expanded into IRIs — so the resulting merged ontologie will at least contain correct IRIs.

The problem will be when the merged ontology is saved in OBO format. Then:

The case of conflicting namespaces is obviously the most problematic one, since it will lead to incorrect CURIEs. Not sure what the best course of action would be to deal with that case:

gouttegd commented 2 years ago

Independently of the risk of conflicts, I note that something interesting is happening when ontologies are merged with --include-annotations false.

Let’s say we have the following SHACL annotations in ontology A:

Annotation(sh:declare _:genid2147483648)
AnnotationAssertion(sh:namespace _:genid2147483648 "http://example.org/foo")
AnnotationAssertion(sh:prefix _:genid2147483648 "fo")

And the following annotations in ontology B:


Annotation(sh:declare _:genid2147483648)
AnnotationAssertion(sh:namespace _:genid2147483648 "http://example.org/bar")
AnnotationAssertion(sh:prefix _:genid2147483648 "ba")

The merged ontology will contain:


Annotation(sh:declare _:genid2147483648)
AnnotationAssertion(sh:namespace _:genid2147483648 "http://example.org/foo")
AnnotationAssertion(sh:prefix _:genid2147483648 "fo")
AnnotationAssertion(sh:namespace _:genid2147483649 "http://example.org/bar")
AnnotationAssertion(sh:prefix _:genid2147483649 "ba")

Notice how only the ontology annotation from ontology A (the one linking to the fo prefix mapping) is present. The ontology annotation from B (linking to the ba prefix mapping) is absent, as expected with --include-annotations false. However, the annotations on the anonymous individual from ontology B, are present.

Because they are not “linked” to an ontology annotation, they would not be found when we are looking for SHACL prefix maps, so they cannot cause a conflict (so the meaning of the --include-annotations false option is preserved). But I wonder if we should try to get rid of such “floating” annotations?

cmungall commented 1 year ago

@gouttegd - are you sure that is the behavior on merging with blank nodes? That would indicate a problem with the owlapi AFAIK

I think it's good to think through behavior with these annotations. We need to define behavior in the case of conflicting prefixes. I suggest

However it should be left as a problem for the user to make sure the shacl namespaces are well behaved. Merging is one way to introduce problem states, but I don't think this should be treated as an obo format issue. We can have some lightweight checks and operations at the robot level.

gouttegd commented 1 year ago

@cmungall

are you sure that is the behavior on merging with blank nodes? That would indicate a problem with the owlapi AFAIK

This is what I observe when merging the following file:

test1.ofn ``` Prefix(oboInOwl:=) Prefix(sh:=) Prefix(rdfs:=) Ontology( Annotation(oboInOwl:hasOBOFormatVersion "1.2") Annotation(sh:declare _:genid2147483648) Declaration(Class()) Declaration(AnnotationProperty(oboInOwl:hasOBOFormatVersion)) Declaration(AnnotationProperty(oboInOwl:id)) Declaration(AnnotationProperty(rdfs:label)) Declaration(AnnotationProperty(sh:declare)) Declaration(AnnotationProperty(sh:namespace)) Declaration(AnnotationProperty(sh:prefix)) AnnotationAssertion(rdfs:label oboInOwl:hasOBOFormatVersion "has_obo_format_version") AnnotationAssertion(oboInOwl:id "test:001") AnnotationAssertion(rdfs:label "test term") AnnotationAssertion(sh:namespace _:genid2147483648 "http://example.org/foo") AnnotationAssertion(sh:prefix _:genid2147483648 "fo") ) ```

with this one (almost identical, but different prefix map):

test2.ofn ``` Prefix(oboInOwl:=) Prefix(sh:=) Prefix(rdfs:=) Ontology( Annotation(oboInOwl:hasOBOFormatVersion "1.2") Annotation(sh:declare _:genid2147483648) Declaration(Class()) Declaration(AnnotationProperty(oboInOwl:hasOBOFormatVersion)) Declaration(AnnotationProperty(oboInOwl:id)) Declaration(AnnotationProperty(rdfs:label)) Declaration(AnnotationProperty(sh:declare)) Declaration(AnnotationProperty(sh:namespace)) Declaration(AnnotationProperty(sh:prefix)) AnnotationAssertion(rdfs:label oboInOwl:hasOBOFormatVersion "has_obo_format_version") AnnotationAssertion(oboInOwl:id "test:002") AnnotationAssertion(rdfs:label "another test term") AnnotationAssertion(sh:namespace _:genid2147483648 "http://example.org/bar") AnnotationAssertion(sh:prefix _:genid2147483648 "ba") ) ```

with robot convert -i test1.ofn -i test2.ofn -o merged.ofn

merged.ofn ``` Prefix(:=) Prefix(sh:=) Prefix(owl:=) Prefix(rdf:=) Prefix(xml:=) Prefix(xsd:=) Prefix(rdfs:=) Prefix(oboInOwl:=) Ontology( Annotation(oboInOwl:hasOBOFormatVersion "1.2") Annotation(sh:declare _:genid2147483648) Declaration(Class()) Declaration(Class()) Declaration(AnnotationProperty(oboInOwl:hasOBOFormatVersion)) Declaration(AnnotationProperty(oboInOwl:id)) Declaration(AnnotationProperty(rdfs:label)) Declaration(AnnotationProperty(sh:declare)) Declaration(AnnotationProperty(sh:namespace)) Declaration(AnnotationProperty(sh:prefix)) AnnotationAssertion(rdfs:label oboInOwl:hasOBOFormatVersion "has_obo_format_version") AnnotationAssertion(oboInOwl:id "test:001") AnnotationAssertion(rdfs:label "test term") AnnotationAssertion(oboInOwl:id "test:002") AnnotationAssertion(rdfs:label "another test term") AnnotationAssertion(sh:namespace _:genid2147483648 "http://example.org/foo") AnnotationAssertion(sh:prefix _:genid2147483648 "fo") AnnotationAssertion(sh:namespace _:genid2147483649 "http://example.org/bar") AnnotationAssertion(sh:prefix _:genid2147483649 "ba") ) ```

Tested with ROBOT 1.8.4 and 1.9.0 (identical results).

cmungall commented 1 year ago

@gouttegd - apologies, I didn't see the two bnodes differed by one digit, I thought that the two bnodes from the different graphs were merged into one, which would have been a mistake, had it actually happened.

I don't see any major problems here

one thing that might be annoying is that if I merge N ontologies that all have:

[ sh:prefix "ex"; sh:namespace "https://example.org/"]

then I will get

[ sh:prefix "ex"; sh:namespace "https://example.org/"]
[ sh:prefix "ex"; sh:namespace "https://example.org/"]
...
[ sh:prefix "ex"; sh:namespace "https://example.org/"]

Things would get silly if these were included in extracts as we would end up with exponential growth of parasitic duplicative declarations but I don't think these would be included with extract by default

this will be fixed by a laundry cycle through an id-space aware obo serializer. We can imagine having some lightweight robot commands for removing redundant bnodes, detecting bijectivity conflicts etc

ignazio1977 commented 1 year ago

Very interesting and long discussion.

Re merging of blank nodes, parsing two ontologies where the blank nodes have the same ids doesn't cause the in memory blank nodes to clash - blank node ids are remapped during parsing. The node on file will match the triples of the node in memory but the id will be recomputed. This is precisely to avoid clashes (think for example during import resolution; same concept during merges).

The behaviour is optional and can be disabled - removing remapping can be useful in some situations, but it brings with it the risk of clashes and is not compliant with the specs, so by default remapping is on.

ignazio1977 commented 1 year ago

@gouttegd @cmungall I'm trying to figure out if you believe there are remaining issues or if the pull request is complete and can be merged

gouttegd commented 1 year ago

@ignazio1977 It depends on how we answer @balhoff ’s question above: “How much do we want to try to handle roundtrip with these changes?”

For now, we do not try to “handle roundtrip”. As a result, if an ontology in OBO format has a prefix declaration in its header:

idspace: HGNC http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=

and uses IDs with this declared HGNC prefix:

[Term]
id: HGNC:100

then this is handled correctly when parsing from OBO to OWL, but not in the other way around, because for now the OBO writer doesn’t know how to “CURIfy” the expanded IRI back to its original prefixed form.

We already agreed that the solution to that is to save the prefix mappings in the ontology as SHACL declarations, and I have some code to do that in another branch, but it’s not ready to merge yet.

balhoff commented 1 year ago

@gouttegd what's the status of your SHACL prefix mappings code? Want to merge it into here?

gouttegd commented 1 year ago

@balhoff Sorry, I haven’t found any time to work on it since my last message. My code is still nowhere near ready to be merged.

balhoff commented 1 year ago

FYI, I've rebased this PR onto the version4 branch and am adding roundtrip support by making the OBODocumentFormat implement PrefixDocumentFormat. I'll hopefully open a new PR soon.

balhoff commented 1 year ago

This PR is superseded by #1102.