owlcollab / oboformat

Automatically exported from code.google.com/p/oboformat
OBOFormat parser: xrefs create rdfs:label annotations #89

Open leechuck opened 9 years ago

leechuck commented 9 years ago

This is a copy of https://github.com/owlcs/owlapi/issues/415; I think the issue may be better resolved here and then merged into owlapi.

For some ontologies in OBO format, xref clauses in the OBO file produce rdfs:label annotations on the class, with the quoted part of the xref as the value.

Example code (using OWLAPI 4.1.0-RC2, but also present in earlier 4.X versions):

@Grapes([
      @Grab(group='org.semanticweb.elk', module='elk-owlapi', version='0.4.2'),
      @Grab(group='net.sourceforge.owlapi', module='owlapi-api', version='4.1.0-RC2'),
      @Grab(group='net.sourceforge.owlapi', module='owlapi-apibinding', version='4.1.0-RC2'),
      @Grab(group='net.sourceforge.owlapi', module='owlapi-impl', version='4.1.0-RC2')
    ])

import org.semanticweb.elk.owlapi.ElkReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.reasoner.*
import org.semanticweb.owlapi.vocab.OWLRDFVocabulary;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.io.*;
import org.semanticweb.owlapi.owllink.*;
import org.semanticweb.owlapi.util.*;
import org.semanticweb.owlapi.search.*;

// Data factory and manager used to load and query the ontology
OWLDataFactory df = OWLManager.getOWLDataFactory()
OWLOntologyManager lManager = OWLManager.createOWLOntologyManager()
// The configuration setters return a configuration object, so reassign the result
OWLOntologyLoaderConfiguration config = new OWLOntologyLoaderConfiguration()
config = config.setFollowRedirects(true)
config = config.setMissingImportHandlingStrategy(MissingImportHandlingStrategy.SILENT)
def fSource = new FileDocumentSource(new File('chebi.obo'))
def ontology = lManager.loadOntologyFromOntologyDocument(fSource, config)

def cl = df.getOWLClass(IRI.create("http://purl.obolibrary.org/obo/CHEBI_82978"))
EntitySearcher.getAnnotations(cl, ontology, df.getRDFSLabel()).each { annotation -> // OWLAnnotation
  println annotation
}

The desired output is:

Annotation(rdfs:label "paliperidone"^^xsd:string)

The output I receive is:

Annotation(rdfs:label "CAS Registry Number"^^xsd:string)
Annotation(rdfs:label "PubMed citation"^^xsd:string)
Annotation(rdfs:label "Reaxys Registry Number"^^xsd:string)
Annotation(rdfs:label "paliperidone"^^xsd:string)
Annotation(rdfs:label "HMDB"^^xsd:string)
Annotation(rdfs:label "Wikipedia"^^xsd:string)
Annotation(rdfs:label "DrugBank"^^xsd:string)
Annotation(rdfs:label "KEGG DRUG"^^xsd:string)

The annotations are generated from the following statements in the OBO file:

xref: CiteXplore:25147315 "PubMed citation"
xref: CiteXplore:24962437 "PubMed citation"
xref: CiteXplore:24597755 "PubMed citation"
xref: KEGG DRUG:144598-75-4 "CAS Registry Number"
xref: CiteXplore:24289141 "PubMed citation"
xref: CiteXplore:24324228 "PubMed citation"
xref: CiteXplore:25377151 "PubMed citation"
xref: CiteXplore:24690136 "PubMed citation"
xref: KEGG DRUG:D05339 "KEGG DRUG"
xref: CiteXplore:24556260 "PubMed citation"
xref: CiteXplore:24462396 "PubMed citation"
xref: CiteXplore:24400982 "PubMed citation"
xref: Reaxys:8808385 "Reaxys Registry Number"
xref: CiteXplore:24829608 "PubMed citation"
xref: CiteXplore:25085446 "PubMed citation"
xref: CiteXplore:24491033 "PubMed citation"
xref: DrugBank:DB01267 "DrugBank"
xref: CiteXplore:24346811 "PubMed citation"
xref: Wikipedia:Paliperidone "Wikipedia"
xref: CiteXplore:23428785 "PubMed citation"
xref: HMDB:HMDB15396 "HMDB"
xref: CiteXplore:24752928 "PubMed citation"
xref: ChemIDplus:144598-75-4 "CAS Registry Number"

That is, each rdfs:label is created from the quoted description that follows the xref ID.
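
To make the two parts of an xref clause explicit, here is a minimal sketch in plain Java that splits a clause into its ID and its optional quoted description. The class XrefClause and its regular expression are hypothetical and are not taken from the OWLAPI parser; trailing modifiers and ! comments are deliberately not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of OWLAPI: splits a simple xref clause into ID and description. */
final class XrefClause {
    // "xref: <id>" optionally followed by a double-quoted description.
    // Simplification: trailing modifiers ({...}) and ! comments are not handled.
    private static final Pattern XREF =
        Pattern.compile("^xref:\\s*(.+?)\\s*(?:\"((?:[^\"\\\\]|\\\\.)*)\")?\\s*$");

    final String id;
    final String description; // null when the clause has no quoted description

    private XrefClause(String id, String description) {
        this.id = id;
        this.description = description;
    }

    static XrefClause parse(String line) {
        Matcher m = XREF.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an xref clause: " + line);
        }
        return new XrefClause(m.group(1), m.group(2));
    }

    public static void main(String[] args) {
        XrefClause x = parse("xref: KEGG DRUG:D05339 \"KEGG DRUG\"");
        System.out.println(x.id + " | " + x.description);
    }
}
```

In the CHEBI lines above, the description only repeats the name of the source database, which is why the extra labels look like "KEGG DRUG" or "PubMed citation".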

The problem seems to be in line 1210 of Obo2Owl.java (https://github.com/owlcollab/oboformat/blob/gh-pages/src/main/java/org/obolibrary/obo2owl/Obo2Owl.java#L1210, or https://github.com/owlcs/owlapi/blob/b2401a615a1151244a48b5c0ce88a5a98995a678/oboformat/src/main/java/org/obolibrary/obo2owl/OWLAPIObo2Owl.java#L1392 in OWLAPI) where an rdfs:label is created based on xref values. This breaks identifying the label for several ontologies in OBO format.

cmungall commented 9 years ago

Thanks for the report. The OWLAPI is the correct place to make any fix to the parser (the source in this repo is legacy). However, this is the correct tracker to use for clarifications in the spec, and a clarification is indeed required here.

The BNF allows for quoted strings after an xref, but there is no translation of this to OWL specified. The guide is more expansive, but doesn't really provide any semantics. It does note that this feature is discouraged.

There are 3 possibilities here:

  1. The quoted strings are treated like ! comments and are silently discarded
  2. The quoted strings are treated as rdfs:labels for the entities referenced by the dbxref
  3. The quoted strings are treated as axiom annotations on the dbxref annotation

Note that some ontologies like to retain their dbxref strings, and are reliant on the owlapi for roundtripping, so would be displeased with 1.
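
The three possibilities can be sketched as the OWL functional-syntax axioms each would emit for a single xref. This is an illustration in plain Java, not the OWLAPI implementation. The class, enum, and parameter names are hypothetical, and so is xrefIri: the thread does not say how an xref ID would be mapped to an IRI for option 2, so the caller has to supply one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the three candidate translations; not OWLAPI code. */
final class XrefTranslationSketch {
    static final String HAS_DBXREF =
        "<http://www.geneontology.org/formats/oboInOwl#hasDbXref>";

    enum Option { DISCARD, LABEL_ON_REFERENCED_ENTITY, AXIOM_ANNOTATION }

    static List<String> translate(Option option, String subjectIri, String xrefId,
                                  String description, String xrefIri) {
        String subject = "<" + subjectIri + ">";
        String xref = "\"" + xrefId + "\"^^xsd:string";
        String text = "\"" + description + "\"^^xsd:string";
        List<String> axioms = new ArrayList<>();
        switch (option) {
            case DISCARD:
                // (1) keep the xref, drop the quoted string
                axioms.add("AnnotationAssertion(" + HAS_DBXREF + " " + subject + " " + xref + ")");
                break;
            case LABEL_ON_REFERENCED_ENTITY:
                // (2) keep the xref and label the referenced entity (xrefIri is an assumption)
                axioms.add("AnnotationAssertion(" + HAS_DBXREF + " " + subject + " " + xref + ")");
                axioms.add("AnnotationAssertion(rdfs:label <" + xrefIri + "> " + text + ")");
                break;
            case AXIOM_ANNOTATION:
                // (3) attach the quoted string to the xref assertion itself
                axioms.add("AnnotationAssertion(Annotation(rdfs:label " + text + ") "
                    + HAS_DBXREF + " " + subject + " " + xref + ")");
                break;
        }
        return axioms;
    }

    public static void main(String[] args) {
        for (Option o : Option.values()) {
            System.out.println(o + ": " + translate(o,
                "http://purl.obolibrary.org/obo/CHEBI_82978",
                "CiteXplore:25147315", "PubMed citation",
                "http://example.org/xref/CiteXplore_25147315")); // placeholder IRI
        }
    }
}
```

Only option 3 keeps the string scoped to the single xref assertion, so it never shows up as a label of any entity.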

The assumption we implicitly made in the spec was that the strings would be used as labels, and thus 2 is valid. This indeed works for the two main ontologies reliant on this that I am aware of, GO and HP.

However, CHEBI violates this unwritten assumption. Translation 3 would be safe here, as people could put whatever they like and the string would always be scoped to that particular assertion. This would work fine for GO and HP too.

So there is a strong argument for (3). However, there are some things to be aware of. Because a dbxref can appear as provenance in a def or synonym annotation, we have potentially depth=2 of axiom annotations. We would like to do tests on all OWL serializations and parsers on this since AFAICT this has never been used in the wild before.

Alternatively, if we adopt (2) as official then CHEBI will have to be changed.

Comments? @dosumis?

Either way, I will contact CHEBI and see if they are open to modifying their obo export and simply drop the strings. I don't think they serve any purpose to anyone, as they just repeat the ID space of the xref.

cmungall commented 9 years ago

https://sourceforge.net/p/chebi/curator-requests/2448/

ignazio1977 commented 9 years ago

I've reproduced the issue and found that the following axioms are parsed:

AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:25147315"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:24962437"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:24597755"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "CAS Registry Number"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "KEGG DRUG:144598-75-4"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:24289141"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:24324228"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:25377151"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:24690136"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "KEGG DRUG"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "KEGG DRUG:D05339"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:24556260"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:24462396"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:24400982"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "Reaxys Registry Number"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "Reaxys:8808385"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:24829608"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:25085446"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:24491033"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "DrugBank"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "DrugBank:DB01267"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:24346811"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "Wikipedia"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "Wikipedia:Paliperidone"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:23428785"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "HMDB"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "HMDB:HMDB15396"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "PubMed citation"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "CiteXplore:24752928"^^xsd:string)
AnnotationAssertion(Annotation(rdfs:label "CAS Registry Number"^^xsd:string) <http://www.geneontology.org/formats/oboInOwl#hasDbXref> <http://purl.obolibrary.org/obo/CHEBI_82978> "ChemIDplus:144598-75-4"^^xsd:string)

Annotation(rdfs:label "CAS Registry Number"^^xsd:string)
Annotation(rdfs:label "PubMed citation"^^xsd:string)
Annotation(rdfs:label "Reaxys Registry Number"^^xsd:string)
Annotation(rdfs:label "paliperidone"^^xsd:string)
Annotation(rdfs:label "HMDB"^^xsd:string)
Annotation(rdfs:label "Wikipedia"^^xsd:string)
Annotation(rdfs:label "DrugBank"^^xsd:string)
Annotation(rdfs:label "KEGG DRUG"^^xsd:string)

The desired output, i.e., paliperidone, is generated from a name tag rather than a dbxref, as far as I understand. So the current parser provides solution (2) already, if I understand it properly - the issue is that it is not possible to distinguish easily where the annotations came from.

It would be fairly straightforward to turn the annotations into annotations on the axioms being produced, rather than added as annotations on the entity - this would not bring them up when looking for annotations on the entity.

I'm not clear if this is a regression in the parser or an enhancement - from your description of the spec, the current behaviour is not incorrect. I'm mentioning this because of a separate conversation with @sesuncedu where a regression in OBO parsing between OWLAPI 3 and OWLAPI 4 was mentioned, and I wonder whether this is it.

cmungall commented 9 years ago

OK, first I need to retract what I said lest I cause (more) confusion. The behavior in 3.5.X was (3): the quoted strings are treated as axiom annotations on the dbxref annotation. This is what you are observing, Ignazio (using owlapi 3.5.x or 4.x?). On reflection, I think this is in fact fine and good behavior, and we should document it as standard in the spec. (Current behavior seems to drop dbxref descriptions when used in a definition context, and I think this is actually good too, as it prevents the awful double nesting of axiom annotations.)

So it seems maybe something reverted in going to owlapi 4, although I'm not sure I understand Rob's program... I suggest I spec/clarify the desired behavior in a branch first.

ignazio1977 commented 9 years ago

The output I've shown is from 4.1 RC3. I'll run the same code with 3.5.2 to see if there is any difference.

leechuck commented 9 years ago

IMHO, the problem is that the annotations are returned as rdfs:labels of the class http://purl.obolibrary.org/obo/CHEBI_82978 when querying for the class' labels (with EntitySearcher.getAnnotations(cl, ontology, df.getRDFSLabel())). Solution (3) would be the behavior I expect; maybe I do not understand correctly what EntitySearcher.getAnnotations(cl, ontology, df.getRDFSLabel()) does, but I don't think it would/should return the annotations of the annotation axioms (only the value of the name: tag should be returned as rdfs:label).

ignazio1977 commented 9 years ago

I've checked with 3.5.2, and the behaviour when looking for annotations only on the entity is as expected. (3) is implemented in both 3.5.2 and 4.x; the issue here is that EntitySearcher selects the annotation assertion axioms for the entity and collects both the annotation that is the object of each axiom and the annotations on the axiom itself. This is useful behavior in some cases but not all, and not in this one. I'll add a separate method to collect only the annotation object of the assertion axioms.
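
The difference between the two collection strategies can be shown with a small stand-alone model. This is a sketch in plain Java, not OWLAPI code: the Annotation and Assertion classes are simplified stand-ins for the OWLAPI types, and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Simplified model of the behaviour described in the thread; not OWLAPI code. */
final class LabelCollectionSketch {
    /** Stand-in for an OWL annotation: a property name and a literal value. */
    static final class Annotation {
        final String property;
        final String value;
        Annotation(String property, String value) {
            this.property = property;
            this.value = value;
        }
    }

    /** Stand-in for an annotation assertion axiom about one entity. */
    static final class Assertion {
        final Annotation object;        // the annotation asserted about the entity
        final List<Annotation> onAxiom; // annotations on the axiom itself
        Assertion(Annotation object, List<Annotation> onAxiom) {
            this.object = object;
            this.onAxiom = onAxiom;
        }
    }

    /** Collects asserted annotations and annotations on the axioms (the reported behaviour). */
    static List<String> includingAxiomAnnotations(List<Assertion> axioms, String property) {
        List<String> values = new ArrayList<>();
        for (Assertion ax : axioms) {
            if (ax.object.property.equals(property)) values.add(ax.object.value);
            for (Annotation a : ax.onAxiom) {
                if (a.property.equals(property)) values.add(a.value);
            }
        }
        return values;
    }

    /** Collects only the asserted annotations (the proposed separate method). */
    static List<String> assertedOnly(List<Assertion> axioms, String property) {
        List<String> values = new ArrayList<>();
        for (Assertion ax : axioms) {
            if (ax.object.property.equals(property)) values.add(ax.object.value);
        }
        return values;
    }

    public static void main(String[] args) {
        List<Assertion> axioms = Arrays.asList(
            new Assertion(new Annotation("rdfs:label", "paliperidone"),
                Collections.<Annotation>emptyList()),
            new Assertion(new Annotation("oboInOwl:hasDbXref", "DrugBank:DB01267"),
                Arrays.asList(new Annotation("rdfs:label", "DrugBank"))));
        System.out.println(includingAxiomAnnotations(axioms, "rdfs:label"));
        System.out.println(assertedOnly(axioms, "rdfs:label"));
    }
}
```

With the two axioms in main, the first strategy returns both "paliperidone" and "DrugBank", while the second returns only "paliperidone", which matches the output the original report asked for.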