Skip invalid lines when converting out of OBO

cthoyt commented 2 years ago

The Cellosaurus ontology contains many invalid lines, e.g. the following line has improperly escaped curly braces in the molecule's name:

comment: "Group: Patented cell line. Registration: International Depositary Authority, China Center for Type Culture Collection; CCTCC C2014222. Monoclonal antibody isotype: IgG1, kappa. Monoclonal antibody target: ChEBI; CHEBI:144925; 1-(4-methoxyphenyl)-2-{[4-(4-nitrophenyl)butan-2-yl]amino}ethanol (Phenylethylamine A)."

If you run robot convert -I https://ftp.expasy.org/databases/cellosaurus/cellosaurus.obo -o ~/Desktop/cellosaurus.json -vvv and look very carefully for the relevant error, you find the output below. For now, you have to search the output for org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser to locate it; #1038 would be helpful for this.

LINENO: 29219 - Missing '=' in trailing qualifier block. This might happen for not properly escaped '{', '}' chars in comments.
LINE: comment: "Monoclonal antibody isotype: IgG2a, kappa. Monoclonal antibody target: ChEBI; CHEBI:144925; Phenylethylamine A (1-(4-methoxyphenyl)-2-{[4-(4-nitrophenyl)butan-2-yl]amino}ethanol)."
        org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:60)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyFactoryImpl.loadOWLOntology(OWLOntologyFactoryImpl.java:220)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.actualParse(OWLOntologyManagerImpl.java:1254)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1208)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntologyFromOntologyDocument(OWLOntologyManagerImpl.java:1165)
        org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:531)
        org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:417)
        org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:298)
        org.obolibrary.robot.CommandLineHelper.getInputOntology(CommandLineHelper.java:487)
        org.obolibrary.robot.CommandLineHelper.updateInputOntology(CommandLineHelper.java:585)
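
Searching the verbose output by hand is tedious, so here is a minimal sketch of pulling the reported line numbers out of a saved log. It assumes the log contains entries of the form "LINENO: <number> - <message>" exactly as in the excerpt above; the function name find_parse_errors is hypothetical and not part of ROBOT or the OWL API.

```python
import re

# Matches entries like "LINENO: 29219 - Missing '=' in trailing qualifier block."
# Assumption: the log format matches the excerpt shown in this issue.
LINENO_PATTERN = re.compile(r"LINENO: (\d+) - (.*)")


def find_parse_errors(log_text):
    """Return a list of (line_number, message) pairs found in the log text."""
    errors = []
    for line in log_text.splitlines():
        match = LINENO_PATTERN.search(line)
        if match:
            errors.append((int(match.group(1)), match.group(2).strip()))
    return errors
```

The log text could come from redirecting the output of the robot convert command to a file and reading that file back in.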

This ontology isn't curated in an open source way, so it's difficult to report the problem and help get it fixed upstream. I also downloaded the file and started fixing lines one at a time, but I have to re-run robot convert after every fix. It would be nice to have a setting that skips invalid lines during OBO parsing.

CC @AmosBairoch @lubianat

Update: this is the same underlying issue as https://github.com/ebi-chebi/ChEBI/issues/4273

matentzn commented 2 years ago

Hmm... I think this is outside the scope of ROBOT. If you want this to happen, you have to go through https://github.com/owlcs/owlapi/issues/ or join the #obo-format channel on OBO Slack, where @balhoff is currently thinking about prefix maps for OBO format and other fixes; he may be amenable to this. But I don't think this is a ROBOT issue per se: if the raw data is broken, the tool can't be expected to deal with all eventualities, so I would simply run a grep -v on the OBO file prior to parsing. If you agree, can you close the issue?
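
As a rough illustration of the grep -v style pre-filtering suggested above, the following Python sketch drops comment lines that contain a curly brace not preceded by a backslash. This is a heuristic under stated assumptions, not the parser's actual rule: it only looks at lines starting with "comment:", it would also drop a comment line that carries a legitimate trailing qualifier block, and the names is_suspect and drop_suspect_lines are hypothetical.

```python
import re

# A '{' or '}' that is not immediately preceded by a backslash.
# Assumption: a backslash-escaped brace is what the parser accepts.
UNESCAPED_BRACE = re.compile(r"(?<!\\)[{}]")


def is_suspect(line):
    """Return True for comment lines containing an unescaped curly brace."""
    return line.startswith("comment:") and UNESCAPED_BRACE.search(line) is not None


def drop_suspect_lines(lines):
    """Return the input lines without the suspect comment lines."""
    return [line for line in lines if not is_suspect(line)]
```

Applied to the lines of a downloaded copy of the OBO file, the result can be written to a new file and passed to robot convert. Dropping lines loses the comment text, so escaping the braces instead may be preferable when the comments matter.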

balhoff commented 2 years ago

This exact issue is a problem with the currently released ChEBI OBO file: https://github.com/ebi-chebi/ChEBI/issues/4273

matentzn commented 1 year ago

Rethinking this now: I could implement a "repair --obo-format" option that deals with the most frequent violations, like multiple labels and multiple comments. I would be open to this, but it would have to be now!

matentzn commented 1 year ago

Sorry, I now realise I already discuss this here: https://github.com/ontodev/robot/issues/995 and that this (skipping broken rows) is not possible at all right now without a major OWLAPI update.

This needs to be filed either as an OWL API ticket or with oboformat: https://github.com/owlcollab/oboformat/issues

I will close this now, as what ROBOT can do about this can be covered by #995

ontodev / robot

Skip invalid lines when converting out of OBO #1039