monarch-initiative / monarch-ontology

Top level monarch importer ontology

parsing monarch ontology with fastobo fails #39

Open sierra-moxon opened 2 years ago

sierra-moxon commented 2 years ago
    def parse_from(self, handle, threads=None):
        # Load the OBO graph into a syntax tree using fastobo
>       doc = fastobo.load_graph(handle).compact_ids()
E         File "<stdin>", line 1
E           https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:9462
E                                                              ^
E       SyntaxError: remaining input

Is it possible to update the ID here so it doesn't contain that `#!`? From @cmungall: could be that this needs to be fixed upstream in prefixcommons.

Ran into this while using oaklib to parse and walk this ontology, in hopes of better assigning biolink-model categories in KGX and for monarch-ingest. In parallel, I'll try another parsing implementation in oaklib.
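For reference, the xref value in the traceback is a syntactically well-formed IRI; the fragment just happens to start with `!`. A minimal stdlib check (not using fastobo) confirms it splits cleanly:

```python
from urllib.parse import urlsplit

# The value that trips fastobo; note the fragment begins with "!".
iri = "https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:9462"

parts = urlsplit(iri)
print(parts.scheme)    # https
print(parts.netloc)    # www.genenames.org
print(parts.fragment)  # !/hgnc_id/HGNC:9462
```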

matentzn commented 2 years ago

I think it is not worth trying to fix monarch.owl. We need to make a concerted push to PHENIO instead. But this issue will still have to be dealt with!

In the meantime: best not to (ever) use monarch.owl in OBO format. Use obographs.

sierra-moxon commented 2 years ago

I am actually getting this error from this file (in obo json format): https://ci.monarchinitiative.org/job/monarch-ontology-json-sri/lastSuccessfulBuild/artifact/build/monarch-ontology-final.json

matentzn commented 2 years ago

That is then IMO a bug with fastobo. This is a valid IRI; it's just not a valid OBO ID! @althonos, do you agree?

althonos commented 2 years ago

Yes, it looks like a bug in the grammar; the IRI is definitely valid. When forced to parse an IRI, the current PEG grammar will not parse the fragment part even though it should. The `!` is always a problem in OBO because it's used as a comment character, but it is valid in an IRI context...
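To illustrate the ambiguity (a toy example, not fastobo's actual code): in OBO flat files `!` starts an end-of-line comment, so a naive comment stripper truncates any IRI that contains one:

```python
# A naive OBO comment stripper: everything after the first "!" is dropped.
line = "xref: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:9462"
stripped = line.split("!", 1)[0].rstrip()
print(stripped)  # xref: https://www.genenames.org/data/gene-symbol-report/#
```

The IRI is silently cut off at the fragment, which is why the `!` has to be handled context-sensitively.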

In any case though, the fastobo-graphs implementation shouldn't attempt to parse IRIs as OBO IDs because they are not going to be in escaped form in the JSON I suppose. I'll have a look when I can (currently in the final stretch of an unrelated manuscript)!
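A possible shape for that fix (purely a hypothetical sketch, not fastobo's implementation): before attempting OBO-ID parsing, check whether the value already carries an RFC 3986 scheme and, if so, leave it alone as a full IRI:

```python
import re

# Hypothetical heuristic: values starting with "<scheme>://" are full IRIs
# and should bypass OBO-ID parsing; compact IDs like "HGNC:9462" go through it.
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")

def is_full_iri(value: str) -> bool:
    return bool(SCHEME_RE.match(value))

assert is_full_iri("https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:9462")
assert not is_full_iri("HGNC:9462")
```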