ontodev / robot

ROBOT is an OBO Tool
http://robot.obolibrary.org
BSD 3-Clause "New" or "Revised" License

Escape Chars break Jena when querying #662

Open beckyjackson opened 4 years ago

beckyjackson commented 4 years ago

When querying CMO:

2020-03-27 08:28:28,218 ERROR org.apache.jena.riot - [line: 131, col: 107] Illegal unicode escape sequence value: \: (0x3A)
[line: 131, col: 107] Illegal unicode escape sequence value: \: (0x3A)

It looks like the problem is coming from this xref:

http://purl.obolibrary.org/obo/MedicineNet#_http\://www.medicinenet.com

I'll look into this, just wanted to make an issue to keep track of it.

cmungall commented 4 years ago

Looks like it is coming from this OBO

[Term]
id: CMO:0000026
name: blood hemoglobin level
def: "The amount of hemoglobin in a specific volume of blood, expressed as grams per deciliter of whole blood in humans." [MedicineNet:http\://www.medicinenet.com]
synonym: "blood haemoglobin level" EXACT []
is_a: CMO:0000028 ! blood protein measurement

So, following the obo2owl mapping, the xref is expanded to:

http://purl.obolibrary.org/obo/MedicineNet#_http\://www.medicinenet.com
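To make the expansion concrete, here is a minimal Python sketch of how a prefixed ID with a non-numeric local part could end up as the IRI above. It is an illustration only, not the OWLAPI code: the function names are hypothetical, and the `#_` rule is a simplified reading of the obo2owl mapping that happens to reproduce the IRI quoted in this thread.

```python
from typing import Tuple

OBO_BASE = "http://purl.obolibrary.org/obo/"


def split_prefixed_id(xref: str) -> Tuple[str, str]:
    """Split an OBO ID at the first colon that is not backslash-escaped."""
    i = 0
    while i < len(xref):
        if xref[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if xref[i] == ":":
            return xref[:i], xref[i + 1:]
        i += 1
    raise ValueError(f"no unescaped ':' in {xref!r}")


def expand_noncanonical(xref: str) -> str:
    """Hypothetical helper: build base + prefix + '#_' + local ID.

    The local part is copied verbatim, so the backslash from the
    OBO escape '\\:' survives into the IRI.
    """
    prefix, local = split_prefixed_id(xref)
    return f"{OBO_BASE}{prefix}#_{local}"


xref = r"MedicineNet:http\://www.medicinenet.com"
print(expand_noncanonical(xref))
# http://purl.obolibrary.org/obo/MedicineNet#_http\://www.medicinenet.com
```

The point of the sketch is that nothing in this path un-escapes `\:`, so the OBO-level escape leaks into the RDF-level identifier.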

If this is truly an invalid URL then the OWLAPI should not emit this, or should employ additional escaping. If it is valid then the problem is with Jena. Either way I am pinning this on one of these two libs :-)
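On the validity question: the Turtle and N-Triples grammars do not allow a raw backslash inside `<...>` except as the start of a `\u` or `\U` escape, which would be consistent with the Jena error above. The Python sketch below checks an IRI against that character set and shows percent-encoding as one possible form of "additional escaping". The function names are hypothetical, and this is not what OWLAPI or Jena actually do.

```python
import re
from urllib.parse import quote

# Characters that the Turtle/N-Triples IRIREF production excludes
# when they appear unescaped: controls and space, < > " { } | ^ ` and \
FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def illegal_chars(iri: str) -> list:
    """Return the distinct forbidden characters found in the IRI."""
    return sorted(set(FORBIDDEN.findall(iri)))


def percent_encode_illegal(iri: str) -> str:
    """Percent-encode only the forbidden characters, leaving the rest as-is."""
    return FORBIDDEN.sub(lambda m: quote(m.group(0), safe=""), iri)


iri = r"http://purl.obolibrary.org/obo/MedicineNet#_http\://www.medicinenet.com"
print(illegal_chars(iri))           # ['\\']
print(percent_encode_illegal(iri))
# http://purl.obolibrary.org/obo/MedicineNet#_http%5C://www.medicinenet.com
```

Whether `%5C` is the right target form is a separate design question. Dropping the backslash when the OBO escape is decoded would be another option.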

But regardless, we will need some kind of short-term workaround. One is just to discourage this kind of axiom annotation. I am not a fan of it, and prefer using a plain URL when a URL will do. We could send CMO a patch to do this.
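As a starting point for such a patch, a rough Python sketch like the one below could list the definitions whose xrefs contain an escaped colon. It assumes one tag per line and treats everything after the last double quote on a `def:` line as the xref list, which is a simplification of the OBO syntax. The function name and the sample variable are made up for illustration.

```python
def find_escaped_colon_xrefs(obo_text: str) -> list:
    """Return (term id, line number, xref list) for def xrefs containing '\\:'."""
    hits = []
    current_id = None
    for lineno, line in enumerate(obo_text.splitlines(), start=1):
        if line.startswith("id:"):
            current_id = line[3:].strip()
        elif line.startswith("def:"):
            # Simplification: the xref list is whatever follows the last quote.
            xrefs = line.rpartition('"')[2].strip()
            if "\\:" in xrefs:
                hits.append((current_id, lineno, xrefs))
    return hits


# The stanza quoted earlier in this thread, used as sample input.
SAMPLE = "\n".join([
    "[Term]",
    "id: CMO:0000026",
    "name: blood hemoglobin level",
    r'def: "The amount of hemoglobin in a specific volume of blood, expressed '
    r'as grams per deciliter of whole blood in humans." '
    r"[MedicineNet:http\://www.medicinenet.com]",
    'synonym: "blood haemoglobin level" EXACT []',
    "is_a: CMO:0000028 ! blood protein measurement",
])

for term_id, lineno, xrefs in find_escaped_colon_xrefs(SAMPLE):
    print(f"{term_id} (line {lineno}): {xrefs}")
```

Running this over the full CMO file would give a list of terms to review by hand. It does not rewrite anything.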

And/or: we could treat this as an aspect of http://obofoundry.org/principles/fp-002-format.html, explicitly requiring an owlapi->jena->owlapi roundtrip and counting a failure as a violation, which would help nudge people toward a fix.