cannot parse dicom_ontology.owl

incf-nidash / dicom-ontology

This repository contains the DICOM ontology used by the INCF-NIDASH NIDM-Experiment project.


cannot parse dicom_ontology.owl #14

Open tgbugs opened 6 years ago

tgbugs commented 6 years ago

@khelm I was trying to parse dicom_ontology.owl with my usual suite of ttl parsers (rapper, among others, gives pretty good debug info) and noticed that there are multiple cases where the descriptions are malformed. I think all that is needed is an additional cleaning step with proper escaping rules after https://github.com/incf-nidash/dicom-ontology/blob/e3d7e4757c01baf2b07e758503a9363bbfde4522/create_dicom_ttl.0.4.py#L336 (I don't have access to the source data files or I would try it myself). Running each description through rdflib.Literal or json.dumps may be sufficient.

Some examples of issues.

Strange <200b> char at the end of every definition (I have to open it in vim to see this).
Internal double quotes are not escaped.
The backslash char \ is not escaped so parsers try to interpret things like -1\-1 as an escape sequence and fail.

An incomplete set of fixes for these examples is attached as a patch: ontdiff.txt
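The three issues listed above could be handled in one cleaning step before each description is written out. What follows is only a minimal sketch using the standard library: the function names `clean_description` and `turtle_literal` are hypothetical and do not come from create_dicom_ttl.0.4.py, and it assumes each description arrives as a plain, not yet escaped Python string.

```python
ZERO_WIDTH_SPACE = "\u200b"  # the <200b> character seen at the end of definitions


def clean_description(text: str) -> str:
    """Remove zero-width spaces and surrounding whitespace."""
    return text.replace(ZERO_WIDTH_SPACE, "").strip()


def turtle_literal(text: str) -> str:
    """Return text as a double-quoted Turtle string literal.

    Hypothetical helper: assumes the input has not been escaped already.
    """
    escaped = (
        clean_description(text)
        .replace("\\", "\\\\")  # backslashes first, so later escapes are not doubled
        .replace('"', '\\"')    # internal double quotes
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )
    return f'"{escaped}"'


if __name__ == "__main__":
    # Made-up description containing all three problems from the list above.
    print(turtle_literal('A value of "-1\\-1" means unknown\u200b'))
```

Escaping the backslash before the double quote matters: doing it the other way round would turn the backslash added in front of each quote into a literal backslash.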

khelm commented 6 years ago

Hi @tgbugs - 1) 200b is a zero-width non-printing character. Is that actually causing the ttl parser to crash, or is it just an oddity? It should be easy enough to filter out. 2) The DICOM standard has a couple of issues in the Descriptions and Attribute Names. The first is that some Attribute Names have apostrophes in them ("Physician's Name"), so using double quotes for the string keeps the quote order from getting out of sync. But, as you found, the Descriptions also contain unescaped double quotes and backslashes. I will look into separate code to escape those characters.

Also, I have uploaded a similar python dict file that includes the tag value and the definition/notes text from the DICOM XML docbook. Try running your units-detecting code on that file and see how it goes. I kept the utf-8 encoding so that things like the mu and degrees symbols were still intact. This is not true in the current owl file in which I substituted u's for mu's.

tgbugs commented 6 years ago

I don't think the zero width is what was causing the parsing error (and it is a simple s/<200b>//g fix).
Ok, sounds good. In the meantime I may have a way to automatically fix those issues using obo:IAO_0000115 and ; as delimiters.
Great, I will take a look.
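One way the delimiter idea could look is sketched below. The thread does not record the actual commands, so this is an illustration under stated assumptions: each definition sits on a single line of the form `obo:IAO_0000115 "free text" ;` (or ends with `.`), and nothing inside the quotes has been escaped yet. The names `DEFINITION`, `escape_body`, and `repair_line` are hypothetical.

```python
import re

# Assumed layout (not confirmed by the thread): one definition per line, e.g.
#   obo:IAO_0000115 "free text" ;
# The greedy (.*) runs to the LAST double quote that is followed only by
# ';' or '.', so quotes and semicolons inside the text stay in the body.
DEFINITION = re.compile(r'^(\s*obo:IAO_0000115\s+")(.*)("\s*[;.]\s*)$')


def escape_body(body: str) -> str:
    """Drop zero-width spaces, then escape backslashes and double quotes."""
    body = body.replace("\u200b", "")
    body = body.replace("\\", "\\\\")
    return body.replace('"', '\\"')


def repair_line(line: str) -> str:
    """Re-escape the literal on a definition line; return other lines unchanged."""
    match = DEFINITION.match(line)
    if match is None:
        return line
    prefix, body, suffix = match.groups()
    return prefix + escape_body(body) + suffix


if __name__ == "__main__":
    # Made-up line with internal quotes, a semicolon, and a backslash.
    print(repair_line('    obo:IAO_0000115 "say "hi"; use -1\\-1\u200b" ;'))
```

Running this on a file that is already partly escaped would double the existing backslashes, so it only suits output that is known to be entirely unescaped.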

mick-d commented 5 years ago

Hi, I would like to confirm that I could not parse the OWL file either, after trying several different tools (e.g. OWLGrEd, WebVOWL).

khelm commented 5 years ago

Thanks @mick-d, we're working on it. @tgbugs what is the solution using IAO_0000115? Did you have to do something to get OWL to recognize ";" as a delimiter? Are there any instances of that character in the definitions?

tgbugs commented 5 years ago

I reviewed my .bash_history file to see what I did, and unfortunately it looks like I made all the changes using vim's ex mode (:%s/a/b/) so I don't have a record of what I did. I didn't make any changes to the generating code.