pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data

RDFXML parser fails on xsd entities #98

Closed phillord closed 7 months ago

phillord commented 3 years ago

I am trying to parse this file

http://www.drugtargetontology.org/dto/dto_vocabulary_gpcr_protein.owl

And getting an error!

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: SourceError(RdfXmlError { kind: Xml(EscapeError(UnrecognizedSymbol(1..4, Ok("xsd")))) })', src/io/rdf/reader.rs:323:56

The problem appears to be the use of an xsd entity.

<rdfs:label rdf:datatype="&xsd;string">HTR1A gene</rdfs:label>

The entity appears to be defined correctly. Is this expected?

pchampin commented 3 years ago

This seems to be a problem with quick-xml :-( I raised an issue about it: tafia/quick-xml#258. @Tpt, since this affects rio_xml, you might be interested in following that issue.

Thanks @phillord for spotting this.

Tpt commented 3 years ago

I believe that quick-xml does not parse doctypes and so does not extract entities. It is definitely possible to add doctype entities support to rio_xml but it would require quite a lot of work.

@phillord The quickest way forward for you seems to be to replace the file's entities with XML namespaces.

pchampin commented 3 years ago

@Tpt

> The quickest way forward for you seems to be to replace the file's entities with XML namespaces.

It is not so simple... The trick of using namespace-like entities is very common in RDF/XML, because there are a lot of places where IRIs have to be provided in text rather than in tag or attribute names... The rdf:datatype in @phillord's example illustrates that...

I don't know whether quick-xml allows adding custom entities to a Reader. If so, a quick and dirty workaround could be to preemptively add common namespaces as entities... But I don't see any such feature in the documentation ;-(

Tpt commented 3 years ago

> I don't know whether quick-xml allows adding custom entities to a Reader.

I don't think it's possible. quick-xml only supports the default entities for special character escaping (&lt;, &gt;...). It would be great to add this feature to quick-xml, or to write a small library that encodes/decodes entities and parses the doctype.
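
To make the "small library" idea concrete, here is a minimal std-only sketch of the decoding half: it expands `&name;` references using the five predefined XML entities plus a caller-supplied table. The function name `resolve_entities` and its error type are assumptions made for illustration; this is not quick-xml's or rio_xml's API, and numeric character references and nested entity values are deliberately left out.

```rust
use std::collections::HashMap;

/// Expand `&name;` references in `text`.
///
/// Hypothetical helper, not part of quick-xml or rio_xml.
/// Knows the five predefined XML entities and whatever is in `custom`.
/// Numeric references (`&#...;`) are not handled, and entity values are
/// inserted as-is (no recursive expansion).
pub fn resolve_entities(
    text: &str,
    custom: &HashMap<String, String>,
) -> Result<String, String> {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(amp) = rest.find('&') {
        // Copy everything before the '&' unchanged.
        out.push_str(&rest[..amp]);
        let after = &rest[amp + 1..];
        let semi = after
            .find(';')
            .ok_or_else(|| String::from("unterminated entity reference"))?;
        let name = &after[..semi];
        let value = match name {
            "lt" => "<",
            "gt" => ">",
            "amp" => "&",
            "apos" => "'",
            "quot" => "\"",
            other => custom
                .get(other)
                .map(String::as_str)
                .ok_or_else(|| format!("unrecognized entity: {}", other))?,
        };
        out.push_str(value);
        // Continue after the ';'.
        rest = &after[semi + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

With `xsd` mapped to `http://www.w3.org/2001/XMLSchema#` in `custom`, the attribute value `&xsd;string` from the original report would expand to the full datatype IRI; without that mapping the function returns an error, which mirrors the failure reported above.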

phillord commented 3 years ago

@Tpt It's not my file, so I can't replace anything in it. Either it works or it doesn't.

Unfortunately, neither quick-xml nor xml-rs appears to support doctype entity definitions. Understandably so, I guess, as doctype declarations seem a bit of a hangover from the past. But they are still being used, as this example, albeit a slightly old one, shows.

Tpt commented 3 years ago

> Unfortunately, neither quick-xml nor xml-rs appears to support doctype entity definitions.

Yes, it would be great to have an XML library compatible with doctypes. This would indeed allow better compatibility with old files.

pchampin commented 3 years ago

FTR, I submitted a PR to quick-xml (https://github.com/tafia/quick-xml/pull/261) which I believe hits a sweet spot:

However, to solve this issue, we don't need to parse the DOCTYPE entirely. A naive extraction of internal entity definitions should cover 99.99% of RDF/XML files. I have a POC example program in my PR:

https://github.com/pchampin/quick-xml/blob/custom-entities/examples/custom_entities.rs
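
For readers who do not want to follow the link, the "naive extraction" idea can be sketched as follows. This is not the code from the PR: it is a std-only illustration, and the function name `extract_entities` is an assumption. It scans the raw DOCTYPE text for `<!ENTITY name "value">` declarations and skips anything else (parameter entities, external entities, malformed declarations).

```rust
use std::collections::HashMap;

/// Naively collect internal entity definitions from the raw text of a
/// DOCTYPE declaration.
///
/// Hypothetical helper for illustration only. Parameter entities
/// (`<!ENTITY % ...>`), external entities (SYSTEM/PUBLIC) and malformed
/// declarations are silently skipped. A '>' inside a quoted value is not
/// supported: such a declaration is skipped as well.
pub fn extract_entities(doctype: &str) -> HashMap<String, String> {
    const START: &str = "<!ENTITY";
    let mut entities = HashMap::new();
    let mut rest = doctype;
    while let Some(pos) = rest.find(START) {
        rest = &rest[pos + START.len()..];
        // The declaration body runs up to the next '>'.
        let end = match rest.find('>') {
            Some(end) => end,
            None => break,
        };
        let decl = &rest[..end];
        rest = &rest[end + 1..];
        // `<!ENTITY` must be followed by whitespace.
        if !decl.starts_with(char::is_whitespace) {
            continue;
        }
        // Split into the entity name and everything after it.
        let mut parts = decl.trim().splitn(2, char::is_whitespace);
        let (name, value) = match (parts.next(), parts.next()) {
            (Some(name), Some(value)) => (name, value.trim()),
            _ => continue,
        };
        // Only quoted literal values are accepted.
        let quote = match value.chars().next() {
            Some(q) if q == '"' || q == '\'' => q,
            _ => continue,
        };
        if value.len() >= 2 && value.ends_with(quote) {
            let inner = &value[1..value.len() - 1];
            entities.insert(name.to_string(), inner.to_string());
        }
    }
    entities
}
```

The resulting map can then serve as a lookup table when unescaping attribute values and text nodes, which is all that a value such as `rdf:datatype="&xsd;string"` requires.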

Tpt commented 3 years ago

I have just added entities support to Rio: https://github.com/oxigraph/rio/commit/bb81f95d5cdf6dfcd278d92a2d51bf154a166fb5

pchampin commented 3 years ago

@phillord I just pushed the branch tmp-xml-entities, which depends on @Tpt's github version of rio. This adds support for entities in RDF/XML into Sophia. I will not merge this branch as is, but rather wait for a stable version of Rio. However, in the meantime, you can have your own code depend on this branch.

pchampin commented 7 months ago

The improvements to quick-xml and rio are now part of Sophia (and have been for a long time, actually).