Closed phillord closed 7 months ago
This seems to be a problem with quick-xml :-(
I raised an issue about that: tafia/quick-xml#258.
@Tpt, since this affects rio_xml, you might be interested in following that issue.
Thanks @phillord for spotting this.
I believe that quick-xml does not parse doctypes and so does not extract entities. It is definitely possible to add doctype entity support to rio_xml, but it would require quite a lot of work.
@phillord The quickest way to go for you seems to be to replace the file entities with XML namespaces.
@Tpt
> The quickest way to go for you seems to be to replace the file entities with XML namespaces.
It is not so simple... The trick of using namespace-like entities is very common in RDF/XML, because there are a lot of places where IRIs have to be provided in text rather than in tag or attribute names... The rdf:datatype in @phillord's example illustrates that...
I don't know if quick-xml allows adding custom entities to a Reader. If so, a quick and dirty workaround could be to preemptively add common namespaces as entities... But I don't see any such feature in the documentation ;-(
> I don't know if quick-xml allows adding custom entities to a Reader.
I don't think it's possible. quick-xml only supports the default entities for special character escaping (&lt;, &gt;...). It would be great to add this feature as part of quick-xml, or to write a small library that handles entity encoding/decoding and parses the doctype.
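To make the idea of a small entity-decoding helper concrete, here is a minimal sketch using only the Rust standard library. It is not quick-xml's or rio_xml's API: the function name `expand_entities` and its signature are hypothetical, chosen for illustration. It assumes the custom entity definitions are already available as a map, handles the five predefined XML entities, and deliberately ignores numeric character references and nested entity expansion.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of quick-xml or rio_xml): expand entity
/// references in `text`, using the five predefined XML entities plus the
/// caller-supplied `custom` map.
/// Limitations of this sketch: numeric references such as `&#x3C;` are
/// reported as unknown, and replacement text is not expanded recursively.
fn expand_entities(text: &str, custom: &HashMap<String, String>) -> Result<String, String> {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(amp) = rest.find('&') {
        // Copy everything before the reference unchanged.
        out.push_str(&rest[..amp]);
        let after = &rest[amp + 1..];
        let semi = after
            .find(';')
            .ok_or_else(|| "unterminated entity reference".to_string())?;
        let name = &after[..semi];
        match name {
            "lt" => out.push('<'),
            "gt" => out.push('>'),
            "amp" => out.push('&'),
            "quot" => out.push('"'),
            "apos" => out.push('\''),
            _ => match custom.get(name) {
                Some(value) => out.push_str(value),
                None => return Err(format!("unknown entity: {}", name)),
            },
        }
        // Continue after the terminating ';'.
        rest = &after[semi + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

With a map that binds `xsd` to the XML Schema namespace IRI, an attribute value such as `&xsd;string` would expand to the full datatype IRI, which is the case discussed in this thread.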
@Tpt It's not my file, so I can't replace anything in it. Either it works or it doesn't.
Unfortunately, neither quick-xml nor xml-rs appears to support doctype entity definitions. Understandably so, I guess, as doctype declarations seem a bit of a hangover from the past. But they are still being used, as this example, albeit a slightly old one, shows.
> Unfortunately, neither quick-xml nor xml-rs appears to support doctype entity definitions.
Yes, it would be great to have an XML library compatible with doctypes. This would indeed allow better compatibility with old files.
FTR, I submitted a PR to quick-xml (https://github.com/tafia/quick-xml/pull/261) which I believe hits a sweet spot:
[...] quick-xml's API: when unescaping a text or attribute, one can now pass a map of custom entity definitions;
[...] rio_xml).
However, to solve this issue, we don't need to parse the DOCTYPE entirely. A naive extraction of internal entity definitions should cover 99.99% of RDF/XML files. I have a POC example program in my PR:
https://github.com/pchampin/quick-xml/blob/custom-entities/examples/custom_entities.rs
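For readers who want a feel for what a "naive extraction of internal entity definitions" might look like, here is a rough standard-library-only sketch. It is not the POC linked above and does not use quick-xml: the function name `extract_entities` is hypothetical, and the approach assumes the DOCTYPE text is already available as a string. It only picks up `<!ENTITY name "value">` declarations with a quoted literal value, skipping parameter entities and external (SYSTEM/PUBLIC) entities.

```rust
use std::collections::HashMap;

/// Hypothetical helper: naively collect internal general entity
/// declarations of the form `<!ENTITY name "value">` (single or double
/// quotes) from the text of a DOCTYPE. Parameter entities (`<!ENTITY % ...>`)
/// and external entities (SYSTEM / PUBLIC) are skipped, because what follows
/// the name is not a quoted literal in those cases.
fn extract_entities(doctype: &str) -> HashMap<String, String> {
    const MARKER: &str = "<!ENTITY";
    let mut entities = HashMap::new();
    let mut rest = doctype;
    while let Some(start) = rest.find(MARKER) {
        // Always advance past the marker so the loop makes progress.
        rest = &rest[start + MARKER.len()..];
        let decl = rest.trim_start();
        // The entity name runs up to the first whitespace character.
        let name_end = match decl.find(char::is_whitespace) {
            Some(i) => i,
            None => break,
        };
        let name = &decl[..name_end];
        let after_name = decl[name_end..].trim_start();
        // The value must start with a quote; otherwise skip this declaration.
        let quote = match after_name.chars().next() {
            Some(q) if q == '"' || q == '\'' => q,
            _ => continue,
        };
        let body = &after_name[1..];
        if let Some(end) = body.find(quote) {
            entities.insert(name.to_string(), body[..end].to_string());
        }
    }
    entities
}
```

The resulting map has the shape that a custom-entity unescaping step would need. A real implementation would also have to decide how to handle comments inside the DOCTYPE and entity values that themselves contain references, which this sketch ignores.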
I have just added entities support to Rio: https://github.com/oxigraph/rio/commit/bb81f95d5cdf6dfcd278d92a2d51bf154a166fb5
@phillord I just pushed the branch tmp-xml-entities, which depends on @Tpt's GitHub version of rio. This adds support for entities in RDF/XML into Sophia.
I will not merge this branch as is, but rather wait for a stable version of Rio. However, in the meantime, you can have your own code depend on this branch.
The improvements to quick-xml and rio are now part of Sophia (and have been for a long time, actually).
I am trying to parse this file
http://www.drugtargetontology.org/dto/dto_vocabulary_gpcr_protein.owl
And getting an error!
The problem appears to be the use of an xsd entity.
The entity appears to be defined correctly. Is this expected?