Closed ekulno closed 2 years ago
As this is standard XML encoding behaviour, this looks like intended behaviour to me. I quickly checked with some other RDF/XML parsers, and these seem to be doing the same here.
If you want encoded characters in your parsed outputs, I would suggest double encoding these characters. I suspect existing serializers would do this automatically.
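To make the double-encoding suggestion concrete, here is a minimal sketch of how an XML parser treats character references in an attribute value. `decodeXmlReferences` is a hypothetical helper written for illustration only; it is not part of rdfxml-streaming-parser or any XML library, and it handles only the five predefined entities and numeric references.

```typescript
// Hypothetical helper (not a library API): decodes the five predefined XML
// entities and numeric character references in a single left-to-right pass.
const PREDEFINED: Record<string, string> = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
};

function decodeXmlReferences(raw: string): string {
  return raw.replace(
    /&(#x[0-9a-fA-F]+|#[0-9]+|[a-z]+);/g,
    (match: string, ref: string): string => {
      if (ref.startsWith("#x")) {
        // Hexadecimal character reference, e.g. &#xA;
        return String.fromCodePoint(parseInt(ref.slice(2), 16));
      }
      if (ref.startsWith("#")) {
        // Decimal character reference, e.g. &#10;
        return String.fromCodePoint(parseInt(ref.slice(1), 10));
      }
      const value = PREDEFINED[ref];
      // Unknown entity names are left untouched in this sketch.
      return value !== undefined ? value : match;
    },
  );
}

// Singly encoded: the reference is decoded to a literal newline.
const single = decodeXmlReferences("http://example.org/a&#xA;b");

// Doubly encoded: only the outer &amp; is decoded, so the text "&#xA;" survives.
const double = decodeXmlReferences("http://example.org/a&amp;#xA;b");
```

Because the replacement runs in one pass, the `&` produced from `&amp;` is not re-scanned, which is why the doubly encoded form keeps the literal text `&#xA;` in the output.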
@rubensworks I think you're correct. There is an RDF/XML test case where an ampersand (`&`) is encoded in the RDF/XML input file, and is decoded in the N-Triples output file: https://www.w3.org/2013/RDFXMLTests/amp-in-url/
However, this does not immediately solve our problem: IIUC there are valid RDF/XML files that do not encode valid RDF graphs. Specifically, an RDF/XML file is allowed to encode characters that violate the abstract syntax rules for RDF terms.
I've asked this at the appropriate W3C mailing list: https://lists.w3.org/Archives/Public/public-rdf-comments/2020Jul/0000.html
Hmm, your point on the unescaped newline makes me suspect that this may in fact be something a parser should check (and error on). But let's await the response on the mailing list.
Btw, I have noticed in other specs (and their test suites) that IRI validation usually isn't checked very strictly, or even not at all.
@rubensworks I indeed believe that RDF parsers must also -- at least to some extent -- check for IRI validity, otherwise valid RDF serialization documents can encode invalid RDF graphs.
Also, several serialization formats require that parsers resolve relative IRIs, which is not possible without -- at least to some extent -- validating the IRI syntax. See https://lists.w3.org/Archives/Public/semantic-web/2018Mar/0016.html for a prior discussion of this.
IMO people who hold that IRI validation is not part of RDF parsing have the following problems:
From @cygri on the W3C mailing list:
I can't find any rationale for ignoring the character reference. And the referenced character is not allowed in an IRI. This would make the document not valid RDF/XML.
Ok, so validating IRIs and throwing an error on invalid ones seems like a good solution. I'd immediately apply this same check for all my parsers. Given the performance overhead, making this check possible to disable is probably also a good idea.
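A rough sketch of what such a check could look like, assuming we only reject the characters excluded by the `IRIREF` production shared by N-Triples and Turtle (U+0000 to U+0020 and `<>"{}|^` plus backtick and backslash). This is not full RFC 3987 validation, and `checkIri` and its `validate` option are hypothetical names chosen for illustration, not the parser's actual API.

```typescript
// Characters that may not appear unescaped in an IRIREF in N-Triples/Turtle:
// U+0000 to U+0020 and < > " { } | ^ ` \
const FORBIDDEN_IN_IRI = /[\u0000-\u0020<>"{}|^`\\]/;

interface IriCheckOptions {
  // Hypothetical switch: set to false to skip the check when speed matters.
  validate?: boolean;
}

function checkIri(iri: string, options: IriCheckOptions = {}): string {
  if (options.validate === false) {
    return iri; // validation disabled, pass the value through unchanged
  }
  const hit = FORBIDDEN_IN_IRI.exec(iri);
  if (hit !== null) {
    // Report the offending code point as four hex digits, e.g. U+000A.
    const hex = ("0000" + hit[0].charCodeAt(0).toString(16).toUpperCase()).slice(-4);
    throw new Error(`Invalid IRI: forbidden character U+${hex} at index ${hit.index}`);
  }
  return iri;
}

// Example: an IRI containing a decoded newline is rejected by default.
let rejected = false;
try {
  checkIri("http://example.org/a\nb");
} catch (error) {
  rejected = true;
}
```

Keeping the check behind an option means callers who trust their input can skip the regex scan on every IRI.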
As discussed with @rubensworks, I will work on this issue via the Comunica Association.
Probably superfluous, but this is still an issue in version 2.1.0
As discussed with @rubensworks, I will work on this issue via the Comunica Association (pending approval from Triply).
@Tpt Thanks! You certainly have Triply's approval :-)
Thanks to @Tpt's work in https://github.com/rdfjs/rdfxml-streaming-parser.js/pull/64, v2.2.0 now implements the new validation logic.
@wouterbeek can you confirm on your end that this resolves this bounty?
Thanks for fixing this, @Tpt and @rubensworks! @Ysgorg, who originally reported this bug, has checked the fix.
@wouterbeek Thanks for checking! I'll ask internally to initiate the invoicing process.
Hi, I have an RDF/XML file where an IRI contains a character sequence that is a URL encoding for newlines (`\n`). In the output of rdfxml-streaming-parser, this sequence is decoded, so that my IRI now instead contains `\n`. The same can be seen for other encoded strings, such as those for `>` and `<`. This is different from what N3 does for Turtle-family parsing. I'm not certain which approach would be correct.

input files:
output: