rdfjs / rdfxml-streaming-parser.js

Streaming RDF/XML parser
https://www.rubensworks.net/blog/2019/03/13/streaming-rdf-parsers/
MIT License

URL encoded strings are decoded in IRIs #39

Closed by ekulno 2 years ago

ekulno commented 4 years ago

Hi, I have an RDF/XML file where an IRI contains the character sequence &#xA;, which is an XML character reference encoding a newline (\n). In the output of rdfxml-streaming-parser, this sequence is decoded, so that my IRI now contains \n instead. The same can be seen for other sequences such as &gt; and &lt;. This is different from what N3 does for Turtle-family parsing. I'm not certain which approach would be correct.

const fs = require('fs');
const RdfXmlParser = require('rdfxml-streaming-parser').RdfXmlParser;
const N3 = require('n3');

// Parse the RDF/XML input and print each quad.
fs.createReadStream('test.rdf')
  .pipe(new RdfXmlParser())
  .on('data', console.log);

// Parse the corresponding Turtle input with N3 for comparison.
fs.createReadStream('test.ttl')
  .pipe(new N3.StreamParser())
  .on('data', console.log);

input files (test.rdf, followed by test.ttl):

<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ns0="b:">

  <rdf:Description rdf:about="a:&#xA;">
    <ns0:b rdf:resource="c:c"/>
  </rdf:Description>

</rdf:RDF>

<a:&#xA;><b:b><c:c>.

output (rdfxml-streaming-parser first, then N3):

Quad {
  subject: NamedNode { value: 'a:\n' },
  predicate: NamedNode { value: 'b:b' },
  object: NamedNode { value: 'c:c' },
  graph: DefaultGraph { value: '' }
}
Quad {
  subject: NamedNode { id: 'a:&#xA;' },
  predicate: NamedNode { id: 'b:b' },
  object: NamedNode { id: 'c:c' },
  graph: DefaultGraph { id: '' }
}

Bounty

A bounty has been placed on this issue by:

Triply
€544


rubensworks commented 4 years ago

As this is standard XML encoding behaviour, this looks like intended behaviour to me. I quickly checked with some other RDF/XML parsers, and these seem to be doing the same here.

If you want encoded characters in your parsed outputs, I would suggest double encoding these characters. I suspect existing serializers would do this automatically.
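To illustrate the double-encoding suggestion, here is a minimal sketch in plain JavaScript. The helpers `decodeXmlReferences` and `escapeAmpersands` are hypothetical names made up for this example and are not part of rdfxml-streaming-parser; the decoder only approximates what an XML parser does with character references in attribute values.

```javascript
// Hypothetical helpers for illustration only; not the library's API.

// Decode numeric character references and the five predefined XML entities,
// roughly as an XML parser would for an attribute value.
function decodeXmlReferences(text) {
  const named = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };
  return text.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|amp|lt|gt|quot|apos);/g, (match, ref) => {
    if (ref.startsWith('#x')) return String.fromCodePoint(parseInt(ref.slice(2), 16));
    if (ref.startsWith('#')) return String.fromCodePoint(parseInt(ref.slice(1), 10));
    return named[ref];
  });
}

// Escape ampersands so that a literal "&#xA;" survives one round of XML decoding.
function escapeAmpersands(text) {
  return text.replace(/&/g, '&amp;');
}

const singleEncoded = 'a:&#xA;';
const doubleEncoded = escapeAmpersands(singleEncoded); // 'a:&amp;#xA;'

console.log(JSON.stringify(decodeXmlReferences(singleEncoded))); // "a:\n"
console.log(JSON.stringify(decodeXmlReferences(doubleEncoded))); // "a:&#xA;"
```

With single encoding the IRI ends up containing an actual newline; with double encoding the decoded IRI contains the literal text &#xA;, which matches what N3 reports for the Turtle input.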

wouterbeek commented 4 years ago

@rubensworks I think you're correct. There is an RDF/XML test case where an ampersand (&) is encoded in the RDF/XML input file, and is decoded in the N-Triples output file: https://www.w3.org/2013/RDFXMLTests/amp-in-url/

However, this does not immediately solve our problem: IIUC there are valid RDF/XML files that do not encode valid RDF graphs. Specifically, an RDF/XML file is allowed to encode characters that violate the abstract syntax rules for RDF terms.

wouterbeek commented 4 years ago

I've asked this at the appropriate W3C mailing list: https://lists.w3.org/Archives/Public/public-rdf-comments/2020Jul/0000.html

rubensworks commented 4 years ago

Hmm, your point on the unescaped newline makes me suspect that this may in fact be something a parser should check (and error on). But let's await the response on the mailing list.

Btw, I have noticed in other specs (and their test suites) that IRI validation usually isn't checked very strictly, or even not at all.

wouterbeek commented 4 years ago

@rubensworks I indeed believe that RDF parsers must also -- at least to some extent -- check for IRI validity, otherwise valid RDF serialization documents can encode invalid RDF graphs.

Also, several serialization formats require that parsers resolve relative IRIs, which is not possible without -- at least to some extent -- validating the IRI syntax. See https://lists.w3.org/Archives/Public/semantic-web/2018Mar/0016.html for a prior discussion of this.

IMO people who hold that IRI validation is not part of RDF parsing have the following problems:

  1. They must admit that valid RDF documents may encode invalid RDF graphs.
  2. They must somehow satisfy the requirement of relative IRI resolution for invalid IRIs.
  3. They must employ an IRI validator component between their RDF parser and RDF loading components. (In practice, I have never seen such an IRI validator component.)

wouterbeek commented 4 years ago

From @cygri on the W3C mailing list:

I can't find any rationale for ignoring the character reference. And the referenced character is not allowed in an IRI. This would make the document not valid RDF/XML.

rubensworks commented 4 years ago

Ok, so validating IRIs and throwing an error on invalid ones seems like a good solution. I'd immediately apply this same check to all my parsers. Given the performance overhead, making it possible to disable this check is probably also a good idea.
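A rough sketch of such a check with an opt-out, in plain JavaScript. The function name `checkIri` and the `validate` option are assumptions made for this example, not the parser's actual API. The check only rejects the characters that the Turtle and N-Triples IRIREF grammar excludes (U+0000 to U+0020 and < > " { } | ^ ` \); it does not attempt a full RFC 3987 validation.

```javascript
// Hypothetical sketch; names and options here are not the library's API.

// Characters excluded from IRIs by the Turtle/N-Triples IRIREF production:
// control characters and space (U+0000 to U+0020), plus < > " { } | ^ ` \
const FORBIDDEN_IN_IRI = /[\u0000-\u0020<>"{}|^`\\]/;

// Returns the IRI unchanged, or throws if validation is on and a forbidden character is found.
function checkIri(iri, { validate = true } = {}) {
  if (validate) {
    const match = FORBIDDEN_IN_IRI.exec(iri);
    if (match) {
      const codePoint = match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
      throw new Error(`Invalid IRI: forbidden character U+${codePoint} at index ${match.index}`);
    }
  }
  return iri;
}

// The decoded IRI from this issue contains a newline and is rejected.
try {
  checkIri('a:\n');
} catch (error) {
  console.log(error.message); // Invalid IRI: forbidden character U+000A at index 2
}

// With validation switched off, the same value passes through unchanged.
console.log(JSON.stringify(checkIri('a:\n', { validate: false }))); // "a:\n"
```

Keeping the check behind a flag lets callers who trust their input skip the per-IRI regular expression test.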

danielbeeke commented 3 years ago

As discussed with @rubensworks, I will work on this issue via the Comunica Association.

LaurensRietveld commented 2 years ago

Probably superfluous, but this is still an issue in version 2.1.0

Tpt commented 2 years ago

As discussed with @rubensworks, I will work on this issue via the Comunica Association (pending approval from Triply).

wouterbeek commented 2 years ago

@Tpt Thanks! You certainly have Triply's approval :-)

rubensworks commented 2 years ago

Thanks to @Tpt's work in https://github.com/rdfjs/rdfxml-streaming-parser.js/pull/64, v2.2.0 now implements the new validation logic.

@wouterbeek can you confirm on your end that this resolves this bounty?

wouterbeek commented 2 years ago

Thanks for fixing this, @Tpt and @rubensworks! @Ysgorg, who originally reported this bug, has checked the fix.

rubensworks commented 2 years ago

@wouterbeek Thanks for checking! I'll ask internally to initiate the invoicing process.