Turtle parser fails for unknown reason (SPARQL 1.1 Syntax Update 1 manifest.ttl)

k00ni commented 3 years ago

@pietercolpaert When parsing turtle file https://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest parser returns 0 triples.

ARC2's Turtle parser has no problems and returns all triples, so file should be correct.

I made a test which shows the problem: https://github.com/pietercolpaert/hardf/blob/error/turtle-parser-fails-unknown/test/TriGParserTest.php#L2061-L2072 (branch error/turtle-parser-fails-unknown)

Any idea why?

zozlak commented 9 months ago

Just out of curiosity - what's the subject created by the ARC2 parser for the <> rdf:type mf:Manifest triple?

The problem is the <> subject. By the turtle specification it's a relative IRI. As a relative IRI it should be resolved using the base IRI. The problem is the base IRI is not set (the @prefix : <http://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest#> . sets the default prefix for prefixed names but not the base IRI for relative IRIs; btw it means parsing works if you add @base <http://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest#> . to the file).

I guess it is expected to work the way that when a remote RDF dataset is being read, the reader should set the base IRI to the remote document location but as the TriGParser does not know the source is a remote dataset, it's on you to include the 'documentIRI' => 'https://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest' in the TriGParser constructor options.

zozlak commented 9 months ago

Still, we should expect the TriGParser either to throw an error when the result of an IRI resolution is empty (because as you shown it harms further processing) or to have some base IRI default.

Funnily both solutions seems to be standards-conformant. For the definition of the base IRI the RDF concepts redirects us to the RFC3986 which says there are four possible sources for the base URI (and first available should be used):

Base URI Embedded in Content which in the RDF context would be the @base/BASE turtle/SPARQL directive in an RDF document
Base URI from the Encapsulating Entity but I can not think of any RDF world equivalence for that
Base URI from the Retrieval URI which is self-descriptive (and this is a solution I mentioned in the previous comment)
Default Base URI - let's just cite the RFC as it nicely explains the trade-off between having and not having it:

If none of the conditions described above apply, then the base URI is defined by the context of the application. As this definition is necessarily application-dependent, failing to define a base URI by using one of the other methods may result in the same content being interpreted differently by different types of applications.

zozlak commented 8 months ago

Discussed it with @k00ni on the side and he's in favour of using a default base URI as a fallback. A pull request will follow shortly.

pietercolpaert commented 8 months ago

The base URI MUST be configurable in the parser and should be the last URI after redirection of the HTTP fetch that took place. If this file just comes from disk, we should expect an @base somewhere. If there’s no @base, probably the base IRI I believe should be the current file path.

I would not use a default base URI hardcoded in the library

pietercolpaert commented 8 months ago

We should probably add this as a test-case too!

zozlak commented 8 months ago

We never considered fixing it to a single value. The only option considered was the last point of the RFC3986 cited below - a fallback when no better option is available. That being said on the TriGParser level the source is just a string so if there is no documentIRI set, we have no way to use a URL from which the data was downloaded or a file name as we just don't have access to this kind of information.

Anyway after some discussion with @k00ni I convinced him to throw an error in such a case with the message explaining the data can't be parse because the base URI is unknown and that in should be set with the documentIRI parser constructor option.

zozlak commented 8 months ago

One more solution to consider. The current problem is caused by the non-strict type comparisons of the TriGParser::$readEntity() callback return value with null (in which case an empty string equals null). This can be solved by turning these comparisons into type-strict ones (=== instead of ==). As a result the parser will parse <> <a> "b" . as ['', 'a', '"b"']). What do you think?

I personally think throwing an error explaining the lack of the base URI is still the best solution.

zozlak commented 8 months ago

In fact the TriGParserTest::testBlankNodes() starts with a test of parsing <> with lacking base IRI as empty subject, predicate, object and graph.

Funnily (or not) it works there but a document like <> <b> <c> . (or the one @k00ni is mentioning in this issue) turns the parser into undefined state.

Should I investigate it or can we assume the test should be adjusted to expect the error to be thrown there? (btw <> are relative IRIs so I'm not sure why they are tested in the testBlankNodes())

k00ni commented 8 months ago

Because of various PHP-magic, I would be in favor of using strict-comparisons (===) wherever we can.

Funnily (or not) it works there but a document like <> <b> <c>. (or the one @k00ni is mentioning in this issue) turns the parser into undefined state. Should I investigate it or can we assume the test should be adjusted to expect the error to be thrown there? (btw <> are relative IRIs so I'm not sure why they are tested in the testBlankNodes())

If our current tests don't reflect the specified behavior, they have to be adapted accordingly. I am not sure what you mean with "turns parser into undefined state". We should aim for a reliable solution which we understand and control. I have no problem if the parser acts weird (at least for a while) in some edge cases which no one uses.

However, I really appreciate you taking the time here @zozlak.

zozlak commented 8 months ago

I am not sure what you mean with "turns parser into undefined state"

The TriGParser is implemented as a state machine but in a little funny way (which I assume just follows the original JS library implementation) - the current state is hold as a property storing the state handling callable and updated with this callable return value. Something like:

$curState = $this->parseTopLevel;
foreach ($lexems as $i) {
  $curState = $curState($i);
}

Our issue was for an empty relative IRI with no document base IRI the callable handling an entity parsing was returning NULL, resulting in $curState being set to NULL, resulting in no further processing being done (as you surely noticed in practice it's a little more subtle as $curState = NULL; $curState($lexem); would throw an PHP error but I hope you get the general idea). When I'm telling about "turning parser into undefined state", I'm talking about ending up with $curState = NULL ("escaping" ;-) the parsing state machine).

zozlak commented 8 months ago

Because of various PHP-magic, I would be in favor of using strict-comparisons (===) wherever we can

I agree with this as a general statement but I don't see reviewing the hardf code base for that in the predictable future.

zozlak commented 4 months ago

Could someone with enough rights (@pietercolpaert or @k00ni I think) make a release including the #42 merge, please?

k00ni commented 4 months ago

Done.

pietercolpaert / hardf

Turtle parser fails for unknown reason (SPARQL 1.1 Syntax Update 1 manifest.ttl) #37