Closed k00ni closed 8 months ago
Just out of curiosity - what's the subject created by the ARC2 parser for the <> rdf:type mf:Manifest
triple?
The problem is the <>
subject. By the turtle specification it's a relative IRI. As a relative IRI it should be resolved using the base IRI. The problem is the base IRI is not set (the @prefix : <http://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest#> .
sets the default prefix for prefixed names but not the base IRI for relative IRIs; btw it means parsing works if you add @base <http://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest#> .
to the file).
I guess it is expected to work the way that when a remote RDF dataset is being read, the reader should set the base IRI to the remote document location but as the TriGParser does not know the source is a remote dataset, it's on you to include the 'documentIRI' => 'https://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest'
in the TriGParser
constructor options.
Still, we should expect the TriGParser
either to throw an error when the result of an IRI resolution is empty (because as you shown it harms further processing) or to have some base IRI default.
Funnily both solutions seems to be standards-conformant. For the definition of the base IRI the RDF concepts redirects us to the RFC3986 which says there are four possible sources for the base URI (and first available should be used):
@base/BASE
turtle/SPARQL directive in an RDF documentIf none of the conditions described above apply, then the base URI is defined by the context of the application. As this definition is necessarily application-dependent, failing to define a base URI by using one of the other methods may result in the same content being interpreted differently by different types of applications.
Discussed it with @k00ni on the side and he's in favour of using a default base URI as a fallback. A pull request will follow shortly.
The base URI MUST be configurable in the parser and should be the last URI after redirection of the HTTP fetch that took place. If this file just comes from disk, we should expect an @base
somewhere. If there’s no @base
, probably the base IRI I believe should be the current file path.
I would not use a default base URI hardcoded in the library
We should probably add this as a test-case too!
We never considered fixing it to a single value. The only option considered was the last point of the RFC3986 cited below - a fallback when no better option is available. That being said on the TriGParser
level the source is just a string so if there is no documentIRI
set, we have no way to use a URL from which the data was downloaded or a file name as we just don't have access to this kind of information.
Anyway after some discussion with @k00ni I convinced him to throw an error in such a case with the message explaining the data can't be parse because the base URI is unknown and that in should be set with the documentIRI
parser constructor option.
One more solution to consider. The current problem is caused by the non-strict type comparisons of the TriGParser::$readEntity()
callback return value with null
(in which case an empty string equals null). This can be solved by turning these comparisons into type-strict ones (===
instead of ==
). As a result the parser will parse <> <a> "b" .
as ['', 'a', '"b"']
). What do you think?
I personally think throwing an error explaining the lack of the base URI is still the best solution.
In fact the TriGParserTest::testBlankNodes()
starts with a test of parsing <>
with lacking base IRI as empty subject, predicate, object and graph.
Funnily (or not) it works there but a document like <> <b> <c> .
(or the one @k00ni is mentioning in this issue) turns the parser into undefined state.
Should I investigate it or can we assume the test should be adjusted to expect the error to be thrown there? (btw <>
are relative IRIs so I'm not sure why they are tested in the testBlankNodes()
)
Because of various PHP-magic, I would be in favor of using strict-comparisons (===
) wherever we can.
Funnily (or not) it works there but a document like
<> <b> <c>
. (or the one @k00ni is mentioning in this issue) turns the parser into undefined state. Should I investigate it or can we assume the test should be adjusted to expect the error to be thrown there? (btw <> are relative IRIs so I'm not sure why they are tested in the testBlankNodes())
If our current tests don't reflect the specified behavior, they have to be adapted accordingly. I am not sure what you mean with "turns parser into undefined state". We should aim for a reliable solution which we understand and control. I have no problem if the parser acts weird (at least for a while) in some edge cases which no one uses.
However, I really appreciate you taking the time here @zozlak.
I am not sure what you mean with "turns parser into undefined state"
The TriGParser
is implemented as a state machine but in a little funny way (which I assume just follows the original JS library implementation) - the current state is hold as a property storing the state handling callable and updated with this callable return value. Something like:
$curState = $this->parseTopLevel;
foreach ($lexems as $i) {
$curState = $curState($i);
}
Our issue was for an empty relative IRI with no document base IRI the callable handling an entity parsing was returning NULL
, resulting in $curState
being set to NULL
, resulting in no further processing being done (as you surely noticed in practice it's a little more subtle as $curState = NULL; $curState($lexem);
would throw an PHP error but I hope you get the general idea). When I'm telling about "turning parser into undefined state", I'm talking about ending up with $curState = NULL
("escaping" ;-) the parsing state machine).
Because of various PHP-magic, I would be in favor of using strict-comparisons (===) wherever we can
I agree with this as a general statement but I don't see reviewing the hardf code base for that in the predictable future.
Could someone with enough rights (@pietercolpaert or @k00ni I think) make a release including the #42 merge, please?
Done.
@pietercolpaert When parsing turtle file https://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest parser returns 0 triples.
ARC2's Turtle parser has no problems and returns all triples, so file should be correct.
I made a test which shows the problem: https://github.com/pietercolpaert/hardf/blob/error/turtle-parser-fails-unknown/test/TriGParserTest.php#L2061-L2072 (branch error/turtle-parser-fails-unknown)
Any idea why?