Closed: k00ni closed this issue 5 months ago
My guess is that the combination of the line length (almost 27k characters for the line throwing the error; it works once shortened to around 25k) and the complexity of the regex used to parse it hits some preg library limit. The regex is this complex because the parser is able to parse RDF-star, and since RDF-star has no dedicated MIME types, the Util class initializes the parser in the RDF-star parsing mode.
Now a good question is how far we want to generalize from this particular example. In general, there will always be a line too long to be parsed properly with a regex. The only question is which size we consider sane and therefore want to implement workarounds for. Available workarounds (surely incomplete, just what quickly comes to mind):
- Adjust the Util class so that for some format strings (e.g. ntriples/n-triples) it initializes the parser in the n-triples mode and not in the n-triples star mode. The non-star regex is much simpler, so it can deal with larger input.

By the way, it is quite suboptimal to store all non-ASCII characters as Unicode escape sequences (5 times more characters and in most cases 2.5 times more bytes); if only they were normal UTF characters in the input, there would be no error. But this is nothing we can change.
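To make the size overhead of the escapes concrete, here is a minimal sketch comparing one Cyrillic letter from the failing literal (U+0430) written as raw UTF-8 and as an N-Triples \uXXXX escape. The variable names are illustrative only and not taken from the library.

```php
<?php
// U+0430 as raw UTF-8 and as the escape sequence used in the input file.
$raw     = "\u{0430}"; // PHP 7+ Unicode escape: yields the UTF-8 encoded letter
$escaped = '\u0430';   // single quotes: six literal ASCII characters

// Count characters with a UTF-8 aware regex and bytes with strlen().
$rawChars     = preg_match_all('/./us', $raw);
$escapedChars = preg_match_all('/./us', $escaped);

printf("raw:     %d char(s), %d byte(s)\n", $rawChars, strlen($raw));
printf("escaped: %d char(s), %d byte(s)\n", $escapedChars, strlen($escaped));
```

For a 2-byte letter like this one the escape triples the byte count, and for a 3-byte character it doubles it, so escaped text reaches a length-related regex limit much sooner than the same text in plain UTF-8.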
I created the issue7 branch where I:
So the problem is caused by the literal value recognition pattern in the non-strict mode: "((?>([^"]|\\")*))" (double quotes, then any number of either everything-other-than-double-quotes or backslash-followed-by-double-quotes, then double quotes).
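As a small illustration of how the two alternatives in this pattern interact, the sketch below runs the pattern exactly as quoted, in isolation, against a literal containing an escaped double quote. The "/" delimiters and the sample literal are my additions; the parser's full regex contains more than this fragment, so this only shows the behaviour of the fragment itself.

```php
<?php
// Non-strict literal pattern as quoted above; the "/" delimiters are added here.
$nonStrict = <<<'RE'
/"((?>([^"]|\\")*))"/
RE;

// Sample subject: the literal "a\"b" (six characters including the outer quotes).
$subject = <<<'NT'
"a\"b"
NT;

preg_match($nonStrict, $subject, $m);

// [^"] is tried first and also accepts the backslash, so the \\" alternative
// never gets to consume it; the atomic group then stops at the escaped quote.
echo $m[0], "\n"; // whole match:  "a\"
echo $m[1], "\n"; // group 1:      a\
```

Because the backslash can be consumed by either alternative, the two branches are not disjoint.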
I see two workarounds:
1. Use the strict mode pattern, which is longer ("((?>[^\x{22}\x{5C}\x{0A}\x{0D}]|\\[tbnrf"'\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*)") but does not cause trouble with the JIT. I guess the important difference is that all strict mode alternatives ([^\x{22}\x{5C}\x{0A}\x{0D}], \\[tbnrf"'\\], \\u[0-9A-Fa-f]{4} and \\U[0-9A-Fa-f]{8}) are strictly disjoint, while there is an overlap between [^"] and \\" in the non-strict pattern.
2. Detect the JIT stack limit error (preg_last_error() === PREG_JIT_STACKLIMIT_ERROR), turn the PCRE JIT off, parse again, and then turn the JIT on again. Funnily, it also works well.

Let's implement the first approach and remember the second if we encounter the issue again.
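Both workarounds can be sketched roughly as follows. This is not the code from the branch: matchWithJitFallback() is a hypothetical helper name, the "/" delimiters are added around the quoted strict pattern, and the sample literal is synthetic. The sketch assumes PHP 7.1+ and that the pcre.jit setting can be changed at runtime with ini_set().

```php
<?php
// Strict-mode literal pattern as quoted above; the "/" delimiters are added here.
$strict = <<<'RE'
/"((?>[^\x{22}\x{5C}\x{0A}\x{0D}]|\\[tbnrf"'\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*)"/
RE;

/**
 * Hypothetical helper (not part of the library): run preg_match() and, if the
 * JIT stack limit is hit, retry once with the PCRE JIT switched off.
 * Returns the match array, or null when the subject does not match.
 */
function matchWithJitFallback(string $pattern, string $subject): ?array
{
    $matches = [];
    $result = preg_match($pattern, $subject, $matches);

    if ($result === false && preg_last_error() === PREG_JIT_STACKLIMIT_ERROR) {
        $previous = ini_get('pcre.jit');        // remember the current setting
        ini_set('pcre.jit', '0');               // JIT off for the retry
        try {
            $result = preg_match($pattern, $subject, $matches);
        } finally {
            if ($previous !== false) {
                ini_set('pcre.jit', $previous); // restore the earlier setting
            }
        }
    }
    if ($result === false) {
        throw new RuntimeException('preg error code ' . preg_last_error());
    }
    return $result === 1 ? $matches : null;
}

// A synthetic literal made of many \uXXXX escapes, similar in shape to the failing line.
$literal = '"' . str_repeat('\u0430\u0431', 2000) . '"';
$m = matchWithJitFallback($strict, $literal);
```

The retry is wrapped in try/finally so the JIT setting is restored even if the second attempt throws. Whether a given input actually triggers the stack limit depends on the PCRE build and its JIT stack size, so the fallback branch may never run on a particular machine.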
It is worth noting that simplifying the pattern to "((?>.*))" will not work, as a triple can be followed by a comment, which can contain any characters, including double quotes.
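A short sketch of why the simplification fails, using a made-up triple (the example.org IRIs and the comment text are placeholders) followed by a comment that contains double quotes:

```php
<?php
// A triple followed by a comment that itself contains double quotes.
$line = '<http://example.org/s> <http://example.org/p> "lit" . # a "quoted" comment';

// Simplified pattern as quoted above: inside the atomic group, .* swallows the
// rest of the line and cannot give the closing quote back, so nothing matches.
$atomicResult = preg_match('/"((?>.*))"/', $line);

// Without the atomic group the match succeeds but runs on into the comment.
preg_match('/"(.*)"/', $line, $m);

echo $atomicResult, "\n"; // 0
echo $m[1], "\n";         // lit" . # a "quoted
```

Either way the literal value is not isolated correctly, which is why the pattern has to exclude unescaped double quotes explicitly.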
NQuadsParser fails for the following ontology:
It seems to fail if a triple looks like:
<http://nl.ijs.si/ME/owl/multext-east.owl#FutureTense> <http://www.w3.org/2000/01/rdf-schema#comment> "e.g., \u0430\u0431\u0435\u0442\u043A\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0431\u0435\u0442\u043A\u0443\u0432\u0430\u0442\u0438 \u0430\u0432\u0456\u0437\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0432\u0456\u0437\u0443\u0432\u0430\u0442\u0438 \u0430\u0433\u0430\u043A\u0430\u0442\u0438\u043C\u0435/\u0430\u0433\u0430\u043A\u0430\u0442\u0438 \u0430\u0433\u0456\u0442\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0433\u0456\u0442\u0443\u0432\u0430\u0442\u0438 \u0430\u0433\u043E\u043D\u0456\u0437\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0433\u043E\u043D\u0456\u0437\u0443\u0432\u0430\u0442\u0438 \u0430\u0434\u0432\u043E\u043A\u0430\u0442\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0434\u0432\u043E\u043A\u0430\u0442\u0443\u0432\u0430\u0442\u0438 ..."
Parsing it results in an RdfIoException:
I prepared a test case so you can see for yourself:
It seems to be valid n-triples, because rapper (https://librdf.org/raptor/rapper.html) parses it without any errors.