NQuadsParser fails when object literal contains certain Unicode characters (e.g. \u0432\u0440\u043E\u0440\u0438\u043D...)

k00ni commented 5 months ago

NQuadsParser fails for the following ontology:

It seems to fail if a triple looks like:

<http://nl.ijs.si/ME/owl/multext-east.owl#FutureTense> <http://www.w3.org/2000/01/rdf-schema#comment> "e.g., \u0430\u0431\u0435\u0442\u043A\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0431\u0435\u0442\u043A\u0443\u0432\u0430\u0442\u0438 \u0430\u0432\u0456\u0437\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0432\u0456\u0437\u0443\u0432\u0430\u0442\u0438 \u0430\u0433\u0430\u043A\u0430\u0442\u0438\u043C\u0435/\u0430\u0433\u0430\u043A\u0430\u0442\u0438 \u0430\u0433\u0456\u0442\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0433\u0456\u0442\u0443\u0432\u0430\u0442\u0438 \u0430\u0433\u043E\u043D\u0456\u0437\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0433\u043E\u043D\u0456\u0437\u0443\u0432\u0430\u0442\u0438 \u0430\u0434\u0432\u043E\u043A\u0430\u0442\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0434\u0432\u043E\u043A\u0430\u0442\u0443\u0432\u0430\u0442\u0438 ..."

Parsing it results in an RdfIoException:

quickRdfIo\RdfIoException: Can't parse end "e.g., \u0430\u0431\u0435\u0442\ [...] (uk)" .
/var/www/html/src/quickRdfIo/NQuadsParser.php:397
/var/www/html/src/quickRdfIo/NQuadsParser.php:323
/var/www/html/src/quickRdfIo/NQuadsParser.php:230
/var/www/html/tests/NQuadsParserTest.php:256

I prepared a test case so you can see for yourself:

The input seems to be valid N-Triples, because rapper parses it without any errors (https://librdf.org/raptor/rapper.html).

zozlak commented 5 months ago

My guess is that a combination of the line length (almost 27k characters for the line throwing the error; it works once shortened to around 25k) and the complexity of the regex used to parse it hits some preg library limit. The regex is this complex because the parser is able to parse RDF-star, and since RDF-star has no dedicated MIME types, the Util class initializes the parser in the RDF-star parsing mode.

Now a good question is how far we want to generalize this particular example. In general, there will always be a line too long to be parsed properly with a regex. The only question is which size we consider sane and therefore want to implement workarounds for. Available workarounds are (surely incomplete, just what quickly comes to mind):

Adjust the Util class so that for some format strings (e.g. ntriples/n-triples) it initializes the parser in the N-Triples mode and not in the N-Triples-star mode. The non-star regex is much simpler and can therefore deal with larger input.
Split parsing into smaller steps. Currently, in the star parsing mode, we try to parse the whole object part of a triple with one regex. This could be split into separate attempts to parse a literal, a named node, a blank node or a triple-as-object. This will hurt performance slightly and make the code more complex, so it would be good to evaluate the actual trade-off.
Try to optimize the regex in use (e.g. using DEFINE).
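
As an illustration of the second workaround, here is a minimal sketch of trying each kind of object in a separate step. It is not the quickRdfIo code: `parseObject()`, the simplified IRI and blank node patterns, and the returned array shape are all hypothetical, and the triple-as-object step is left out. The literal pattern is the strict-mode one quoted later in this thread.

```php
<?php
// Strict-mode literal pattern as quoted in this thread; the /A modifier
// anchors the match at the offset passed to preg_match().
const LITERAL_PATTERN = <<<'REGEX'
/"((?>[^\x{22}\x{5C}\x{0A}\x{0D}]|\\[tbnrf"'\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*)"/A
REGEX;

/**
 * Hypothetical helper: tries small, independent patterns one after another
 * instead of one big alternation covering every kind of object.
 */
function parseObject(string $line, int $offset): array
{
    $steps = [
        'iri'     => '/<([^<>\s]*)>/A',      // simplified, assumption
        'blank'   => '/_:([A-Za-z0-9]+)/A',  // simplified, assumption
        'literal' => LITERAL_PATTERN,
    ];
    foreach ($steps as $kind => $pattern) {
        $m = [];
        if (preg_match($pattern, $line, $m, 0, $offset) === 1) {
            return ['kind' => $kind, 'value' => $m[1], 'length' => strlen($m[0])];
        }
    }
    throw new RuntimeException("Can't parse object at offset $offset");
}

$parsed = parseObject('<s> <p> <http://example.org/o> .', 8);
printf("%s: %s\n", $parsed['kind'], $parsed['value']);
```

Each step uses a small pattern, at the price of up to one `preg_match()` call per kind of object, which is the performance trade-off mentioned above.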

By the way, it is quite suboptimal to store all non-ASCII characters as Unicode escape sequences: each \uXXXX escape takes 6 characters instead of 1, and 6 bytes instead of the 2 or 3 bytes of the UTF-8 encoding. If they were plain UTF-8 characters in the input, there would be no error. But this is nothing we can change.
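
The size overhead is easy to check. The sketch below compares the first six escapes of the literal from the report with the same word written as UTF-8; the variable names are only illustrative.

```php
<?php
// Sample word: the first six escapes of the literal in the report ("абетку").
// Single quotes keep the backslashes literal, as they appear in the file.
$escaped = '\u0430\u0431\u0435\u0442\u043A\u0443';
// PHP 7+ code point syntax, produces the UTF-8 encoded characters.
$raw = "\u{0430}\u{0431}\u{0435}\u{0442}\u{043A}\u{0443}";

$rawChars = preg_match_all('/./u', $raw); // counts UTF-8 characters
$escChars = strlen($escaped);             // pure ASCII: characters == bytes

printf("characters: %d raw vs %d escaped\n", $rawChars, $escChars);
printf("bytes:      %d raw vs %d escaped\n", strlen($raw), strlen($escaped));
// characters: 6 raw vs 36 escaped
// bytes:      12 raw vs 36 escaped
```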

zozlak commented 5 months ago

I created the issue7 branch where I:

isolated the problem to two triples
fixed the regex match error recognition and reporting, so we are now clearly informed that we are affected by the "JIT stack limit exhausted" error (see e.g. here)
implemented the first of the above-mentioned solutions, but it did not help here (the JIT stack is exhausted with both the RDF-star and the non-RDF-star regexes)
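
For readers unfamiliar with how such errors surface in PHP: `preg_match()` returns `false` when the engine fails, and `preg_last_error()` / `preg_last_error_msg()` say why. The sketch below shows the general idea only; `matchOrFail()` is a hypothetical helper, not the code from the issue7 branch, and `preg_last_error_msg()` assumes PHP 8.0 or newer.

```php
<?php
/**
 * Hypothetical helper: runs preg_match() and turns an engine failure
 * (as opposed to "no match") into an exception naming the PCRE error.
 */
function matchOrFail(string $pattern, string $subject): ?array
{
    $matches = [];
    $result = preg_match($pattern, $subject, $matches);
    if ($result === false) {
        // For an exhausted JIT stack the message reads "JIT stack limit exhausted"
        throw new RuntimeException(
            'Regex engine error: ' . preg_last_error_msg(),
            preg_last_error()
        );
    }
    // 1 means a match was found, 0 means the pattern simply did not match
    return $result === 1 ? $matches : null;
}

try {
    // Invalid UTF-8 with the /u modifier is an easy way to provoke an engine error
    matchOrFail('/./u', "\xff");
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

Without the `=== false` check, an engine error is indistinguishable from "the line does not match", which is what makes the original "Can't parse end" message misleading.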

zozlak commented 5 months ago

So the problem is caused by the literal value recognition pattern in the non-strict mode: "((?>([^"]|\\")*))" (double quotes, then any number of: either everything-other-than-double-quotes or backslash-followed-by-double-quotes, then double quotes). I see two workarounds:

Use the literal pattern from the strict mode. It is much longer ("((?>[^\x{22}\x{5C}\x{0A}\x{0D}]|\\[tbnrf"'\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*)") but does not cause trouble with the JIT. I guess the important difference is that all strict mode alternatives ([^\x{22}\x{5C}\x{0A}\x{0D}], \\[tbnrf"'\\], \\u[0-9A-Fa-f]{4} and \\U[0-9A-Fa-f]{8}) are strictly disjoint, while [^"] and \\" overlap in the non-strict pattern (both can match a backslash).
If preg_last_error() === PREG_JIT_STACKLIMIT_ERROR, turn the PCRE JIT off, parse again, and then turn the JIT back on. Funnily enough, this also works well.
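
The second workaround could look roughly like the sketch below. `matchWithJitFallback()` is a hypothetical name, and the sketch assumes that `pcre.jit` can be changed at runtime with `ini_set()` and that the retry then runs without the JIT, which is what the observation above suggests; it has not been checked against other PHP versions.

```php
<?php
/**
 * Hypothetical helper: retries a match with the PCRE JIT switched off
 * when the first attempt exhausted the JIT stack.
 */
function matchWithJitFallback(string $pattern, string $subject): ?array
{
    $matches = [];
    $result = preg_match($pattern, $subject, $matches);
    if ($result === false && preg_last_error() === PREG_JIT_STACKLIMIT_ERROR) {
        // ini_set() returns the old value, or false if the setting is unavailable
        $previous = ini_set('pcre.jit', '0');
        try {
            $result = preg_match($pattern, $subject, $matches);
        } finally {
            if ($previous !== false) {
                ini_set('pcre.jit', $previous); // turn the JIT back on
            }
        }
    }
    if ($result === false) {
        throw new RuntimeException('Regex engine error: ' . preg_last_error_msg());
    }
    return $result === 1 ? $matches : null;
}

$m = matchWithJitFallback('/"([^"]*)"/', '<s> <p> "short literal" .');
echo $m[1], "\n";
```

The fallback only costs anything for lines that actually exhaust the JIT stack; all other lines keep the JIT speed.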

Let's implement the first approach and remember about the second if we encounter the issue again.
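
To make the first approach concrete, here is a small sketch applying the strict-mode literal pattern quoted above to a sample line. The `/` delimiters and the sample triple are assumptions made for the example; only the pattern body comes from this thread.

```php
<?php
// Strict-mode literal pattern from this thread. Every alternative starts
// differently (ordinary character, simple escape, \uXXXX, \UXXXXXXXX),
// so at any position at most one of them can match.
const STRICT_LITERAL = <<<'REGEX'
/"((?>[^\x{22}\x{5C}\x{0A}\x{0D}]|\\[tbnrf"'\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*)"/
REGEX;

// Sample line: a literal with \u escapes and escaped quotes, followed by
// a comment that itself contains double quotes.
$line = '<s> <p> "e.g., \u0430\u0431 \\"quoted\\"" . # comment with "quotes"';

$m = [];
if (preg_match(STRICT_LITERAL, $line, $m) === 1) {
    echo $m[1], "\n"; // e.g., \u0430\u0431 \"quoted\"
}
```

The match stops at the first unescaped double quote, so the quotes in the trailing comment are never consumed.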

It is worth noting that simplifying the pattern to "((?>.*))" will not work, as a triple can be followed by a comment, which can contain any characters including double quotes.
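
A quick check on a made-up line illustrates this. The sample triple and the `/` delimiters are assumptions; the sketch tries both a plain greedy variant and the atomic variant of the simplified pattern.

```php
<?php
// Made-up sample line: a short literal followed by a comment with quotes.
$line = '<s> <p> "value" . # a comment with "quotes"';

// Plain greedy variant: runs up to the LAST double quote on the line,
// so the captured "literal" swallows part of the comment.
$greedy = [];
preg_match('/"(.*)"/', $line, $greedy);
echo $greedy[1], "\n"; // value" . # a comment with "quotes

// Atomic variant: .* consumes the rest of the line and the atomic group
// cannot give anything back, so the closing quote never matches.
$atomicResult = preg_match('/"((?>.*))"/', $line);
var_dump($atomicResult); // int(0)
```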

sweetrdf / quickRdfIo #7