sweetrdf / quickRdfIo

NQuadsParser fails when object literal contains certain Unicode characters (e.g. \u0432\u0440\u043E\u0440\u0438\u043D...) #7

Closed k00ni closed 5 months ago

k00ni commented 5 months ago

NQuadsParser fails for the following ontology:

It seems to fail if a triple looks like:

<http://nl.ijs.si/ME/owl/multext-east.owl#FutureTense> <http://www.w3.org/2000/01/rdf-schema#comment> "e.g., \u0430\u0431\u0435\u0442\u043A\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0431\u0435\u0442\u043A\u0443\u0432\u0430\u0442\u0438 \u0430\u0432\u0456\u0437\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0432\u0456\u0437\u0443\u0432\u0430\u0442\u0438 \u0430\u0433\u0430\u043A\u0430\u0442\u0438\u043C\u0435/\u0430\u0433\u0430\u043A\u0430\u0442\u0438 \u0430\u0433\u0456\u0442\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0433\u0456\u0442\u0443\u0432\u0430\u0442\u0438 \u0430\u0433\u043E\u043D\u0456\u0437\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0433\u043E\u043D\u0456\u0437\u0443\u0432\u0430\u0442\u0438 \u0430\u0434\u0432\u043E\u043A\u0430\u0442\u0443\u0432\u0430\u0442\u0438\u043C\u0435/\u0430\u0434\u0432\u043E\u043A\u0430\u0442\u0443\u0432\u0430\u0442\u0438 ..."

Parsing it results in an RdfIoException:

quickRdfIo\RdfIoException: Can't parse end "e.g., \u0430\u0431\u0435\u0442\ [...] (uk)" .
/var/www/html/src/quickRdfIo/NQuadsParser.php:397
/var/www/html/src/quickRdfIo/NQuadsParser.php:323
/var/www/html/src/quickRdfIo/NQuadsParser.php:230
/var/www/html/tests/NQuadsParserTest.php:256

I prepared a test case so you can see for yourself:

It seems to be valid N-Triples, because rapper parses it without any errors (https://librdf.org/raptor/rapper.html).

zozlak commented 5 months ago

My guess is that the combination of the line length (almost 27k characters for the line throwing the error; it works once shortened to around 25k) and the complexity of the regex used to parse it reaches some preg library limit. The regex is this complex because the parser can also parse RDF-star, and since RDF-star has no dedicated MIME types, the Util class initializes the parser in RDF-star parsing mode.

Now a good question is how far we want to generalize from this particular example. In general, there will always be a line too long to be parsed properly with a regex. The only question is which size we consider sane and therefore want to implement workarounds for. Available workarounds are (certainly incomplete, just what quickly comes to mind):

By the way, storing all non-ASCII characters as Unicode escape sequences is quite suboptimal: each \uXXXX escape takes six characters instead of one and, for two-byte UTF-8 characters such as Cyrillic, three times as many bytes. If only they were normal UTF characters in the input, there would be no error. But this is nothing we can change.
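To put a number on that overhead, here is a small sketch in Python (used only for illustration; the library itself is PHP). The helper `escape_non_ascii` is hypothetical and only handles characters in the Basic Multilingual Plane. The sample word is the first escaped word from the failing triple, decoded.

```python
def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a \\uXXXX escape (BMP only)."""
    return "".join(
        ch if ord(ch) < 128 else f"\\u{ord(ch):04X}"
        for ch in text
    )


# First escaped word of the failing literal, decoded to plain Cyrillic.
word = "абеткувати"
escaped = escape_non_ascii(word)

# Character counts: raw vs. escaped.
print(len(word), len(escaped))
# UTF-8 byte counts: raw vs. escaped.
print(len(word.encode("utf-8")), len(escaped.encode("utf-8")))
```

For this word the escaped form has six times as many characters and three times as many UTF-8 bytes as the raw form, which is how a literal can grow into a line of tens of thousands of characters.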

zozlak commented 5 months ago

I created the issue7 branch where I:

zozlak commented 5 months ago

So the problem is caused by the literal value recognition pattern in non-strict mode: "((?>([^"]|\\")*))" (a double quote, then any number of either a character other than a double quote or a backslash followed by a double quote, then a double quote). I see two workarounds:

Let's implement the first approach and keep the second in mind in case we encounter the issue again.

It is worth noting that simplifying the pattern to "((?>.*))" will not work, as a triple can be followed by a comment, which can contain any characters, including double quotes.
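A minimal sketch of that pitfall, in Python for illustration only. Neither pattern below is the expression used by NQuadsParser: `GREEDY` stands in for the over-simplified idea, `BOUNDED` is a common textbook pattern for a quoted string with backslash escapes, and the input line is made up.

```python
import re

# Illustrative patterns only; neither is the expression used by NQuadsParser.
GREEDY = re.compile(r'"(.*)"')
BOUNDED = re.compile(r'"((?:[^"\\]|\\.)*)"')

# Hypothetical line: a triple followed by a comment that contains double quotes.
line = '<http://example.org/s> <http://example.org/p> "value" . # a "quoted" comment'


def literal(pattern: re.Pattern, text: str) -> str:
    """Return the first captured literal value, or raise if nothing matches."""
    match = pattern.search(text)
    if match is None:
        raise ValueError("no literal found")
    return match.group(1)


print(literal(GREEDY, line))   # value" . # a "quoted
print(literal(BOUNDED, line))  # value
```

The greedy version runs to the last double quote on the line, so the captured "literal" swallows the terminating dot and part of the comment. The bounded version stops at the first unescaped double quote.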