TriGParser/N3Lexer fails when certain control / Unicode characters appear in string (e.g. \xEF\xBB\xBF)

k00ni commented 5 months ago

TriGParser / N3Lexer throws an exception if a file starts with certain control or Unicode characters. I couldn't find a way to display these characters in an editor (VSCode, gedit, ...), so don't be surprised if you see nothing in front of the @prefix statements when you open the file.
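Since editors hide these bytes, one way to make them visible is to print the start of the input as hex. The following is a minimal sketch; `leadingBytesAsHex` is a hypothetical helper, not part of hardf or quickRdfIo, and the sample string only mimics the beginning of the affected file.

```php
<?php
// Hypothetical helper (not part of hardf or quickRdfIo): return the first
// $length bytes of a string as space-separated uppercase hex pairs.
function leadingBytesAsHex(string $data, int $length = 8): string
{
    if ($data === '') {
        return '';
    }
    $head = substr($data, 0, $length);
    return strtoupper(implode(' ', str_split(bin2hex($head), 2)));
}

// Sample input that mimics the start of the affected file (assumed content).
$sample = "\xEF\xBB\xBF@prefix owl: <http://www.w3.org/2002/07/owl#> .";

// The first three bytes are the invisible ones; 40 is the "@" of "@prefix".
echo leadingBytesAsHex($sample, 4), PHP_EOL; // prints "EF BB BF 40"
```

For a real file, the same helper could be fed the result of `file_get_contents()`.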

The stack trace is:

Fatal error: Uncaught Exception: Unexpected "@prefix" on line 1. 
in /govi/scripts/vendor/pietercolpaert/hardf/src/N3Lexer.php:456

Stack trace:
#0 /.../hardf/src/N3Lexer.php(109): pietercolpaert\hardf\N3Lexer->syntaxError('\xEF\xBB\xBF@prefix', 1)
#1 /.../hardf/src/N3Lexer.php(408): pietercolpaert\hardf\N3Lexer->pietercolpaert\hardf\{closure}(Object(pietercolpaert\hardf\N3Lexer))
#2 /.../hardf/src/N3Lexer.php(470): pietercolpaert\hardf\N3Lexer->tokenizeToEnd(Object(Closure), false)
#3 /.../hardf/src/N3Lexer.php(491): pietercolpaert\hardf\N3Lexer->pietercolpaert\hardf\{closure}('\xEF\xBB\xBF@prefix owl:...', false)
#4 /.../hardf/src/TriGParser.php(1183): pietercolpaert\hardf\N3Lexer->tokenize('\xEF\xBB\xBF@prefix owl:...', false)
#5 /govi/scripts/vendor/sweetrdf/quick-rdf-io/src/quickRdfIo/TriGParser.php(159): pietercolpaert\hardf\TriGParser->parseChunk('\xEF\xBB\xBF@prefix owl:...')

(note: quickRdfIo was used)

Here is a prepared failing test:

https://github.com/pietercolpaert/hardf/blob/bug/parser-fails-control-or-unicode-characters/test/TriGParserTest.php#L2110-L2113

And here is the related N3 file:

https://lov.linkeddata.es/dataset/lov/vocabs/identity/versions/2014-04-03.n3

zozlak commented 2 months ago

This is the UTF-8 byte order mark (BOM). From a design perspective, I would say handling it is out of scope for the hardf library. hardf parses strings containing an RDF document, while the BOM belongs to a lower software level: it tells whoever reads a text file how to decode it into a string.

By the way, including a BOM in UTF-8 is pretty useless, but I guess you just scraped this file from the net, so it's someone else's fault.

To wrap up:

I would suggest closing this issue as not applicable to hardf.
Please feel free to create an issue in quickRdfIo requesting a BOM check when the input is a file.

k00ni commented 2 months ago

Thanks for the clarification, @zozlak. It's mind-boggling that in 2024 we still have to handle things like that. You are probably right that this might not be hardf's responsibility, but a typical developer might not think of it when running into something like this. Someone receives or loads RDF as a file or a string and wants to read it. I would argue that hardf should at least be tolerant enough here. What do you think, @pietercolpaert?

Unfortunately I don't have time to investigate further, but I will take you up on your offer, @zozlak, to discuss this further in the quickRdfIo repository.

zozlak commented 2 months ago

The solution would be as simple as adding a dependency on an existing BOM-handling library (e.g. https://packagist.org/packages/duncan3dc/bom-string or https://packagist.org/packages/fab2s/bom) and using it to strip the BOM, if present, from the input string.
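For illustration, the strip-BOM-if-exists step can also be sketched in plain PHP without a dependency. This is only a sketch: `stripUtf8Bom` is a hypothetical function name and does not reflect the API of hardf, quickRdfIo, or either of the libraries linked above.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF) if present,
// and return the input unchanged otherwise.
function stripUtf8Bom(string $input): string
{
    $bom = "\xEF\xBB\xBF";
    // strncmp() is binary safe, so comparing raw bytes is fine here.
    if (strncmp($input, $bom, strlen($bom)) === 0) {
        return substr($input, strlen($bom));
    }
    return $input;
}

// Sample input (assumed content) that starts with a BOM.
$raw = "\xEF\xBB\xBF@prefix owl: <http://www.w3.org/2002/07/owl#> .";
$clean = stripUtf8Bom($raw);
echo substr($clean, 0, 7), PHP_EOL; // prints "@prefix"
```

The cleaned string could then be handed to the parser in place of the raw file contents.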

zozlak commented 5 days ago

BOM handling (skipping the UTF-8 BOM and raising exceptions on other encodings' BOMs) has been added in quickRdfIo 1.1.3. If you want a smooth experience, please use hardf through quickRdfIo :-)
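The behaviour described here (skip a UTF-8 BOM, reject other BOMs) could look roughly like the sketch below. It is not the actual quickRdfIo code; `skipBomOrFail`, the list of checked encodings, and the exception type are assumptions made for illustration.

```php
<?php
// Hypothetical helper, not taken from quickRdfIo: drop a leading UTF-8 BOM,
// throw if the input starts with a UTF-16 or UTF-32 BOM, pass through otherwise.
function skipBomOrFail(string $input): string
{
    $utf8Bom = "\xEF\xBB\xBF";
    if (strncmp($input, $utf8Bom, strlen($utf8Bom)) === 0) {
        return substr($input, strlen($utf8Bom));
    }

    // Longer signatures come first: the UTF-32 LE BOM begins with the
    // same two bytes as the UTF-16 LE BOM.
    $otherBoms = [
        'UTF-32 BE' => "\x00\x00\xFE\xFF",
        'UTF-32 LE' => "\xFF\xFE\x00\x00",
        'UTF-16 BE' => "\xFE\xFF",
        'UTF-16 LE' => "\xFF\xFE",
    ];
    foreach ($otherBoms as $name => $bom) {
        if (strncmp($input, $bom, strlen($bom)) === 0) {
            throw new InvalidArgumentException(
                "Unsupported encoding: input starts with a $name BOM"
            );
        }
    }
    return $input;
}
```

Checking the UTF-8 BOM first keeps the common case cheap, and throwing on the other BOMs avoids feeding wrongly decoded bytes to the parser.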

pietercolpaert / hardf

TriGParser/N3Lexer fails when certain control / Unicode characters appear in string (e.g. \xEF\xBB\xBF) #44