k00ni closed this 2 months ago
This is a UTF-8 BOM (byte order mark). From a design perspective, I would say handling it is out of scope for the hardf library. hardf parses strings containing an RDF document, while the BOM belongs to a lower software level: it tells whoever is reading a text file how to properly decode it into a string.
By the way, including a BOM in UTF-8 is pretty useless, but I guess you just scraped this file from the net, so it's someone else's fault.
To wrap up:
Thanks for the clarification @zozlak. It's mind-boggling that in 2024 we still have to handle things like that. You are probably right that it might not be hardf's responsibility when this happens, but an average developer might not think of a BOM when they run into something like this. Someone receives or loads RDF as a file or a string and wants to read it. I would argue that hardf should at least be tolerant here, or what do you think @pietercolpaert?
Unfortunately I don't have time to investigate further, but I will take you up on your offer, @zozlak, to discuss this further in the quickRdfIo repository.
The solution would be as simple as adding a dependency on an existing BOM-handling library (e.g. https://packagist.org/packages/duncan3dc/bom-string or https://packagist.org/packages/fab2s/bom) and using it to strip the BOM, if present, from the input string.
BOM handling (skipping the UTF-8 BOM and raising exceptions on other encodings' BOMs) has been added in quickRdfIo 1.1.3. If you want a smooth experience, please use hardf through quickRdfIo :-)
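For anyone who feeds strings to hardf directly, the behaviour described above (skip a UTF-8 BOM, reject BOMs of other encodings) can be sketched in a few lines of plain PHP. This is only an illustration and not the actual quickRdfIo code; the function name stripUtf8Bom is made up for this example and does not exist in hardf, quickRdfIo or the libraries linked above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of hardf or quickRdfIo):
 * removes a leading UTF-8 BOM and rejects UTF-16/UTF-32 BOMs.
 */
function stripUtf8Bom(string $input): string
{
    // UTF-32 marks are checked first because the UTF-32 LE mark
    // begins with the same two bytes as the UTF-16 LE mark.
    $unsupported = [
        'UTF-32 BE' => "\x00\x00\xFE\xFF",
        'UTF-32 LE' => "\xFF\xFE\x00\x00",
        'UTF-16 BE' => "\xFE\xFF",
        'UTF-16 LE' => "\xFF\xFE",
    ];
    foreach ($unsupported as $encoding => $bom) {
        if (strncmp($input, $bom, strlen($bom)) === 0) {
            throw new InvalidArgumentException(
                "Input starts with a $encoding BOM; only UTF-8 is supported"
            );
        }
    }

    // The UTF-8 BOM is the three bytes EF BB BF.
    $utf8Bom = "\xEF\xBB\xBF";
    if (strncmp($input, $utf8Bom, strlen($utf8Bom)) === 0) {
        return substr($input, strlen($utf8Bom));
    }

    // No BOM found: return the input unchanged.
    return $input;
}
```

Calling this on the file contents before handing the string to the parser should be enough for files like the one in this issue. The check works on raw bytes, so it has to run before any other decoding or trimming of the input.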
TriGParser / N3Lexer throws an exception if a file contains certain control or Unicode characters. I couldn't find a way to display these characters in an editor (VSCode, gedit, ...), so don't be surprised if you see no such characters before the @prefix statements when you open the file.
The stack trace is:
(note: quickRdfIo was used)
Here is a prepared failing test:
https://github.com/pietercolpaert/hardf/blob/bug/parser-fails-control-or-unicode-characters/test/TriGParserTest.php#L2110-L2113
And here is the related N3 file:
https://lov.linkeddata.es/dataset/lov/vocabs/identity/versions/2014-04-03.n3