While you're at it, how about bare codepoints themselves? Got any like https://github.com/shexSpec/shexTest/blob/master/schemas/1val1STRING_LITERAL1_with_UTF8_boundaries.shex ?
There are i18n tests in the original DAWG suite, but they similarly stay below U+FFFF.
The strategy I took in ShEx (and some in Turtle before) was to test the boundaries of permissible characters and, iirc, the boundaries of UTF-8 representations. For instance, the above ShEx has the characters 0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xCFFF, 0xD000, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x3FFFD, 0x40000, 0xFFFFD, 0x100000, 0x10FFFD. The 0xD800-0xDFFF range is prohibited because those code points are reserved as surrogates, which UTF-16 uses in pairs to encode characters above U+FFFF.
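To make the boundary idea concrete, here is a minimal Python sketch that reports how many UTF-8 bytes each of the code points listed above needs, and checks whether a code point can be encoded at all. The helper names (`utf8_length`, `can_encode`) are made up for illustration and are not part of either test suite.

```python
# Code points listed in the comment above (from the ShEx boundary test).
BOUNDARIES = [
    0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xCFFF, 0xD000, 0xD7FF,
    0xE000, 0xFFFD, 0x10000, 0x3FFFD, 0x40000, 0xFFFFD, 0x100000, 0x10FFFD,
]


def utf8_length(cp: int) -> int:
    """Return the number of bytes UTF-8 uses for code point `cp`."""
    return len(chr(cp).encode("utf-8"))


def can_encode(cp: int) -> bool:
    """Return True if `cp` can be encoded as UTF-8 (surrogates cannot)."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for cp in BOUNDARIES:
        print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
    # A lone surrogate such as U+D800 is rejected by Python's UTF-8 codec.
    print("U+D800 encodable:", can_encode(0xD800))
```

The 2-, 3-, and 4-byte transitions fall at 0x80, 0x800, and 0x10000, which is why values on either side of those points appear in the list.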
It also might be nice to reuse those names from the ShEx test suite because they have systematic names and the two languages have identical terminals.
@ericprud I've never been clear on whether a partial surrogate pair codepoint was actually prohibited, or just nonsensical. The SPARQL grammar and spec text don't seem to give any guidance on this.
Testing boundaries would definitely be a good idea, though I'm not sure if all languages/environments are happy working with unassigned code points. Would be interested in hearing any experience people have with that sort of data.
> @ericprud I've never been clear on whether a partial surrogate pair codepoint was actually prohibited, or just nonsensical.
Thanks, @lisp. I think I've read right past that before, because it sits, oddly, under the "UTF-16 FAQ" section and is written in a way that seems (in my reading) to conflate 16-bit code units with something broadly applicable to all "UTFs". Re-reading the UTF-8 section, this one seems relevant:
@ericprud 4a12627 is a commit, part of #67, that adds syntax tests for Unicode boundaries and for the invalid use of a lone surrogate codepoint. What do you think?
@kasei, I approved them, which might come off as arrogant but was intended as a procedural +1.
AFAICT, the SPARQL test suite does not have any coverage of the 4-byte variant of codepoint escape sequences (`\UXXXXXXXX`). This could be a problem for implementations that defer escape handling to tools that only support the 2-byte variant (such as, I believe, the javacc parser generator).
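As an illustration of the difference between the two variants, here is a minimal Python sketch that decodes both `\uXXXX` (4 hex digits) and `\UXXXXXXXX` (8 hex digits) escapes in a string. The function name `decode_codepoint_escapes` is hypothetical, and this is not how any particular SPARQL implementation handles escapes; a tool that only knows the 4-digit form would leave the 8-digit form undecoded.

```python
import re

# Matches \uXXXX (group 1, 4 hex digits) or \UXXXXXXXX (group 2, 8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_codepoint_escapes(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they name."""

    def replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        cp = int(hex_digits, 16)
        if cp > 0x10FFFF:
            # Eight hex digits can name values beyond the Unicode range.
            raise ValueError(f"escape out of Unicode range: {match.group(0)}")
        # Note: this sketch does not reject surrogates (U+D800..U+DFFF).
        return chr(cp)

    return _ESCAPE.sub(replace, text)


if __name__ == "__main__":
    # The escaped code point U+10000 is the first one that needs the long form.
    decoded = decode_codepoint_escapes(r"a\U00010000b")
    print([f"U+{ord(ch):04X}" for ch in decoded])
```

A test in this style would pair a query containing the long escape with the same query containing the literal character and check that both parse to the same result.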