SPARQL test suite missing coverage of codepoint escapes

w3c / rdf-tests

Repository for the RDF Tests Community Group

w3c.github.io/rdf-tests

SPARQL test suite missing coverage of codepoint escapes #64

Closed kasei closed 3 years ago

kasei commented 3 years ago

AFAICT, the SPARQL test suite does not have any coverage of the 8-hex-digit ("4-byte") variant of codepoint escape sequences (\UXXXXXXXX). This could be a problem for implementations that defer escape handling to tools that only support the 4-hex-digit ("2-byte") variant, \uXXXX (such as, I believe, the JavaCC parser generator).
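To make the gap concrete, here is a minimal Python sketch of a decoder that handles both escape forms. It is only an illustration: the function name `decode_codepoint_escapes` is hypothetical and not taken from any SPARQL implementation, and it deliberately ignores everything else a real lexer would have to do.

```python
import re

# SPARQL codepoint escapes come in two forms:
#   \uXXXX      (exactly 4 hex digits)
#   \UXXXXXXXX  (exactly 8 hex digits)
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_codepoint_escapes(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they name.

    Hypothetical helper for illustration only. chr() raises ValueError for
    values above 0x10FFFF, so such escapes are rejected rather than decoded.
    Python itself does not reject surrogate values such as \\uD800 here; a
    real implementation would need its own check for those.
    """
    def replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(replace, text)
```

A decoder that only knew the `\uXXXX` form would leave `\U00010000` untouched (or fail on it), which is the behaviour the missing tests would catch.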

ericprud commented 3 years ago

While you're at it, how about bare codepoints themselves? Got any like https://github.com/shexSpec/shexTest/blob/master/schemas/1val1STRING_LITERAL1_with_UTF8_boundaries.shex ?

kasei commented 3 years ago

There are i18n tests in the original DAWG suite, but they similarly stay below U+FFFF.

ericprud commented 3 years ago

The strategy I took in ShEx (and some in Turtle before) was to test the boundaries of permissible characters and, iirc, the boundaries of UTF-8 representations. For instance, the above ShEx has the characters 0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xCFFF, 0xD000, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x3FFFD, 0x40000, 0xFFFFD, 0x100000, 0x10FFFD. The 0xD800-0xDFFF range is prohibited because those code points are reserved as surrogates, which UTF-16 uses in pairs to encode characters above 0xFFFF.

It also might be nice to reuse those names from the ShEx test suite because they have systematic names and the two languages have identical terminals.
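As a rough sketch of the boundary strategy described above, the following Python lists the same code points, reports how many bytes each takes in UTF-8, and renders each as a codepoint escape. The helper names (`utf8_length`, `as_escape`) are made up for this illustration and do not come from the ShEx or SPARQL test suites.

```python
# Boundary code points quoted in the comment above.
BOUNDARIES = [
    0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xCFFF, 0xD000, 0xD7FF,
    0xE000, 0xFFFD, 0x10000, 0x3FFFD, 0x40000, 0xFFFFD, 0x100000, 0x10FFFD,
]


def utf8_length(cp: int) -> int:
    """Number of bytes needed to encode code point cp in UTF-8."""
    return len(chr(cp).encode("utf-8"))


def as_escape(cp: int) -> str:
    """Render cp as \\uXXXX when it fits in 4 hex digits, else \\UXXXXXXXX."""
    return f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"


if __name__ == "__main__":
    for cp in BOUNDARIES:
        print(f"U+{cp:04X}  {utf8_length(cp)} bytes  {as_escape(cp)}")
```

The pairs 0x7FF/0x800 and 0xFFFD/0x10000 sit on either side of the 2-to-3-byte and 3-to-4-byte UTF-8 transitions, and everything from 0x10000 upward can only be written with the `\U` form.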

kasei commented 3 years ago

@ericprud I've never been clear on whether a partial surrogate pair codepoint was actually prohibited, or just nonsensical. The SPARQL grammar and spec text don't seem to give any guidance on this.

Testing boundaries would definitely be a good idea, though I'm not sure if all languages/environments are happy working with unassigned code points. Would be interested in hearing any experience people have with that sort of data.

lisp commented 3 years ago

> @ericprud I've never been clear on whether a partial surrogate pair codepoint was actually prohibited, or just nonsensical.

see: http://unicode.org/faq/utf_bom.html#utf16-7

kasei commented 3 years ago

> see: http://unicode.org/faq/utf_bom.html#utf16-7

Thanks, @lisp. I think I've read right past that before because it sits, strangely, under the "UTF-16 FAQ" section and is written in a way that seems (in my reading) to conflate 16-bit code units with something broadly applicable to all "UTFs". Re-reading the UTF-8 section, this one seems relevant:

https://unicode.org/faq/utf_bom.html#utf8-5

kasei commented 3 years ago

@ericprud 4a12627, a commit that is part of #67, adds syntax tests for Unicode boundaries and for the invalid use of a lone surrogate codepoint. What do you think?
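For readers wondering why a lone surrogate counts as invalid, here is a small Python sketch. It is not the content of the commit, and the function names are hypothetical; it only shows that values in 0xD800-0xDFFF are not Unicode scalar values and that Python's strict UTF-8 encoder refuses them.

```python
def is_unicode_scalar_value(cp: int) -> bool:
    """True if cp is a code point that is not in the surrogate range."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def can_encode_utf8(cp: int) -> bool:
    """True if Python's strict UTF-8 encoder accepts code point cp."""
    try:
        chr(cp).encode("utf-8")
    except ValueError:
        # chr() raises ValueError when cp is out of range, and encoding a
        # lone surrogate raises UnicodeEncodeError, a ValueError subclass.
        return False
    return True
```

Here `can_encode_utf8(0xD7FF)` and `can_encode_utf8(0xE000)` are true, while `can_encode_utf8(0xD800)` is false, matching the boundaries listed earlier in the thread. Other languages and environments may behave differently.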

ericprud commented 3 years ago

@kasei, I approved them, which might come off as arrogant but was intended as a procedural +1.