slackhq / vscode-hack

Hack language & HHVM debugger support for Visual Studio Code
https://marketplace.visualstudio.com/items?itemName=pranayagarwal.vscode-hack
MIT License

Use newer max unicode of 0x10ffff #79

Closed lildude closed 5 years ago

lildude commented 5 years ago

As detailed in https://github.com/slackhq/vscode-hack/issues/78, https://github.com/slackhq/vscode-hack/pull/72 introduced another error picked up by our grammar compiler. This time it's a regex containing an out-of-range Unicode escape:

Invalid regex in grammar: `source.hack` (in `syntaxes/hack.json`) contains a malformed regex (regex "`(?xi)([a-z_\x{7f}-\x{7fffffff}]`...": character value in \x{} or \o{} is too large (at offset 30))

... and ...

Invalid regex in grammar: `source.hack` (in `syntaxes/hack.json`) contains a malformed regex (regex "`(?i)[a-z_\x{7f}-\x{7fffffff}][a-`...": character value in \x{} or \o{} is too large (at offset 27))

The line numbers have been truncated, but they correspond to...

https://github.com/slackhq/vscode-hack/blob/62329f6b026a75f805daf701071df45ba09330a5/syntaxes/hack.json#L910

... and ...

https://github.com/slackhq/vscode-hack/blob/62329f6b026a75f805daf701071df45ba09330a5/syntaxes/hack.json#L918

... respectively.

I suspect the intent here was to cover all Unicode characters from 0x7F to the end of the range; however, 0x7FFFFFFF is no longer a valid code point in UTF-8. Since 2003, the maximum is 0x10FFFF.

From https://en.wikipedia.org/wiki/UTF-8#History:

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.
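
As a quick illustration of that limit, here is a minimal Python sketch. Python is used only for demonstration and is not part of this repo; `is_valid_code_point` is a hypothetical helper name. It shows that 0x10FFFF is accepted as a code point while 0x7FFFFFFF is not:

```python
import sys

# Upper bound of the Unicode code space, as fixed for UTF-8 by RFC 3629.
MAX_CODE_POINT = 0x10FFFF


def is_valid_code_point(value: int) -> bool:
    """Return True if value lies inside the Unicode code space (0..0x10FFFF)."""
    return 0 <= value <= MAX_CODE_POINT


def can_make_char(value: int) -> bool:
    """Return True if Python's chr() accepts value as a code point."""
    try:
        chr(value)
        return True
    except (ValueError, OverflowError):
        return False


if __name__ == "__main__":
    # Python 3.3+ reports the same maximum through sys.maxunicode.
    print(hex(sys.maxunicode))
    print(is_valid_code_point(0x10FFFF), can_make_char(0x10FFFF))
    print(is_valid_code_point(0x7FFFFFFF), can_make_char(0x7FFFFFFF))
```

The grammar compiler's regex engine rejects `\x{7fffffff}` for the same reason: the value is above the largest code point.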

This PR addresses this by replacing \x{7fffffff} with \x{10ffff}.
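
For anyone wanting to check other patterns for the same problem, here is a rough sketch of how such escapes could be found and clamped. This is not what the PR does (the change here is a plain text substitution in `syntaxes/hack.json`). The helper names `oversized_escapes` and `clamp_escapes` are hypothetical, and the sample pattern is just the fragment visible in the error message above. The sketch assumes the pattern has already been decoded from JSON (single backslashes) and does not try to handle escaped backslashes such as `\\x{...}`.

```python
import re
from typing import List

MAX_CODE_POINT = 0x10FFFF

# Matches \x{HEX} escapes and captures the hex digits.
HEX_ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")


def oversized_escapes(pattern: str) -> List[str]:
    """List every \\x{...} escape whose value exceeds 0x10FFFF."""
    return [
        m.group(0)
        for m in HEX_ESCAPE.finditer(pattern)
        if int(m.group(1), 16) > MAX_CODE_POINT
    ]


def clamp_escapes(pattern: str) -> str:
    """Rewrite oversized \\x{...} escapes to \\x{10ffff}; leave others alone."""

    def repl(m: "re.Match[str]") -> str:
        if int(m.group(1), 16) > MAX_CODE_POINT:
            return r"\x{10ffff}"
        return m.group(0)

    return HEX_ESCAPE.sub(repl, pattern)


if __name__ == "__main__":
    sample = r"[a-z_\x{7f}-\x{7fffffff}]"
    print(oversized_escapes(sample))
    print(clamp_escapes(sample))
```

Clamping to `\x{10ffff}` keeps the apparent intent of the range (everything from 0x7F upward) while staying within what the regex engine accepts.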

Fixes https://github.com/slackhq/vscode-hack/issues/78