slackhq / vscode-hack

Hack language & HHVM debugger support for Visual Studio Code
https://marketplace.visualstudio.com/items?itemName=pranayagarwal.vscode-hack
MIT License

[Syntax Highlighting] Invalid unicode regex match #78

Closed lildude closed 5 years ago

lildude commented 5 years ago

As with https://github.com/slackhq/vscode-hack/issues/76, our grammar compiler has found another error introduced in https://github.com/slackhq/vscode-hack/pull/72. This time it's an invalid unicode regex match:

Invalid regex in grammar: `source.hack` (in `syntaxes/hack.json`) contains a malformed regex (regex "`(?xi)([a-z_\x{7f}-\x{7fffffff}]`...": character value in \x{} or \o{} is too large (at offset 30))

... and ...

Invalid regex in grammar: `source.hack` (in `syntaxes/hack.json`) contains a malformed regex (regex "`(?i)[a-z_\x{7f}-\x{7fffffff}][a-`...": character value in \x{} or \o{} is too large (at offset 27))

The line numbers have been truncated, but they correspond to...

https://github.com/slackhq/vscode-hack/blob/62329f6b026a75f805daf701071df45ba09330a5/syntaxes/hack.json#L910

... and ...

https://github.com/slackhq/vscode-hack/blob/62329f6b026a75f805daf701071df45ba09330a5/syntaxes/hack.json#L918

... respectively.

I suspect the intent here was to cover all Unicode characters from 0x7F to the end of the range; however, 0x7FFFFFFF is no longer a valid code point in UTF-8. Since 2003, the maximum has been 0x10FFFF.

From https://en.wikipedia.org/wiki/UTF-8#History:

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.
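
To illustrate the bound, here is a minimal Python sketch. It is not the grammar compiler's regex engine, and the helper names (`is_valid_code_point`, `build_identifier_pattern`) are hypothetical. It only shows that 0x10FFFF is the highest usable code point and that a character class ending there still matches non-ASCII identifiers. Case-insensitivity is written out as `A-Za-z` rather than using a flag, which is a simplification of the original `(?i)` patterns.

```python
import re

# Highest code point permitted since RFC 3629 restricted UTF-8 in 2003.
MAX_CODE_POINT = 0x10FFFF


def is_valid_code_point(value: int) -> bool:
    """Return True if value lies in the range U+0000..U+10FFFF."""
    return 0 <= value <= MAX_CODE_POINT


def build_identifier_pattern(upper: int) -> "re.Pattern[str]":
    """Build an identifier regex whose non-ASCII range ends at `upper`.

    Hypothetical helper: it mirrors the shape of the grammar's pattern
    (a letter or underscore or any character >= 0x7F, then the same
    plus digits), but uses Python's `re` module for illustration only.
    """
    if not is_valid_code_point(upper):
        raise ValueError(f"0x{upper:X} is above the maximum 0x{MAX_CODE_POINT:X}")
    high = f"{chr(0x7F)}-{chr(upper)}"
    return re.compile(f"[A-Za-z_{high}][A-Za-z0-9_{high}]*")


if __name__ == "__main__":
    print(is_valid_code_point(0x7FFFFFFF))  # the old upper bound
    print(is_valid_code_point(0x10FFFF))    # the proposed upper bound
    pattern = build_identifier_pattern(MAX_CODE_POINT)
    print(bool(pattern.fullmatch("héllo_1")))
```

Under the same assumption, the corresponding change in the grammar would presumably be to replace `\x{7fffffff}` with `\x{10ffff}` in both patterns.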

PR coming up to implement this change.