slackhq / vscode-hack

Hack language & HHVM debugger support for Visual Studio Code
https://marketplace.visualstudio.com/items?itemName=pranayagarwal.vscode-hack
MIT License
76 stars 38 forks source link

Hack grammar contains invalid Oniguruma token `\xff` #159

Open slevithan opened 5 days ago

slevithan commented 5 days ago

Problem

This line in the Hack grammar is an invalid Oniguruma pattern. While it doesn't throw, it also doesn't match what is intended due to the use of encoded byte value \xff (which doesn't work the same as code point \x{ff} in Oniguruma).

The error in this pattern is in the negated range [^...\\x7f-\\xff]. From context, it's easy to figure out that the author meant for this to exclude, among other things, the range from code point U+007F to U+00FF. And although that's what it would do in nearly every modern regex engine, that's not exactly what it's doing in Oniguruma.

In Oniguruma, \xHH matches an "encoded byte value", not a code point value like the enclosed version \x{...} does. That means for values from 0 to 7F, \xHH and \x{HH} work the same, but they diverge for values 80 to FF. With \xHH above 7F, the token must be part of a valid encoded byte sequence. E.g. the three-byte sequence \xEF\xBB\xBF in Oniguruma is equivalent to the single code point \x{FEFF} in Oniguruma (and not the same as the three code points \xEF\xBB\xBF in JavaScript and other regex flavors).

\xFF, on it's own, is not a valid encoded byte sequence. In Oniguruma, standalone invalid byte values \x80 through \xF4 throw. However, due to the peculiarities of Oniguruma's error handling, \xF5 through \xFF do not throw, but they also don't match what's intended. This seems to be because bytes F5 through FF are unused in UTF-8 encoding. So instead of throwing in Oniguruma, these byte values don't match anything if used on their own, and they cause unintended/difficult-to-predict behavior in character class ranges, intersection, etc.

Current behavior

If the pattern in the Hack grammar used e.g. \x7f-\xf4 instead of \x7f-\xff, it would be a clear error (and throw). But because it happens to use an encoded byte in the range of F5 to FF, the behavior is more complicated (but still wrong). Unlike how this pattern would behave in almost every other regex engine, it is preventing the negated character class from matching any code point above FF, which does not seem to be intended.

Fix

To fix this, the \\xff should be replaced with \\x{ff}. It will then run correctly in Oniguruma.

slevithan commented 4 days ago

Re-opening after additional research and editing my comment to add more clarifying details.

slevithan commented 4 days ago

In addition to causing edge case bugs, this issue is also preventing the Hack grammar from running in Shiki when using its JS engine (which transpiles Oniguruma regexes to JS using Oniguruma-To-ES). Oniguruma-To-ES intentionally doesn't reproduce Oniguruma's bugs related to handling of unenclosed \xF5 through \xFF, and instead throws for them (as Oniguruma does for \x80 through \xF4).