Open slevithan opened 5 days ago
Re-opening after additional research and editing my comment to add more clarifying details.
In addition to causing edge case bugs, this issue is also preventing the Hack grammar from running in Shiki when using its JS engine (which transpiles Oniguruma regexes to JS using Oniguruma-To-ES). Oniguruma-To-ES intentionally doesn't reproduce Oniguruma's bugs related to handling of unenclosed \xF5
through \xFF
, and instead throws for them (as Oniguruma does for \x80
through \xF4
).
Problem
This line in the Hack grammar is an invalid Oniguruma pattern. While it doesn't throw, it also doesn't match what is intended due to the use of encoded byte value
\xff
(which doesn't work the same as code point\x{ff}
in Oniguruma).The error in this pattern is in the negated range
[^...\\x7f-\\xff]
. From context, it's easy to figure out that the author meant for this to exclude, among other things, the range from code point U+007F to U+00FF. And although that's what it would do in nearly every modern regex engine, that's not exactly what it's doing in Oniguruma.In Oniguruma,
\xHH
matches an "encoded byte value", not a code point value like the enclosed version\x{...}
does. That means for values from0
to7F
,\xHH
and\x{HH}
work the same, but they diverge for values80
toFF
. With\xHH
above7F
, the token must be part of a valid encoded byte sequence. E.g. the three-byte sequence\xEF\xBB\xBF
in Oniguruma is equivalent to the single code point\x{FEFF}
in Oniguruma (and not the same as the three code points\xEF\xBB\xBF
in JavaScript and other regex flavors).\xFF
, on it's own, is not a valid encoded byte sequence. In Oniguruma, standalone invalid byte values\x80
through\xF4
throw. However, due to the peculiarities of Oniguruma's error handling,\xF5
through\xFF
do not throw, but they also don't match what's intended. This seems to be because bytesF5
throughFF
are unused in UTF-8 encoding. So instead of throwing in Oniguruma, these byte values don't match anything if used on their own, and they cause unintended/difficult-to-predict behavior in character class ranges, intersection, etc.Current behavior
If the pattern in the Hack grammar used e.g.
\x7f-\xf4
instead of\x7f-\xff
, it would be a clear error (and throw). But because it happens to use an encoded byte in the range ofF5
toFF
, the behavior is more complicated (but still wrong). Unlike how this pattern would behave in almost every other regex engine, it is preventing the negated character class from matching any code point aboveFF
, which does not seem to be intended.Fix
To fix this, the
\\xff
should be replaced with\\x{ff}
. It will then run correctly in Oniguruma.