Switch JS engine to Oniguruma-To-ES for much improved emulation

slevithan commented 3 weeks ago

I've just released Oniguruma-To-ES, based on our previous conversation #762. There's room for improvement, but it's already very solid, and it includes detailed documentation (and a demo page). Adopting it would put Shiki on a solid path to being able to use the JS engine by default in the future if you want to, and not have to label it as experimental.

Compatibility report

The JS regex engine compatibility report currently says 176 supported, 23 mismatched, and 15 unsupported. Running it with Oniguruma-To-ES v0.1 with "loose" accuracy gives: 178 supported, 18 mismatched, and 18 unsupported (but see below for why this should be higher).

return toRegExp(pattern, {
  accuracy: 'loose',
  global: true,
  hasIndices: true,
  tmGrammar: true,
})

(Note: "loose" makes its \G handling simply set sticky when it doesn't have an existing emulation strategy for the specific way \G is used, like oniguruma-to-js does. But even when this is on, Oniguruma-To-ES has a lot more strategies for precise \G emulation than the current JS engine.)

However, one problem with these numbers is that Oniguruma-To-ES is throwing for patterns with genuine Oniguruma errors (they include errors by their authors) that Shiki's current JS engine is unable to detect problems for (because they're not errors in JS regexes). In these cases, the current engine is just passing regexes that are invalid in Oniguruma through, whereas appropriate Oniguruma-To-ES errors are preventing this from happening. But vscode-textmate silently fails when the real Oniguruma throws an error! So to account for this difference, let's try the report with forgiving on:

Current engine: 178 supported, 36 mismatched, 0 unsupported.
Oniguruma-To-ES: 183 supported, 31 mismatched, 0 unsupported.

In fact, this is a more accurate representation of results. Because TextMate grammars fail silently for Oniguruma regexes that error, I think it would be appropriate to change the JS engine to always use forgiving mode, and maybe rename it as something like failOnError with default false. The "unsupported" and "mismatched" lists should be merged, although it would be good to to keep the failOnError option around for development (to help identify potential opportunities for improvement).

Another metric we have is the report's "diff" count. If you run both libraries with forgiving then sum the diff, it's 8750 for current Shiki vs 6556 for Oniguruma-To-ES (lower is better).

But even though these numbers are already better across the board, IMO this doesn't nearly tell the full story of how much better it is!

The compatibility report underrepresents gains

Compared to the existing oniguruma-to-js, the new Oniguruma-To-ES includes more than a hundred big and small improvements to accuracy and emulation robustness. Additional syntax is supported, more syntax and behavior differences between Oniguruma and JS are captured accurately (many are very subtle or complicated), and many things that oniguruma-to-js currently supports in a flawed way are fixed. Some of these improvements and many of the edge cases probably don't matter for the particular samples and/or regexes we're testing against, but knowing that their handling is in place gives a strong foundation that is less brittle and can be built on with confidence.

Although the current JS engine test/report setup is great and was super helpful, the simplicity of the samples makes it underrepresent the impact of both flaws and incremental gains in emulation accuracy. Shiki's current engine has been gamed in various ways to pass these tests, especially via incomplete/flawed emulation approaches that are good enough to pass for included samples, but wouldn't pass in some other cases. I think the current scores significantly underrepresent how well Oniguruma-To-ES would compare if given much more complex samples that really pushed the differences.

More details

This project was months of intense work, and some of the things it supports were wildly complicated to emulate in a way that covers all the edge cases exactly. But the result is better than I initially thought was possible. Even in this initial release, it's already very robust with its feature support and extremely accurate with its understanding of Oniguruma differences and nuances. Check out the detailed list of supported features in the readme. Also, probably most of the remaining mismatched/unsupported grammars can be supported in a robust and generalized way in future versions of Oniguruma-To-ES. The foundation is in place; they just need hard work.

I've focused on keeping the library size small so it works well in browsers, and although it's larger (17.6 KB minzip) than the current library, IMO its size is impressive given how robust it is. The small size was achieved in various ways, including by creating a totally custom compiler for it, and especially by avoiding the inclusion of heavyweight Unicode data (while still accurately supporting all aspects of Unicode and Unicode properties, including e.g. accurate Unicode case-folding for patterns with mixed case-sensitivity). I’m not aware of other libraries that have solved Unicode edge cases as robustly without full character data and thus maintained small size. Also, using Oniguruma-To-ES allows removing a lot of the "JS engine" code currently wrapped around oniguruma-to-js that does things like (^|\G) support and one-off recursive pattern replacements.

Target

You can choose between ES2018, ES2024, and ES2025. I'd highly recommend the default ES2024 (which requires Node.js 20 or any 2023-era browser; see compat table) because I want the best emulation possible. But using target ES2018, if you prefer, would only lose support for a handful of grammars.

Oniguruma-To-ES is ready for the future by already supporting ES2025 as a target. When you're eventually ready to switch to it (it requires Node.js 23 and is not yet supported by Safari), it will offer faster transpilation of some features, simpler generated regexes, and improved handling for case-insensitive backreferences to case-sensitive groups.

Validations

[X] Follow our Code of Conduct
[X] Read the Contributing Guide.
[X] Check that there isn't already an issue that request the same feature to avoid creating a duplicate.

slevithan commented 2 weeks ago

I created a list here of future improvements to Oniguruma support. A variety of things there should at least slightly improve grammar coverage.

slevithan commented 2 weeks ago

Heads up that I've edited the original comment to include a lot more details, including important information about target and Shiki's forgiving mode.

antfu commented 2 weeks ago

Wow, that's a super impressive work! Thanks a lot for taking time working on it! 💚

It seems you already have the code for testing, would you mind to send a PR directly for that change? I would be more than happy to deprecate my package in favor of that!

I think it would be appropriate to change the JS engine to always use forgiving mode

Yeah, that's true if we are thinking to align with Oniguruma. The reason I didn't do that is I was trying to push the grammar authors to be aware of their syntax errors (which I didn't have time to report upstream tho). If we make it forgiving by default, I am worried if they would never ever get fixed. I think we could keep it as-is, while mention that in the docs so people would decide which behavior they want to have.

slevithan commented 2 weeks ago

Wow, that's a super impressive work! Thanks a lot for taking time working on it! 💚

Thanks! It was much more work than I anticipated and I really need to take a bit of a break now. 🥱

It seems you already have the code for testing, would you mind to send a PR directly for that change? I would be more than happy to deprecate my package in favor of that!

No problem: #832.

I think it would be appropriate to change the JS engine to always use forgiving mode

Yeah, that's true if we are thinking to align with Oniguruma. The reason I didn't do that is I was trying to push the grammar authors to be aware of their syntax errors (which I didn't have time to report upstream tho). If we make it forgiving by default, I am worried if they would never ever get fixed. I think we could keep it as-is, while mention that in the docs so people would decide which behavior they want to have.

That's fair. Thanks for the explanation.

slevithan commented 2 weeks ago

I would be more than happy to deprecate my package in favor of that!

PS: This project wouldn't have happened without your oniguruma-to-js (and our previous work to include regex in it). Also, the fact that you put a lot of work into getting the report numbers up pushed me to have to do better with the new compiler in order to compete. 😊

slevithan commented 2 weeks ago

Current engine

Default: 176 supported, 23 mismatched, and 15 unsupported.
Forgiving: 178 supported, 36 mismatched, 0 unsupported.
Total diff for mismatched grammars (with forgiving): 8750.

Oniguruma-To-ES v0.1.1 (previously reported)

Default: 178 supported, 18 mismatched, and 18 unsupported.
Forgiving: 183 supported, 31 mismatched, 0 unsupported.
Total diff for mismatched grammars (with forgiving): 6556.

Oniguruma-To-ES v0.3.0 (new)

Default: 181 supported, 20 mismatched, and 13 unsupported.
Forgiving: 185 supported, 29 mismatched, 0 unsupported.
Total diff for mismatched grammars (with forgiving): 6529

shikijs / shiki