Closed slevithan closed 1 week ago
I created a list here of future improvements to Oniguruma support. A variety of things there should at least slightly improve grammar coverage.
Heads up that I've edited the original comment to include a lot more details, including important information about target
and Shiki's forgiving
mode.
Wow, that's a super impressive work! Thanks a lot for taking time working on it! 💚
It seems you already have the code for testing, would you mind to send a PR directly for that change? I would be more than happy to deprecate my package in favor of that!
I think it would be appropriate to change the JS engine to always use forgiving mode
Yeah, that's true if we are thinking to align with Oniguruma. The reason I didn't do that is I was trying to push the grammar authors to be aware of their syntax errors (which I didn't have time to report upstream tho). If we make it forgiving
by default, I am worried if they would never ever get fixed. I think we could keep it as-is, while mention that in the docs so people would decide which behavior they want to have.
Wow, that's a super impressive work! Thanks a lot for taking time working on it! 💚
Thanks! It was much more work than I anticipated and I really need to take a bit of a break now. 🥱
It seems you already have the code for testing, would you mind to send a PR directly for that change? I would be more than happy to deprecate my package in favor of that!
No problem: #832.
I think it would be appropriate to change the JS engine to always use forgiving mode
Yeah, that's true if we are thinking to align with Oniguruma. The reason I didn't do that is I was trying to push the grammar authors to be aware of their syntax errors (which I didn't have time to report upstream tho). If we make it
forgiving
by default, I am worried if they would never ever get fixed. I think we could keep it as-is, while mention that in the docs so people would decide which behavior they want to have.
That's fair. Thanks for the explanation.
I would be more than happy to deprecate my package in favor of that!
PS: This project wouldn't have happened without your oniguruma-to-js
(and our previous work to include regex
in it). Also, the fact that you put a lot of work into getting the report numbers up pushed me to have to do better with the new compiler in order to compete. 😊
Current engine
Oniguruma-To-ES v0.1.1 (previously reported)
Oniguruma-To-ES v0.3.0 (new)
I've just released Oniguruma-To-ES, based on our previous conversation #762. There's room for improvement, but it's already very solid, and it includes detailed documentation (and a demo page). Adopting it would put Shiki on a solid path to being able to use the JS engine by default in the future if you want to, and not have to label it as experimental.
Compatibility report
The JS regex engine compatibility report currently says 176 supported, 23 mismatched, and 15 unsupported. Running it with Oniguruma-To-ES v0.1 with "loose"
accuracy
gives: 178 supported, 18 mismatched, and 18 unsupported (but see below for why this should be higher).(Note: "loose" makes its
\G
handling simply setsticky
when it doesn't have an existing emulation strategy for the specific way\G
is used, like oniguruma-to-js does. But even when this is on, Oniguruma-To-ES has a lot more strategies for precise\G
emulation than the current JS engine.)However, one problem with these numbers is that Oniguruma-To-ES is throwing for patterns with genuine Oniguruma errors (they include errors by their authors) that Shiki's current JS engine is unable to detect problems for (because they're not errors in JS regexes). In these cases, the current engine is just passing regexes that are invalid in Oniguruma through, whereas appropriate Oniguruma-To-ES errors are preventing this from happening. But vscode-textmate silently fails when the real Oniguruma throws an error! So to account for this difference, let's try the report with
forgiving
on:In fact, this is a more accurate representation of results. Because TextMate grammars fail silently for Oniguruma regexes that error, I think it would be appropriate to change the JS engine to always use
forgiving
mode, and maybe rename it as something likefailOnError
with defaultfalse
. The "unsupported" and "mismatched" lists should be merged, although it would be good to to keep thefailOnError
option around for development (to help identify potential opportunities for improvement).Another metric we have is the report's "diff" count. If you run both libraries with
forgiving
then sum thediff
, it's 8750 for current Shiki vs 6556 for Oniguruma-To-ES (lower is better).But even though these numbers are already better across the board, IMO this doesn't nearly tell the full story of how much better it is!
The compatibility report underrepresents gains
Compared to the existing oniguruma-to-js, the new Oniguruma-To-ES includes more than a hundred big and small improvements to accuracy and emulation robustness. Additional syntax is supported, more syntax and behavior differences between Oniguruma and JS are captured accurately (many are very subtle or complicated), and many things that oniguruma-to-js currently supports in a flawed way are fixed. Some of these improvements and many of the edge cases probably don't matter for the particular samples and/or regexes we're testing against, but knowing that their handling is in place gives a strong foundation that is less brittle and can be built on with confidence.
Although the current JS engine test/report setup is great and was super helpful, the simplicity of the samples makes it underrepresent the impact of both flaws and incremental gains in emulation accuracy. Shiki's current engine has been gamed in various ways to pass these tests, especially via incomplete/flawed emulation approaches that are good enough to pass for included samples, but wouldn't pass in some other cases. I think the current scores significantly underrepresent how well Oniguruma-To-ES would compare if given much more complex samples that really pushed the differences.
More details
This project was months of intense work, and some of the things it supports were wildly complicated to emulate in a way that covers all the edge cases exactly. But the result is better than I initially thought was possible. Even in this initial release, it's already very robust with its feature support and extremely accurate with its understanding of Oniguruma differences and nuances. Check out the detailed list of supported features in the readme. Also, probably most of the remaining mismatched/unsupported grammars can be supported in a robust and generalized way in future versions of Oniguruma-To-ES. The foundation is in place; they just need hard work.
I've focused on keeping the library size small so it works well in browsers, and although it's larger (17.6 KB minzip) than the current library, IMO its size is impressive given how robust it is. The small size was achieved in various ways, including by creating a totally custom compiler for it, and especially by avoiding the inclusion of heavyweight Unicode data (while still accurately supporting all aspects of Unicode and Unicode properties, including e.g. accurate Unicode case-folding for patterns with mixed case-sensitivity). I’m not aware of other libraries that have solved Unicode edge cases as robustly without full character data and thus maintained small size. Also, using Oniguruma-To-ES allows removing a lot of the "JS engine" code currently wrapped around oniguruma-to-js that does things like
(^|\G)
support and one-off recursive pattern replacements.Target
You can choose between
ES2018
,ES2024
, andES2025
. I'd highly recommend the defaultES2024
(which requires Node.js 20 or any 2023-era browser; see compat table) because I want the best emulation possible. But using targetES2018
, if you prefer, would only lose support for a handful of grammars.Oniguruma-To-ES is ready for the future by already supporting
ES2025
as a target. When you're eventually ready to switch to it (it requires Node.js 23 and is not yet supported by Safari), it will offer faster transpilation of some features, simpler generated regexes, and improved handling for case-insensitive backreferences to case-sensitive groups.Validations