Closed: viluon closed this issue 4 years ago.
It would appear you've got a loop of:
When you use `with_prototype`, you create a copy of the entire graph of contexts, with the `with_prototype` rules prepended. Since contexts can be included in a complex chain of inclusion and reference, it is not simple to know when a proper combination of `with_prototype` rules has been applied.
It may seem obvious to you that the above recursion should use the same copy of `built-in-fn`, but it is also easy to come up with a chain that is not so obvious, such as one of the intermediate contexts having its own `with_prototype`. Then you'd need to create a new copy of each context with both `with_prototype` rule lists prepended.
This is why I say a general solution is not "simple".
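A minimal sketch of the kind of cycle being described, with hypothetical context and scope names (only `built-in-fn` is taken from the thread): the prototype is applied when `built-in-fn` is pushed, and `built-in-fn` includes a context that can push it again, so copies of the graph multiply.

```yaml
# Hypothetical sketch, names invented for illustration:
contexts:
  expr:
    - match: '\bfn\b'
      push: built-in-fn
      with_prototype:
        # prepended to every context reachable from built-in-fn
        - match: '\breturn\b'
          scope: keyword.control.flow.return.example
  built-in-fn:
    - include: expr   # recursion: each prototype application copies the graph
```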
However, in your context it isn't obvious to me that you even should be using `with_prototype` here. Can you really have `return` in the middle of an array literal or a comment? Typically I would expect to see an anonymous context with the `return` rule listed first, then something like `include: block`.
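A sketch of the suggested alternative, with invented match patterns and scope names: instead of `with_prototype`, push an anonymous context (an inline list of rules) that lists the `return` rule first and then includes the ordinary block rules.

```yaml
# Hypothetical sketch of the suggested approach:
- match: '\bfunction\b'
  scope: keyword.declaration.function.example
  push:
    # anonymous context: return is matched here, ahead of the block rules
    - match: '\breturn\b'
      scope: keyword.control.flow.return.example
    - include: block
```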
@wbond thanks for the quick reply.
You are correct about the misuse of `with_prototype` here; I should have read the docs more carefully. However, merely pushing an anonymous context with `include: block` doesn't result in the desired behaviour: `return` can occur in sub-expressions, which have a different context and thus won't be tagged with the desired scope.
Note that, despite the recursion, Luigi highlighting would always terminate in theory. I was not aware that the syntax definition is pre-compiled in a manner that prevents recursion. While your implementation is proprietary, if there's any chance you're compiling the syntax to a DFA, have you considered making `with_prototype` a special directive, adding its edges to the current node at runtime? (Basically, the set of edges leading from the current node becomes the union of its attached out-edges and the `with_prototype` rules in effect.)
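A toy model of the proposed semantics (this is not Sublime's engine; all names are invented): a state's effective rule list is the union of the run-time prototype rules currently in effect, prepended in order, and its compiled rules, with duplicates dropped.

```python
# Toy model of the runtime edge-union proposal; not Sublime's implementation.

class State:
    def __init__(self, rules):
        self.rules = rules  # list of (pattern, action) pairs

def effective_rules(state, prototype_stack):
    # Prototype rules in effect come first, then the compiled rules;
    # de-duplication keeps repeated prototypes from growing the list.
    merged = []
    seen = set()
    for rules in prototype_stack + [state.rules]:
        for pattern, action in rules:
            if pattern not in seen:
                seen.add(pattern)
                merged.append((pattern, action))
    return merged

block = State([("{", "push-block"), ("}", "pop")])
proto = [("return", "keyword.control")]
rules = effective_rules(block, [proto])  # return rule tried before block rules
```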
We likely won't offer the ability to add an arbitrary number of rules at run time for performance's sake, but the existing `embed` functionality does its specialized matching at run time rather than compile time.
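For reference, the existing `embed`/`escape` mechanism looks roughly like this (match pattern and embedded scope chosen for illustration): only the `escape` pattern is matched at run time, while everything inside the embedded syntax uses its own precompiled rules.

```yaml
# Illustrative embed/escape rule; patterns and scope are examples only.
- match: '<script>'
  embed: scope:source.js   # hand off to the embedded syntax's compiled rules
  escape: '</script>'      # the single pattern matched at run time
```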
We likely won't offer the ability to add an arbitrary number of rules at run-time for performance sake
I'd say limiting it to something like 32 rules and de-duplicating them (so that multiple encounters of the same `with_prototype` don't exhaust this limit) would probably suffice.
Run-time rules require Oniguruma, the fallback (read: slow) regex engine, because our optimized regex engine sacrifices compile time for improved run-time performance.
Additionally, the nature of `embed` means we are almost always matching on a simple rule, and whatever matches is used. As soon as you get into normal rules, you have to find the rule that matches at the next-closest offset. So we'd have to run the run-time rules and compare the resulting offsets with the precompiled ones. Since match rules are often not so simple, and are used far more frequently than `embed`, it is unlikely to be something we would do.
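The cost being described can be sketched as follows (a simplification using Python's `re` in place of either real engine, with invented rule patterns): with normal rules, run-time candidates can't be tried in isolation; every candidate's match offset must be compared against the precompiled candidates, and the match closest to the current position wins.

```python
# Simplified illustration of next-closest-offset matching; uses Python's
# `re` as a stand-in for the actual engines, with example patterns.
import re

def next_match(text, pos, compiled_rules, runtime_rules):
    best = None
    # Every rule, precompiled or run-time, must be searched and compared.
    for pattern in compiled_rules + runtime_rules:
        m = re.compile(pattern).search(text, pos)
        if m and (best is None or m.start() < best.start()):
            best = m
    return best

m = next_match("let x = 1; return x", 0, [r"\breturn\b"], [r"\blet\b"])
# the run-time rule wins here because it matches at the closer offset
```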
I see. You could theoretically keep multiple compiled versions of the syntax (with and without the `with_prototype` rules), but I see how that's impractical, especially with the amount of memory used being exponential in the number of `with_prototype` directives.
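The blow-up conceded here is easy to quantify: each directive can independently be in effect or not for a given compiled copy, so precompiling every combination doubles the number of variants per directive.

```python
# Each of n with_prototype directives is either applied or not in a
# given compiled copy, so all combinations give 2**n variants.
def compiled_variants(n_directives):
    return 2 ** n_directives

print(compiled_variants(10))  # 1024 precompiled copies for just 10 directives
```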
Just out of curiosity, approximately how many times slower is matching with Oniguruma compared to using the optimising compiler?
I believe in the past we've seen improvements of 30-50%, although I may be remembering incorrectly. There are probably some old comments on PRs related to https://github.com/sublimehq/Packages/issues/481.
@wbond ah, so the optimising compiler also uses backtracking? Maybe it'd be worthwhile implementing optimisations for a simpler set of regular expressions.
> ah, so the optimising compiler also uses backtracking?
No
> Maybe it'd be worthwhile implementing optimisations for a simpler set of regular expressions
Jon has definitely read that (he is the author of Sublime's custom regex engine) and our engine even has some unique elements that aren't implemented by other engines.
No
Interesting. From what I know from the linked article, I'd either expect your custom engine to run orders of magnitude faster than Oniguruma but not support backreferences, or support backreferences, use backtracking, and perform more or less equally (with the added benefit of possible compile-time optimisations on the regex). There must be a graph search approach well-suited for regex matching that I am unaware of; any pointers would be appreciated. I understand if such information is considered a trade secret, however.
The orders of magnitude you're thinking of only become relevant if there is a lot of backtracking to begin with, e.g. patterns with many quantifiers. Most patterns in syntax definitions do not rely on these, so the overall performance gain for these patterns will be limited but noticeable.
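Sublime's engine is proprietary, but the standard way to get quantifier-heavy patterns without backtracking blow-up is the set-of-states NFA simulation popularized by Russ Cox's articles (plausibly the linked article mentioned above). A minimal sketch with a hand-built NFA for the classic pattern `(a|b)*abb`: matching advances a *set* of live states per input character, so the cost is linear in input length no matter how many quantifiers the pattern has.

```python
# Set-of-states NFA simulation (Thompson-style); the NFA is hand-built
# for the pattern (a|b)*abb. State 3 accepts. This is an illustration of
# the general technique, not Sublime's actual engine.
NFA = {
    0: {"a": {0, 1}, "b": {0}},  # (a|b)* self-loop; 'a' may start "abb"
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
ACCEPT = {3}

def matches(text):
    states = {0}
    for ch in text:
        # Advance every live state at once -- no backtracking ever occurs.
        states = set().union(*(NFA[s].get(ch, set()) for s in states))
    return bool(states & ACCEPT)
```

Backreferences are the notable casualty of this approach: they take the language beyond what a finite automaton can recognize, which is why engines built this way typically drop them or fall back to a backtracking engine.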
@keith-hall while my syntax definition shouldn't have used `with_prototype` in the first place, I would consider leaking memory to be a bug. Perhaps this issue should be closed in favour of #1288, however.
@FichteFoll good point. So does your regex engine support backreferences?
Build 4075 adds protection to prevent recursion when using `with_prototype`.
Description
Using `with_prototype` in a syntax definition causes a severe memory leak, very similar to https://github.com/SublimeTextIssues/Core/issues/1288 (I have been instructed to open a new issue for this one). You can download the leaky syntax definition on Pastebin.
Steps to reproduce
# BUG: ...
Expected behavior
The editor wouldn't leak memory.
Actual behavior
Leaks at over 0.5 GB/s on my machine.
Environment
4.20.6-arch1-1-ARCH