Syntax using with_prototype leaks memory

viluon commented 5 years ago

Description

Using with_prototype in a syntax definition causes a severe memory leak very similar to https://github.com/SublimeTextIssues/Core/issues/1288 (I have been instructed to open a new issue for this one).

You can download the leaky syntax definition on Pastebin.

Steps to reproduce

Install the provided syntax definition
Open a new file
Set syntax to Luigi
Uncomment the part of the syntax definition with the comment # BUG: ...
Save
Watch ST3 stall and eat your RAM very quickly

Expected behavior

The editor wouldn't leak memory.

Actual behavior

Leaks at over 0.5 GB/s on my machine.

Environment

Build: 3176, 3200
Operating system and version: Arch Linux, kernel 4.20.6-arch1-1-ARCH
[Linux] Desktop Environment and/or Window Manager: swaywm

wbond commented 5 years ago

It would appear you've got a loop of:

built-in-fn
sets to (anonymous copy of) block
includes expression
expression include built-in-macros
built-in-macros pushes into built-in-fn
repeat

When you use with_prototype, you create a copy of the entire graph of contexts, with the with_prototype rules prepended. Since contexts can be included in a complex chain of inclusion and reference, it is not simple, to know when a proper combination of with_prototype rules has been applied.

It may seem obvious to you that the above recursion should use the same copy of built-in-fn, but it is also easy to come up with a chain that is not so obvious, such as one of the intermediate context having its own with_prototype. Then you'd need to create a new copy of each context with both with_prototype list of rules prepended.

This is why I say a general solution is not "simple".

However, in your context it isn't obvious to me that you even should be using with_prototype here. Can you really have return in the middle of an array literal or a comment? Typically I would expect to see an anonymous context with the return rule listed first, then something like include: block.

viluon commented 5 years ago

@wbond thanks for the quick reply.

You are correct about the misuse of with_prototype here, I should have read the docs more carefully. However, merely pushing an anonymous context with include: block doesn't result in the desired behaviour: return can occur in sub-expressions which have a different context and thus won't be tagged with the desired scope.

Note that, despite the recursion, Luigi highlighting would always terminate in theory. I was not aware that the syntax definition is pre-compiled in a manner that prevents recursion. While your implementation is proprietary, if there's any chance you're compiling the syntax to a DFA, have you considered making with_prototype a special directive, adding its edges to the current node at runtime? (Basically the set of edges leading from the current node becomes the union of its attached out-edges and the with_prototype rules in effect).

wbond commented 5 years ago

We likely won't offer the ability to add an arbitrary number of rules at run-time for performance sake, but the existing embed functionality does its specialized matching at run time rather than compile time.

viluon commented 5 years ago

We likely won't offer the ability to add an arbitrary number of rules at run-time for performance sake

I'd say limiting to something like 32 and de-duplicating rules (so that multiple encounters of the same with_prototype don't reach this limit) would probably suffice.

wbond commented 5 years ago

Run time rules require Oniguruma, the fallback (read slow) regex engine, because our optimized regex engine sacrifices compile time for improved run time performance.

Additionally, the nature of embed means we almost always are matching on a simple rule, and whatever matches is used. As soon as you get into normal rules, you have to find the rule that matches with the next-closest offset. So we'd have to run the run-time rules and compare the resulting offsets with the precompiled ones. Since match rules are often not so simple, and are used far more frequently than embed, it is unlikely to be something we would do.

viluon commented 5 years ago

I see. You could theoretically keep multiple compiled versions of the syntax (with and without with_prototype) rules, but I see how that's impractical, especially with the amount of memory used being exponential in the number of with_prototype directives.

Just out of curiosity, approximately how many times slower is matching with Oniguruma compared to using the optimising compiler?

wbond commented 5 years ago

I believe in the past we've seen improvements of 30-50%, although I may be remembering incorrectly. There are probably some old comments on PRs related to https://github.com/sublimehq/Packages/issues/481.

viluon commented 5 years ago

@wbond ah, so the optimising compiler also uses backtracking? Maybe it'd be worthwhile implementing optimisations for a simpler set of regular expressions.

wbond commented 5 years ago

ah, so the optimising compiler also uses backtracking?

No

Maybe it'd be worthwhile implementing optimisations for a simpler set of regular expressions

Jon has definitely read that (he is the author of Sublime's custom regex engine) and our engine even has some unique elements that aren't implemented by other engines.

viluon commented 5 years ago

No

Interesting. From what I know from the linked article, I'd either expect your custom engine to run orders of magnitude faster than Oniguruma but not support backreferences, or support backreferences, use backtracking, and perform more or less equally (with the added benefit of possible compile-time optimisations on the regex). There must be a graph search approach well-suited for regex matching that I am unaware of; any pointers would be appreciated. I understand if such information is considered a trade secret, however.

FichteFoll commented 5 years ago

The orders of magnitude you're thinking of only become relevant if there is a lot of backtracking to begin with, e.g. patterns with many quantifiers. Most patterns in syntax definitions do not rely on these, so the overall performance gain for these patterns will be limited but noticeable.

viluon commented 5 years ago

@keith-hall while my syntax definition shouldn't have used with_prototype in the first place, I would consider leaking memory to be a bug. Perhaps this issue should be closed in favour of #1288, however.

@FichteFoll good point. So does your regex engine support backreferences?

wbond commented 4 years ago

Build 4075 adds protection to prevent recursion when using with_prototype.

sublimehq / sublime_text