jeff-hykin / better-cpp-syntax

💾 The source of VS Code's C++ syntax highlighting
GNU General Public License v3.0
155 stars 30 forks source link

Catastrophic backtracking in `function_definition` begin regex #579

Closed alexdima closed 4 months ago

alexdima commented 3 years ago

Coming via https://github.com/microsoft/vscode/issues/117264

The regex for function_definition takes >1s to evaluate at each step. This makes VS Code freeze for some minutes until it eventually recovers. If the regex would stop using \G, then we would cache its search results and even if it would still be slow, at least the cost would be 1s in total for the entire line. As it is right now, the usage of \G prohibits us from caching the search results, which means we need to re-search at each necessary offset (at each offset where tokenization advances).

the regex ``` (?:(?:^|\G|(?<=;|\}))|(?<=>))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\￿)|(?:\Z)))(?:((?(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\￿)|(?:\Z))))?((?:(?:(?:\[\[.*?\]\]|__attribute(?:__)?\s*\(\s*\(.*?\)\s*\))|__declspec\(.*?\))|alignas\(.*?\))(?!\)))?((?:((?(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\￿)|(?:\Z))))*)(\s*+((?:(?:(?:\[\[.*?\]\]|__attribute(?:__)?\s*\(\s*\(.*?\)\s*\))|__declspec\(.*?\))|alignas\(.*?\))(?!\)))?((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\￿)|(?:\Z)))(?:(?:(?:(?:unsigned)|(?:signed)|(?:short)|(?:long))|(?:(?:struct)|(?:class)|(?:union)|(?:enum)))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\￿)|(?:\Z))))*((?:::)?(?:(?!\b(?:__has_cpp_attribute|reinterpret_cast|atomic_noexcept|atomic_commit|atomic_cancel|__has_include|synchronized|dynamic_cast|thread_local|static_cast|const_cast|co_return|constexpr|constexpr|constexpr|co_return|protected|namespace|consteval|noexcept|decltype|template|operator|noexcept|co_yield|co_await|reflexpr|continue|co_await|co_yield|requires|volatile|register|restrict|explicit|volatile|noexcept|typename|default|_Pragma|mutable|include|concept|alignas|virtual|alignof|__asm__|defined|mutable|typedef|warning|private|and_eq|define|pragma|typeid|switch|bitand|return|ifndef|export|struct|sizeof|module|static|public|extern|inline|friend|delete|xor_eq|import|not_eq|class|compl|bitor|throw|or_eq|while|catch|break|union|const|const|endif|ifdef|undef|error|using|else|line|goto|else|elif|this|enum|case|new|asm|not|try|for|and|xor|or|if|do|if)\b)(?]*+|"(?:[^"]*|\\")")|'(?:[^']*|\\')')\g<60>?)+>)(?:\s)*+)?::)*+)?((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\￿)|(?:\Z)))(?!(?:(?:transaction_safe_dynamic)|(?:__has_cpp_attribute)|(?:reinterpret_cast)|(?:transaction_safe)|(?:atomic_noexcept)|(?:atomic_commit)|(?:atomic_cancel)|(?:__has_include)|(?:dynamic_cast)|(?:synchronized)|(?:thread_local)|(?:static_cast)|(?:const_cast)|(?:constexpr)|(?:consteval)|(?:co_return)|(?:co_return)|(?:constexpr)|(?:protected)|(?:constexpr)|(?:namespace)|(?:noexcept)|(?:typename)|(?:decltype)|(?:template)|(?:operator)|(?:noexcept)|(?:co_yield)|(?:co_await)|(?:continue)|(?:co_await)|(?:co_yield)|(?:volatile)|(?:register)|(?:restrict)|(?:explicit)|(?:override)|(?:volatile)|(?:reflexpr)|(?:noexcept)|(?:requires)|(?:alignas)|(?:typedef)|(?:nullptr)|(?:alignof)|(?:mutable)|(?:concept)|(?:virtual)|(?:defined)|(?:__asm__)|(?:include)|(?:_Pragma)|(?:mutable)|(?:default)|(?:warning)|(?:private)|(?:module)|(?:return)|(?:not_eq)|(?:xor_eq)|(?:and_eq)|(?:ifndef)|(?:pragma)|(?:export)|(?:import)|(?:sizeof)|(?:static)|(?:delete)|(?:public)|(?:define)|(?:extern)|(?:inline)|(?:typeid)|(?:switch)|(?:friend)|(?:bitand)|(?:false)|(?:compl)|(?:bitor)|(?:throw)|(?:or_eq)|(?:while)|(?:catch)|(?:break)|(?:const)|(?:final)|(?:const)|(?:endif)|(?:ifdef)|(?:undef)|(?:error)|(?:using)|(?:audit)|(?:axiom)|(?:line)|(?:else)|(?:elif)|(?:true)|(?:NULL)|(?:case)|(?:goto)|(?:else)|(?:this)|(?:new)|(?:asm)|(?:not)|(?:and)|(?:xor)|(?:try)|(?:for)|(?:if)|(?:do)|(?:or)|(?:if))\b)(?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*\b((?]*+|"(?:[^"]*|\\")")|'(?:[^']*|\\')')\g<60>?)+>)?(?![\w<:.]))(((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\￿)|(?:\Z)))?(?:(?:&|(?:\*))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\￿)|(?:\Z))))*(?:&|(?:\*)))?((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\￿)|(?:\Z)))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\￿)|(?:\Z)))((?:__cdecl|__clrcall|__stdcall|__fastcall|__thiscall|__vectorcall)?)((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\￿)|(?:\Z)))((::)?(?:(?!\b(?:__has_cpp_attribute|reinterpret_cast|atomic_noexcept|atomic_commit|atomic_cancel|__has_include|synchronized|dynamic_cast|thread_local|static_cast|const_cast|co_return|constexpr|constexpr|constexpr|co_return|protected|namespace|consteval|noexcept|decltype|template|operator|noexcept|co_yield|co_await|reflexpr|continue|co_await|co_yield|requires|volatile|register|restrict|explicit|volatile|noexcept|typename|default|_Pragma|mutable|include|concept|alignas|virtual|alignof|__asm__|defined|mutable|typedef|warning|private|and_eq|define|pragma|typeid|switch|bitand|return|ifndef|export|struct|sizeof|module|static|public|extern|inline|friend|delete|xor_eq|import|not_eq|class|compl|bitor|throw|or_eq|while|catch|break|union|const|const|endif|ifdef|undef|error|using|else|line|goto|else|elif|this|enum|case|new|asm|not|try|for|and|xor|or|if|do|if)\b)(?]*+|"(?:[^"]*|\\")")|'(?:[^']*|\\')')\g<60>?)+>)(?:\s)*+)?::)*\s*+)((?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*)\b(?(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\￿)|(?:\Z)))(?=\() ```
the test string ``` typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; ```

Here I have tested the regex at regex101.com and it leads to catastrophic backtracking: image

jeff-hykin commented 3 years ago

I'll test removing \G when I get a chance, although I'm pretty sure it will strongly effect correctness since it's not exactly a pattern that is just inserted casually.

I tested removing it, and substituting it, and doing either breaks 80% of all tests.

RedCMD commented 1 year ago

Can confirm function_definition causing the issue

Tho the lag seems to be constituted around variable declaration rather than the struct code image

int var;

int var; int var;

int var; int var; int var; int var;

int var; int var; int var; int var; int var; int var; int var; int var;

int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;

int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;

int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;

int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;

There is a double up of the comment handling group image Disabling it seems to improve performance by 2x but does not fix it entirely image image

I'm curious is (?>(?:\\s)+) not worse than \\s++? performance wise?

I too also had performance issues with (^|\\G) (in my own extension) https://github.com/RedCMD/TmLanguage-Syntax-Highlighter/commit/2b02db24c6c74d653bd99f8593ea235914e96bb6 I found splitting the rule into two different ones "match": "^..." and "match": "\\G..." was enough to fix my issue In some very specfic cases it was soo bad, that previously a 280,000 long line that could be parsed correctly, was reduced down to only 5000 characters being parsed correctly. a line from this repo actually https://github.com/textmate/c.tmbundle/blob/80c8e886b67227096a56aca6b92fe1587f76d603/Syntaxes/Platform.tmLanguage#L299 image

RedCMD commented 1 year ago

@jeff-hykin will the double up of the comment handling capture group be removed? I have found it to be causing 50% of the lag ((?:(?:(?:(?>(?:\\s)+)|(\\/\\*)((?:[^\\*]|(?:\\*)++[^\\/])*+((?:\\*)++\\/)))+)|(?:\\b)|(?=\\W)|(?<=\\W)|(?:\\A)|(?:\\Z))) https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L4144

jeff-hykin commented 1 year ago

Hmm @RedCMD I don't see that pattern snippet in the file you linked.

is (?>(?:\s)+) not worse than \s++?

Yes I think its worse, and the grammar generator should automatically convert (?>(?:\\s)+) to \\s++(along with other simplifcations like removing redundancy within capture groups). However, I did notice last month I was using the original version of the grammar generator (which didn't have the optimization) so I updated it and regerated the grammar. So the last handful of releases shouldn't contain any (?>(?:\\s)+) patterns

RedCMD commented 1 year ago

sorry. the exact snippet has changed a bit but it is still there ((?:(?:(?:\\s*+(\\/\\*)((?:[^\\*]++|\\*+(?!\\/))*+(\\*\\/))\\s*+)+)|(?:\\s++)|(?<=\\W)|(?=\\W)|^|(?:\\n?$)|(?:\\A)|(?:\\Z))) if you paste that twice in ctrl+f find, you will see it in destructor_inline, function_definition and operator_overload https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L2823-L2824 https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L4143-L4144 https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L7090-L7091

image notice the double up of the 4 sets of captures image

RedCMD commented 1 year ago

ahhhhh..... @jeff-hykin

I don't see that pattern snippet in the file you linked.

I see the problem now I've been confusing https://github.com/jeff-hykin/better-cpp-syntax/tree/master/syntaxes with https://github.com/jeff-hykin/better-cpp-syntax/tree/master/autogenerated syntaxes one is very much outdated

tho the issue is still present in the newer one

jeff-hykin commented 1 year ago

syntaxes one is very much outdated

Yeah, I should ask alexr if I can go ahead and delete syntaxes/. I'm pretty sure I can but I didn't want to potentially cause any breakages.

tho the issue is still present in the newer one

Alright @RedCMD , I think I found the source of the repeated pattern problem. Some composable patterns had std_space at the end, and others had it at the begining, so when they were used next to eachother they doubled up. I got rid of that, so take a look at the autogenerated/cpp.tmLanguage.json in this commit. Hopefully that change squashed all of the repeats.

RedCMD commented 1 year ago

still one left in operator_overload https://github.com/jeff-hykin/better-cpp-syntax/blob/190c42bd2b0ae514be32525e4b67122b52c1bb30/autogenerated/cpp.tmLanguage.json#L3595-L3596 image image

the generator really likes putting (?: ... ) groups around everything

jeff-hykin commented 1 year ago

the generator really likes putting (?: ... ) groups around everything

Yeah, its cause of modularity. Code like maybe( *any possible value* ) has to be generated first as (?: *any possible value* )? until its proven that unwrapping it to be *any possible value*? would behave equivlently. Right now its not doing full regex parsing, so theres a lot of stuff that could be unwrapped that isn't. For example, I don't think character classes are unwrapped so (?:[abc])? just stays as-is since it can't prove that unwrapping is safe.

We have the concept of "single_entity", which a and (?:abc) are a single entity, but abc is not And that determines if it needs to be wrapped or not. For example

If this funciton were more intelligent, it would get rid of any unnecessary wrappings: https://github.com/jeff-hykin/ruby_grammar_builder/blob/eb206ce1b1fbde4ce8013c67e262fdc2e7c8d2a1/main/lib/ruby_grammar_builder/util.rb#L53

I tried looking around for a generic regex minimizer/optimizer/uglifier a few years ago but didn't see anything of the sort.

jeff-hykin commented 1 year ago

Actually, looking at the source there were some obvious mistakes. I updated the library, and now there's 17,588 fewer chars in the generated grammar, and all the test output code is identical.

jeff-hykin commented 1 year ago

I know for a fact we optimized (?:a+)? to become a* but it looks like that also is not always happening

RedCMD commented 8 months ago

I can confirm the lag has been greatly reduced it no longer freezes VSCode, now only taking 1-2sec to tokenize

and with it being near instant for int var;

jeff-hykin commented 4 months ago

yep, same here. Looks like I can close this!