Closed alexdima closed 4 months ago
I'll test removing \G when I get a chance, although I'm pretty sure it will strongly effect correctness since it's not exactly a pattern that is just inserted casually.
I tested removing it, and substituting it, and doing either breaks 80% of all tests.
Can confirm function_definition
causing the issue
Tho the lag seems to be constituted around variable declaration rather than the struct code
int var;
int var; int var;
int var; int var; int var; int var;
int var; int var; int var; int var; int var; int var; int var; int var;
int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;
int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;
int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;
int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;
There is a double up of the comment handling group Disabling it seems to improve performance by 2x but does not fix it entirely
I'm curious
is (?>(?:\\s)+)
not worse than \\s++
?
performance wise?
I too also had performance issues with (^|\\G)
(in my own extension)
https://github.com/RedCMD/TmLanguage-Syntax-Highlighter/commit/2b02db24c6c74d653bd99f8593ea235914e96bb6
I found splitting the rule into two different ones "match": "^..."
and "match": "\\G..."
was enough to fix my issue
In some very specfic cases it was soo bad, that previously a 280,000 long line that could be parsed correctly, was reduced down to only 5000 characters being parsed correctly.
a line from this repo actually https://github.com/textmate/c.tmbundle/blob/80c8e886b67227096a56aca6b92fe1587f76d603/Syntaxes/Platform.tmLanguage#L299
@jeff-hykin will the double up of the comment handling capture group be removed?
I have found it to be causing 50% of the lag
((?:(?:(?:(?>(?:\\s)+)|(\\/\\*)((?:[^\\*]|(?:\\*)++[^\\/])*+((?:\\*)++\\/)))+)|(?:\\b)|(?=\\W)|(?<=\\W)|(?:\\A)|(?:\\Z)))
https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L4144
Hmm @RedCMD I don't see that pattern snippet in the file you linked.
is (?>(?:\s)+) not worse than \s++?
Yes I think its worse, and the grammar generator should automatically convert (?>(?:\\s)+)
to \\s++
(along with other simplifcations like removing redundancy within capture groups).
However, I did notice last month I was using the original version of the grammar generator (which didn't have the optimization) so I updated it and regerated the grammar. So the last handful of releases shouldn't contain any (?>(?:\\s)+)
patterns
sorry. the exact snippet has changed a bit
but it is still there
((?:(?:(?:\\s*+(\\/\\*)((?:[^\\*]++|\\*+(?!\\/))*+(\\*\\/))\\s*+)+)|(?:\\s++)|(?<=\\W)|(?=\\W)|^|(?:\\n?$)|(?:\\A)|(?:\\Z)))
if you paste that twice in ctrl+f find, you will see it in destructor_inline
, function_definition
and operator_overload
https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L2823-L2824
https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L4143-L4144
https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L7090-L7091
notice the double up of the 4 sets of captures
ahhhhh..... @jeff-hykin
I don't see that pattern snippet in the file you linked.
I see the problem now
I've been confusing https://github.com/jeff-hykin/better-cpp-syntax/tree/master/syntaxes with https://github.com/jeff-hykin/better-cpp-syntax/tree/master/autogenerated
syntaxes
one is very much outdated
tho the issue is still present in the newer one
syntaxes one is very much outdated
Yeah, I should ask alexr if I can go ahead and delete syntaxes/
. I'm pretty sure I can but I didn't want to potentially cause any breakages.
tho the issue is still present in the newer one
Alright @RedCMD , I think I found the source of the repeated pattern problem. Some composable patterns had std_space
at the end, and others had it at the begining, so when they were used next to eachother they doubled up. I got rid of that, so take a look at the autogenerated/cpp.tmLanguage.json
in this commit. Hopefully that change squashed all of the repeats.
still one left in operator_overload
https://github.com/jeff-hykin/better-cpp-syntax/blob/190c42bd2b0ae514be32525e4b67122b52c1bb30/autogenerated/cpp.tmLanguage.json#L3595-L3596
the generator really likes putting (?: ... )
groups around everything
the generator really likes putting (?: ... ) groups around everything
Yeah, its cause of modularity. Code like maybe( *any possible value* )
has to be generated first as (?: *any possible value* )?
until its proven that unwrapping it to be *any possible value*?
would behave equivlently. Right now its not doing full regex parsing, so theres a lot of stuff that could be unwrapped that isn't. For example, I don't think character classes are unwrapped so (?:[abc])?
just stays as-is since it can't prove that unwrapping is safe.
We have the concept of "single_entity", which a
and (?:abc)
are a single entity, but abc
is not
And that determines if it needs to be wrapped or not. For example
If this funciton were more intelligent, it would get rid of any unnecessary wrappings: https://github.com/jeff-hykin/ruby_grammar_builder/blob/eb206ce1b1fbde4ce8013c67e262fdc2e7c8d2a1/main/lib/ruby_grammar_builder/util.rb#L53
I tried looking around for a generic regex minimizer/optimizer/uglifier a few years ago but didn't see anything of the sort.
Actually, looking at the source there were some obvious mistakes. I updated the library, and now there's 17,588
fewer chars in the generated grammar, and all the test output code is identical.
I know for a fact we optimized (?:a+)?
to become a*
but it looks like that also is not always happening
I can confirm the lag has been greatly reduced it no longer freezes VSCode, now only taking 1-2sec to tokenize
and with it being near instant for int var;
yep, same here. Looks like I can close this!
Coming via https://github.com/microsoft/vscode/issues/117264
The regex for
function_definition
takes >1s to evaluate at each step. This makes VS Code freeze for some minutes until it eventually recovers. If the regex would stop using\G
, then we would cache its search results and even if it would still be slow, at least the cost would be 1s in total for the entire line. As it is right now, the usage of\G
prohibits us from caching the search results, which means we need to re-search at each necessary offset (at each offset where tokenization advances).the regex
``` (?:(?:^|\G|(?<=;|\}))|(?<=>))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))(?:((?(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z))))?((?:(?:(?:\[\[.*?\]\]|__attribute(?:__)?\s*\(\s*\(.*?\)\s*\))|__declspec\(.*?\))|alignas\(.*?\))(?!\)))?((?:((?(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z))))*)(\s*+((?:(?:(?:\[\[.*?\]\]|__attribute(?:__)?\s*\(\s*\(.*?\)\s*\))|__declspec\(.*?\))|alignas\(.*?\))(?!\)))?((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))(?:(?:(?:(?:unsigned)|(?:signed)|(?:short)|(?:long))|(?:(?:struct)|(?:class)|(?:union)|(?:enum)))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z))))*((?:::)?(?:(?!\b(?:__has_cpp_attribute|reinterpret_cast|atomic_noexcept|atomic_commit|atomic_cancel|__has_include|synchronized|dynamic_cast|thread_local|static_cast|const_cast|co_return|constexpr|constexpr|constexpr|co_return|protected|namespace|consteval|noexcept|decltype|template|operator|noexcept|co_yield|co_await|reflexpr|continue|co_await|co_yield|requires|volatile|register|restrict|explicit|volatile|noexcept|typename|default|_Pragma|mutable|include|concept|alignas|virtual|alignof|__asm__|defined|mutable|typedef|warning|private|and_eq|define|pragma|typeid|switch|bitand|return|ifndef|export|struct|sizeof|module|static|public|extern|inline|friend|delete|xor_eq|import|not_eq|class|compl|bitor|throw|or_eq|while|catch|break|union|const|const|endif|ifdef|undef|error|using|else|line|goto|else|elif|this|enum|case|new|asm|not|try|for|and|xor|or|if|do|if)\b)(?]*+|"(?:[^"]*|\\")")|'(?:[^']*|\\')')\g<60>?)+>)(?:\s)*+)?::)*+)?((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))(?!(?:(?:transaction_safe_dynamic)|(?:__has_cpp_attribute)|(?:reinterpret_cast)|(?:transaction_safe)|(?:atomic_noexcept)|(?:atomic_commit)|(?:atomic_cancel)|(?:__has_include)|(?:dynamic_cast)|(?:synchronized)|(?:thread_local)|(?:static_cast)|(?:const_cast)|(?:constexpr)|(?:consteval)|(?:co_return)|(?:co_return)|(?:constexpr)|(?:protected)|(?:constexpr)|(?:namespace)|(?:noexcept)|(?:typename)|(?:decltype)|(?:template)|(?:operator)|(?:noexcept)|(?:co_yield)|(?:co_await)|(?:continue)|(?:co_await)|(?:co_yield)|(?:volatile)|(?:register)|(?:restrict)|(?:explicit)|(?:override)|(?:volatile)|(?:reflexpr)|(?:noexcept)|(?:requires)|(?:alignas)|(?:typedef)|(?:nullptr)|(?:alignof)|(?:mutable)|(?:concept)|(?:virtual)|(?:defined)|(?:__asm__)|(?:include)|(?:_Pragma)|(?:mutable)|(?:default)|(?:warning)|(?:private)|(?:module)|(?:return)|(?:not_eq)|(?:xor_eq)|(?:and_eq)|(?:ifndef)|(?:pragma)|(?:export)|(?:import)|(?:sizeof)|(?:static)|(?:delete)|(?:public)|(?:define)|(?:extern)|(?:inline)|(?:typeid)|(?:switch)|(?:friend)|(?:bitand)|(?:false)|(?:compl)|(?:bitor)|(?:throw)|(?:or_eq)|(?:while)|(?:catch)|(?:break)|(?:const)|(?:final)|(?:const)|(?:endif)|(?:ifdef)|(?:undef)|(?:error)|(?:using)|(?:audit)|(?:axiom)|(?:line)|(?:else)|(?:elif)|(?:true)|(?:NULL)|(?:case)|(?:goto)|(?:else)|(?:this)|(?:new)|(?:asm)|(?:not)|(?:and)|(?:xor)|(?:try)|(?:for)|(?:if)|(?:do)|(?:or)|(?:if))\b)(?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*\b((?]*+|"(?:[^"]*|\\")")|'(?:[^']*|\\')')\g<60>?)+>)?(?![\w<:.]))(((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))?(?:(?:&|(?:\*))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z))))*(?:&|(?:\*)))?((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))((?:__cdecl|__clrcall|__stdcall|__fastcall|__thiscall|__vectorcall)?)((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))((::)?(?:(?!\b(?:__has_cpp_attribute|reinterpret_cast|atomic_noexcept|atomic_commit|atomic_cancel|__has_include|synchronized|dynamic_cast|thread_local|static_cast|const_cast|co_return|constexpr|constexpr|constexpr|co_return|protected|namespace|consteval|noexcept|decltype|template|operator|noexcept|co_yield|co_await|reflexpr|continue|co_await|co_yield|requires|volatile|register|restrict|explicit|volatile|noexcept|typename|default|_Pragma|mutable|include|concept|alignas|virtual|alignof|__asm__|defined|mutable|typedef|warning|private|and_eq|define|pragma|typeid|switch|bitand|return|ifndef|export|struct|sizeof|module|static|public|extern|inline|friend|delete|xor_eq|import|not_eq|class|compl|bitor|throw|or_eq|while|catch|break|union|const|const|endif|ifdef|undef|error|using|else|line|goto|else|elif|this|enum|case|new|asm|not|try|for|and|xor|or|if|do|if)\b)(?]*+|"(?:[^"]*|\\")")|'(?:[^']*|\\')')\g<60>?)+>)(?:\s)*+)?::)*\s*+)((?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*)\b(?(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))(?=\() ```the test string
``` typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; ```Here I have tested the regex at regex101.com and it leads to catastrophic backtracking: