[POC] Auto generate optimized regex to make .sublime-syntax more maintainable

Problem

The original discussion is at https://github.com/sublimehq/Packages/pull/1740#issuecomment-449503763.

As a tl;dr, please see this diff. It's really hard to just add new tokens into old rules because those regexes are "optimized" with the cost of readability (maintainability).

Proposed Solution

We store those tokens in a new file, say parser-token.txt which looks like

T_ABSTRACT
T_AND_EQUAL
T_ARRAY
T_ARRAY_CAST
T_AS
T_BAD_CHARACTER
T_BOOLEAN_AND
... (lots of lines)

We write a .sublime-syntax template, say parser-token.tpl.yaml, for it.

%YAML 1.2
---
name: PHP {#tokens_file#}
hidden: true
scope: source.php

contexts:
  main:
    - match: (\\)?\b{#regex_optimized#}\b
      scope: support.constant.parser-token.php
      captures:
        1: punctuation.separator.namespace.php

Then, we write a python script, say compile_templates.py, to do the following jobs.

load tokens from parser-token.txt
load its corresponding template parser-token.tpl.yaml

generate optimized regex by the content of parser-token.txt

(?:T_(?:A(?:BSTRACT|ND_EQUAL|RRAY(?:_CAST|)|S)|B(?:AD_CHARACTER|OOLEAN_AND))|...)

generate compiled syntax file parser-token.sublime-syntax with the template and the regex

%YAML 1.2
---
name: PHP parser-token
hidden: true
scope: source.php

contexts:
 main:
   - match: (\\)?\b(?:T_(?:A(?:BSTRACT|ND_EQUAL|RRAY(?:_CAST|)|S)|...)\b
     scope: support.constant.parser-token.php
     captures:
       1: punctuation.separator.namespace.php

change PHP Source.sublime-syntax to include the compiled syntax file

-    - match: (\\)?\bT_(RE(TURN|QUIRE(_ONCE)?)|G(OTO|LOBAL)|...)\b
-      scope: support.constant.parser-token.php
-      captures:
-       1: punctuation.separator.namespace.php
+    - include: 'Packages/PHP/Tokens/compiled/parser-token.sublime-syntax'

Now, we maintain parser-token.txt rather than the unreadable optimized regex.

Proof of Concept

https://github.com/jfcherng/Packages/tree/poc-regex-optimize/PHP/Tokens

Related Scripts

To expand regex: https://git.io/fjHyi
To compact regex: https://git.io/fjHy6

I might have misunderstood the performance tests Briles did that are linked from #1740, but I thought the outcome was that "compacted" regex keywords were actually slower than human-readable ones.

Since that PR was merged, I assume that @wbond agreed to expand those compact regexes. If the performance becomes even better after expanding them, then the only gain for a compact one is probably the cached file size which is mentioned in https://github.com/sublimehq/Packages/pull/1740#issuecomment-449558473 by @deathaxe.

The problem becomes that does the cache file size matter? What stops us from expanding them?

What stops us from expanding them?

I believe the only reason that stops us from doing that currently is that nobody's taken the time to do it (and written a script that automates the process). Note that some human discretion is required to declare a pattern as "human readable" and that human-readable and compacted regexes aren't exclusive.

Has anyone replicated the original performance tests using non-capturing groups?

I believe the only reason from doing that currently is that nobody's taken the time to do it (and written a script that automates the process).

https://github.com/asciimoo/exrex can be used to expand those compacted regexes. But beware that if there is a dash in the regex... \b(?:A|A-B)\b matches only A in string A-B. Sort components with the following compare function seems to be good.

# used to prevent the following situation (for example, for CSS syntax)
# if there is a dash in the regex, "\b(?:A|A-B)\b" matches only "A" in string "A-B".
def token_sort(a: str, b: str) -> float:
    def has_punctuation(s: str) -> bool:
        return not re.match(r"^[0-9a-zA-Z_]+$", s)

    if a == b:
        return 0

    if a.startswith(b) and has_punctuation(a):
        return -1

    if b.startswith(a) and has_punctuation(b):
        return 1

    if a > b:
        return 1

    return -1

Note that some human discretion is required to declare a pattern as "human readable" and that human-readable and compacted regexes aren't exclusive.

I think we do just like that PR https://github.com/sublimehq/Packages/pull/349/files. Each token should be plain text (no regex is allowed, not even parentheses).

Has anyone replicated the original performance tests using non-capturing groups?

Interesting. There are indeed quite lots of unused capture groups in the original comparison test.

Has anyone replicated the original performance tests using non-capturing groups?

I use the test case privided in https://github.com/sublimehq/Packages/pull/349#issuecomment-216545283.

The average time it takes to parse the test case over 10 runs on my machine:

Current master: 29X ms

Regex
``` display|width|background-color|height|position|font-family|font-weight | top|opacity|cursor|background-image|right|visibility|box-sizing | user-select|left|float|margin-left|margin-top|line-height | padding-left|z-index|margin-bottom|margin-right|margin | vertical-align|padding-top|white-space|border-radius|padding-bottom | padding-right|padding|bottom|clear|max-width|box-shadow|content | border-color|min-height|min-width|font-style|border-width | border-collapse|background-size|text-overflow|max-height|text-transform | text-shadow|text-indent|border-style|overflow-y|list-style-type | word-wrap|border-spacing|appearance|zoom|overflow-x|border-top-left-radius | border-bottom-left-radius|border-top-color|pointer-events | border-bottom-color|align-items|justify-content|letter-spacing | border-top-right-radius|border-bottom-right-radius|border-right-width | font-smoothing|border-bottom-width|border-right-color|direction | border-top-width|src|border-left-color|border-left-width | tap-highlight-color|table-layout|background-clip|word-break | transform-origin|resize|filter|backface-visibility|text-rendering | box-orient|transition-property|transition-duration|word-spacing | quotes|outline-offset|animation-timing-function|animation-duration | animation-name|transition-timing-function|border-bottom-style | border-bottom|transition-delay|transition|unicode-bidi|border-top-style | border-top|unicode-range|list-style-position|orphans|outline-width | line-clamp|order|flex-direction|box-pack|animation-fill-mode | outline-color|list-style-image|list-style|touch-action|flex-grow | border-left-style|border-left|animation-iteration-count | page-break-inside|box-flex|box-align|page-break-after|animation-delay | widows|border-right-style|border-right|flex-align|outline-style | outline|background-origin|animation-direction|fill-opacity | background-attachment|flex-wrap|transform-style|counter-increment | overflow-wrap|counter-reset|animation-play-state|animation | will-change|box-ordinal-group|image-rendering|mask-image|flex-flow | background-position-y|stroke-width|background-position-x|background-position | background-blend-mode|flex-shrink|flex-basis|flex-order|flex-item-align | flex-line-pack|flex-negative|flex-pack|flex-positive|flex-preferred-size | flex|user-drag|font-stretch|column-count|empty-cells|align-self | caption-side|mask-size|column-gap|mask-repeat|box-direction | font-feature-settings|mask-position|align-content|object-fit | columns|text-fill-color|clip-path|stop-color|font-kerning | page-break-before|stroke-dasharray|size|fill-rule|border-image-slice | column-width|break-inside|column-break-before|border-image-width | stroke-dashoffset|border-image-repeat|border-image-outset|line-break | stroke-linejoin|stroke-linecap|stroke-miterlimit|stroke-opacity | stroke|shape-rendering|border-image-source|border-image|border | tab-size|writing-mode|perspective-origin-y|perspective-origin-x | perspective-origin|perspective|text-align-last|text-align|clip-rule | clip|text-anchor|column-rule-color|box-decoration-break|column-fill | fill|column-rule-style|mix-blend-mode|text-emphasis-color | baseline-shift|dominant-baseline|page|alignment-baseline | column-rule-width|column-rule|break-after|font-variant-ligatures | transform-origin-y|transform-origin-x|transform|object-position | break-before|column-span|isolation|shape-outside|all | color-interpolation-filters|marker|marker-end|marker-start | marker-mid|color-rendering|color-interpolation|background-repeat-x | background-repeat-y|background-repeat|background|mask-type | flood-color|flood-opacity|text-orientation|mask-composite | text-emphasis-style|paint-order|lighting-color|shape-margin | text-emphasis-position|text-emphasis|shape-image-threshold | mask-clip|mask-origin|mask|font-variant-caps|font-variant-alternates | font-variant-east-asian|font-variant-numeric|font-variant-position | font-variant|font-size-adjust|font-size|font-language-override | font-display|font-synthesis|font|line-box-contain|text-justify | text-decoration-color|text-decoration-style|text-decoration-line | text-decoration|text-underline-position|grid-template-rows | grid-template-columns|grid-template-areas|grid-template|rotate|scale | translate|scroll-behavior|grid-column-start|grid-column-end | grid-column-gap|grid-row-start|grid-row-end|grid-auto-rows | grid-area|grid-auto-flow|grid-auto-columns|image-orientation | hyphens|overflow-scrolling|overflow|color-profile|kerning | nbsp-mode|color|image-resolution|grid-row-gap|grid-row|grid-column | blend-mode|azimuth|pause-after|pause-before|pause|pitch-range|pitch | text-height|system|negative|prefix|suffix|range|pad|fallback | additive-symbols|symbols|speak-as|speak|grid-gap ```
Current master + Undo the PR: 29X ms

Regex
``` align(-content|-items|-self)? | alignment-(adjust|baseline) | all | animation(-delay|-direction|-duration|-fill-mode|-iteration-count|-name|-play-state|-timing-function)? | appearance|azimuth|backface-visibility | background(-attachment|-blend-mode|-clip|-color|-image|-origin|-position|-repeat|-size)? | baseline-shift|behavior | border(-bottom|-left|-right|-top)?-(color|radius|width|style) | border-(bottom|top)-(left|right)-radius | border-image(-outset|-repeat|-slice|-source|-width)? | border(-bottom|-left|-right|-top|-collapse|-spacing)? | bottom | box-(align|decoration-break|direction|flex|ordinal-group|orient|pack|shadow|sizing) | break-(after|before|inside) | caption-side|clear | clip(-path|-rule)? | color(-interpolation(-filters)?|-profile|-rendering)? | column((-count|-fill|-gap|-rule(-color|-style|-width)?|-span|-width)|s)? | content | counter-(increment|reset) | cue(-after|-before)? | cursor|direction|display|dominant-baseline|elevation|empty-cells | fill(-rule)? | filter | flex((-item)?-align|-basis|-direction|-flow|-grow|(-line)?-pack |-negative|-order|-positive|-preferred-size|-shrink|-wrap)? | float|flood-color | font-emphasize(-position|-style)? | font-variant(-alternates|-caps|-east-asian|-ligatures|-numeric|-position)? | font(-effect|-family|-feature-settings|-kerning|-language-override |-size(-adjust)?|-smoothing|-stretch|-style|-synthesis|-weight)? | force-broken-image-icon | grid-area | grid-auto-(columns|flow|rows) | grid-(column|row)(-end|-start)? | grid-template(-areas|-columns|-rows)? | group|height|hyphens | image-(orientation|rendering|resolution) | ime-mode | initial-letter(-align)? | inline-(box-align|flex) | interpolation-mode|isolation|justify-content|kerning|left|letter-spacing|lighting-color | line-(box-contain|clamp|height|stacking(-ruby|-shift|-strategy)?) | list-style(-image|-position|-type)? | margin(-bottom|-left|-right|-top)? | marker-(end|mid|offset|start) | marks | mask-border(-mode|-outset|-repeat|-slice|-source|-width)? | mask(-clip|-composite|-image|-mode|-origin|-position|-repeat|-size|-type)? | (max|min)-(height|width) | mix-blend-mode | nbsp-mode | object-(fit|position) | opacity|order|orphans|osx-font-smoothing | outline(-color|-offset|-style|-width)? | overflow(-scrolling|-wrap|-x|-y)? | padding(-bottom|-left|-right|-top)? | page(-break-(after|before|inside))? | paint-order | pause(-after|-before)? | perspective(-origin)? | pitch(-range)? | play-during|pointer-events|position|quotes|resize|richness|right|scroll-behavior | shape-(image-threshold|margin|outside|rendering) | size | speak(-header|-numeral|-punctuation)? | speech-rate|src|stop-color|stress | stroke(-dasharray|-line(cap|join))? | tab-size|table-layout|tap-highlight-color | text-align(-last)? | text-decoration(-color|-line|-style)? | text-emphasis(-color|-position|-style)? | text-(anchor|fill-color|height|indent|justify|orientation|overflow |rendering|shadow|size-adjust|transform|underline-position) | top | touch-(action|callout) | transform(-origin(-[xy])?|-style)? | transition(-delay|-duration|-property|-timing-function)? | unicode-(bidi|range) | user-select|vertical-align|visibility|voice-family|volume|white-space|widows|width|will-change | word-(break|spacing|wrap) | writing-mode|z-index|zoom ```
Current master + Undo the PR + Use non-capture groups (?:): 29X ms

Regex
``` align(?:-content|-items|-self)? | alignment-(?:adjust|baseline) | all | animation(?:-delay|-direction|-duration|-fill-mode|-iteration-count|-name|-play-state|-timing-function)? | appearance|azimuth|backface-visibility | background(?:-attachment|-blend-mode|-clip|-color|-image|-origin|-position|-repeat|-size)? | baseline-shift|behavior | border(?:-bottom|-left|-right|-top)?-(?:color|radius|width|style) | border-(?:bottom|top)-(?:left|right)-radius | border-image(?:-outset|-repeat|-slice|-source|-width)? | border(?:-bottom|-left|-right|-top|-collapse|-spacing)? | bottom | box-(?:align|decoration-break|direction|flex|ordinal-group|orient|pack|shadow|sizing) | break-(?:after|before|inside) | caption-side|clear | clip(?:-path|-rule)? | color(?:-interpolation(?:-filters)?|-profile|-rendering)? | column(?:(?:-count|-fill|-gap|-rule(?:-color|-style|-width)?|-span|-width)|s)? | content | counter-(?:increment|reset) | cue(?:-after|-before)? | cursor|direction|display|dominant-baseline|elevation|empty-cells | fill(?:-rule)? | filter | flex(?:(?:-item)?-align|-basis|-direction|-flow|-grow|(?:-line)?-pack |-negative|-order|-positive|-preferred-size|-shrink|-wrap)? | float|flood-color | font-emphasize(?:-position|-style)? | font-variant(?:-alternates|-caps|-east-asian|-ligatures|-numeric|-position)? | font(?:-effect|-family|-feature-settings|-kerning|-language-override |-size(?:-adjust)?|-smoothing|-stretch|-style|-synthesis|-weight)? | force-broken-image-icon | grid-area | grid-auto-(?:columns|flow|rows) | grid-(?:column|row)(?:-end|-start)? | grid-template(?:-areas|-columns|-rows)? | group|height|hyphens | image-(?:orientation|rendering|resolution) | ime-mode | initial-letter(?:-align)? | inline-(?:box-align|flex) | interpolation-mode|isolation|justify-content|kerning|left|letter-spacing|lighting-color | line-(?:box-contain|clamp|height|stacking(?:-ruby|-shift|-strategy)?) | list-style(?:-image|-position|-type)? | margin(?:-bottom|-left|-right|-top)? | marker-(?:end|mid|offset|start) | marks | mask-border(?:-mode|-outset|-repeat|-slice|-source|-width)? | mask(?:-clip|-composite|-image|-mode|-origin|-position|-repeat|-size|-type)? | (?:max|min)-(?:height|width) | mix-blend-mode | nbsp-mode | object-(?:fit|position) | opacity|order|orphans|osx-font-smoothing | outline(?:-color|-offset|-style|-width)? | overflow(?:-scrolling|-wrap|-x|-y)? | padding(?:-bottom|-left|-right|-top)? | page(?:-break-(?:after|before|inside))? | paint-order | pause(?:-after|-before)? | perspective(?:-origin)? | pitch(?:-range)? | play-during|pointer-events|position|quotes|resize|richness|right|scroll-behavior | shape-(?:image-threshold|margin|outside|rendering) | size | speak(?:-header|-numeral|-punctuation)? | speech-rate|src|stop-color|stress | stroke(?:-dasharray|-line(?:cap|join))? | tab-size|table-layout|tap-highlight-color | text-align(?:-last)? | text-decoration(?:-color|-line|-style)? | text-emphasis(?:-color|-position|-style)? | text-(?:anchor|fill-color|height|indent|justify|orientation|overflow |rendering|shadow|size-adjust|transform|underline-position) | top | touch-(?:action|callout) | transform(?:-origin(?:-[xy])?|-style)? | transition(?:-delay|-duration|-property|-timing-function)? | unicode-(?:bidi|range) | user-select|vertical-align|visibility|voice-family|volume|white-space|widows|width|will-change | word-(?:break|spacing|wrap) | writing-mode|z-index|zoom ```
Current master + Undo the PR + Max Compacted: 29X ms

Regex
Generated using this POC. ``` (?:a(?:l(?:ign(?:-(?:content|items|self)|ment-baseline)|l)|nimation(?:-(?:d(?:elay|irection|uration)|fill-mode|iteration-count|name|play-state|timing-function)|)|ppearance|zimuth)|b(?:a(?:ck(?:face-visibility|ground(?:-(?:attachment|blend-mode|c(?:lip|olor)|image|origin|position(?:-(?:x|y)|)|repeat(?:-(?:x|y)|)|size)|))|seline-shift)|lend-mode|o(?:rder(?:-(?:bottom(?:-(?:color|left-radius|right-radius|style|width)|)|col(?:lapse|or)|image(?:-(?:outset|repeat|s(?:lice|ource)|width)|)|left(?:-(?:color|style|width)|)|r(?:adius|ight(?:-(?:color|style|width)|))|s(?:pacing|tyle)|top(?:-(?:color|left-radius|right-radius|style|width)|)|width)|)|ttom|x-(?:align|d(?:ecoration-break|irection)|flex|or(?:dinal-group|ient)|pack|s(?:hadow|izing)))|reak-(?:after|before|inside))|c(?:aption-side|l(?:ear|ip(?:-(?:path|rule)|))|o(?:l(?:or(?:-(?:interpolation(?:-filters|)|profile|rendering)|)|umn(?:-(?:break-before|count|fill|gap|rule(?:-(?:color|style|width)|)|span|width)|s))|ntent|unter-(?:increment|reset))|ursor)|d(?:i(?:rection|splay)|ominant-baseline)|empty-cells|f(?:il(?:l(?:-(?:opacity|rule)|)|ter)|l(?:ex(?:-(?:align|basis|direction|flow|grow|item-align|line-pack|negative|order|p(?:ack|ositive|referred-size)|shrink|wrap)|)|o(?:at|od-(?:color|opacity)))|ont(?:-(?:display|f(?:amily|eature-settings)|kerning|language-override|s(?:ize(?:-adjust|)|moothing|t(?:retch|yle)|ynthesis)|variant(?:-(?:alternates|caps|east-asian|ligatures|numeric|position)|)|weight)|))|grid-(?:a(?:rea|uto-(?:columns|flow|rows))|column(?:-(?:end|gap|start)|)|row(?:-(?:end|start)|)|template(?:-(?:areas|columns|rows)|))|h(?:eight|yphens)|i(?:mage-(?:orientation|re(?:ndering|solution))|solation)|justify-content|kerning|l(?:e(?:ft|tter-spacing)|i(?:ghting-color|ne-(?:b(?:ox-contain|reak)|clamp|height)|st-style(?:-(?:image|position|type)|)))|m(?:a(?:r(?:gin(?:-(?:bottom|left|right|top)|)|ker(?:-(?:end|mid|start)|))|sk(?:-(?:c(?:lip|omposite)|image|origin|position|repeat|size|type)|)|x-(?:height|width))|i(?:n-(?:height|width)|x-blend-mode))|nbsp-mode|o(?:bject-(?:fit|position)|pacity|r(?:der|phans)|utline(?:-(?:color|offset|style|width)|)|verflow(?:-(?:scrolling|wrap|x|y)|))|p(?:a(?:dding(?:-(?:bottom|left|right|top)|)|ge(?:-break-(?:after|before|inside)|)|int-order|use(?:-(?:after|before)|))|erspective(?:-origin(?:-(?:x|y)|)|)|itch(?:-range|)|o(?:inter-events|sition))|quotes|r(?:esize|ight|otate)|s(?:c(?:ale|roll-behavior)|hape-(?:image-threshold|margin|outside|rendering)|ize|peak|rc|t(?:op-color|roke(?:-(?:dash(?:array|offset)|line(?:cap|join)|miterlimit|opacity|width)|)))|t(?:a(?:b(?:-size|le-layout)|p-highlight-color)|ext-(?:a(?:lign(?:-last|)|nchor)|decoration(?:-(?:color|line|style)|)|emphasis(?:-(?:color|position|style)|)|fill-color|height|indent|justify|o(?:rientation|verflow)|rendering|shadow|transform|underline-position)|o(?:p|uch-action)|rans(?:form(?:-(?:origin(?:-(?:x|y)|)|style)|)|ition(?:-(?:d(?:elay|uration)|property|timing-function)|)|late))|u(?:nicode-(?:bidi|range)|ser-(?:drag|select))|v(?:ertical-align|isibility)|w(?:hite-space|i(?:d(?:ows|th)|ll-change)|ord-(?:break|spacing|wrap)|riting-mode)|z(?:-index|oom)) ```

All cases are quite the same.

In @Briles' benchmark results, the best case (current master) is 348.58 ms and the worst case Max Compacted is 477.68 ms. I cannot reproduce that difference. In my test, the Max Compacted case is no noticeable difference from a Usage Sorted (current master branch's, untouched) one.

Here is the fully max compacted version CSS syntax modified from the current master branch. But, again, no noticeable performance difference, still 29X ms.

The cached file size Sublime Text 3/Cache/CSS/CSS.sublime-syntax.rcache:

Current master: 146 KB
After undoing the PR: 145 KB
fully max compacted: 123 KB

I've always been under the impression that Sublime compiled all of the regexps from a context into a single finite state machine. Once this is done, matching is incredibly simple, and performance is guaranteed linear in the number of characters examined:

bool match(
    int initialState,
    int* doesStateAccept,
    int** transitionTable,
    char* string,
    int begin,
    int end,
) {
    int state = initialState;
    for (int i = begin; i < end; i++) {
        state = transitionTable[state][string[i]];
        if (doesStateAccept[state]) return true;
    }
    return false;
}

If two regexps are equivalent (match the same set of strings), then they should produce identical FSMs, no matter how thoroughly golfed or “optimized” the expressions may be. This is in contrast to slower-but-more-powerful backtracking-based engines like Oniguruma, in which the structure of the expression can greatly affect performance. (This is why Oniguruma has atomic groups, but FSM's don't need them.)

The above does not account for captures, because I don't know how Sublime implements them. Clearly, tracking capture groups requires more work than not tracking them requires. And we know that Sublime doesn't optimize away unnecessary capture groups, because that would visibly affect tokenization. In #1614, we concluded that these extra captures do measurably affect performance, although perhaps the difference is not noticeable at smaller scales.

If rearranging a capture-less, non-Oniguruma regexp without changing its meaning has any consistent measurable effect on parsing performance, then this calls my understanding of Sublime's regexp engine into serious question, and I thought I had a pretty solid handle on this. Perhaps @wbond or someone from Sublime HQ might be able to confirm or refute my assumptions. If it would affect the way we should write the core syntaxes, then it would be nice to get an authoritative statement on this.

There are quite lots of compacted regex in PHP's syntax.

Now I have a fully expanded PHP syntax file.

Performance

The benchmarked file is syntax_test_php.php.

Before expanded (current master): 16.X ms on avg over 10 runs
After expanded (as above gist): 16.X ms on avg over 10 runs

No noticeable difference.

`PHP.sublime-syntax.rcache` sizes

Before expanded (current master): 3688 KB
After expanded (as above gist): 3788 KB

...?

I do another test with this file to test what happens if there are LOTS of functions in a PHP file. I collect all functions that are expanded into the single PHP file as a test case.

Before expanded (current master): ~40 ms on avg over 10 runs
After expanded (as above gist): ~50 ms on avg over 10 runs

Hmm... 25% extra cost in this case.

Those two syntaxes don't seem to be equivalent. For instance, the support.constant.std.php expression differs: the expanded list contains __COMPILER_HALT_OFFSET__, whereas the compacted version does not seem to match that identifier.

I ran my own benchmark to zero in on the regexp format question. The test syntaxes are all of the following form:

%YAML 1.2
---
name: PHP Text X
scope: source.php
variables:
  name: # expression
contexts:
  main:
    - match: '{{name}}'
      scope: region.redish
      # set: main

The expression is some representation of the support.constant.ext.php expression from the core PHP syntax: either the compact representation as it is in the syntax, or that version with the capturing groups replaced by noncapturing groups, or @jfcherng's fully expanded version. I removed the punctuation match. The compact version with captures:

\b(GLOB_(MARK|BRACE|NO(SORT|CHECK|ESCAPE)|ONLYDIR|ERR|AVAILABLE_FLAGS)|XML_(SAX_IMPL|HTML_DOCUMENT_NODE|N(OTATION_NODE|AMESPACE_DECL_NODE)|C(OMMENT_NODE|DATA_SECTION_NODE)|TEXT_NODE|OPTION_(SKIP_(TAGSTART|WHITE)|CASE_FOLDING|TARGET_ENCODING)|D(TD_NODE|OCUMENT_(NODE|TYPE_NODE|FRAG_NODE))|PI_NODE|E(RROR_(RECURSIVE_ENTITY_REF|MISPLACED_XML_PI|B(INARY_ENTITY_REF|AD_CHAR_REF)|SYNTAX|NO(NE|_(MEMORY|ELEMENTS))|TAG_MISMATCH|IN(CORRECT_ENCODING|VALID_TOKEN)|DUPLICATE_ATTRIBUTE|UN(CLOSED_(CDATA_SECTION|TOKEN)|DEFINED_ENTITY|KNOWN_ENCODING)|JUNK_AFTER_DOC_ELEMENT|PAR(TIAL_CHAR|AM_ENTITY_REF)|EXTERNAL_ENTITY_HANDLING|A(SYNC_ENTITY|TTRIBUTE_EXTERNAL_ENTITY_REF))|NTITY_(REF_NODE|NODE|DECL_NODE)|LEMENT_(NODE|DECL_NODE))|LOCAL_NAMESPACE|ATTRIBUTE_(N(MTOKEN(S)?|O(TATION|DE))|CDATA|ID(REF(S)?)?|DECL_NODE|EN(TITY|UMERATION)))|M(HASH_(RIPEMD(1(28|60)|256|320)|GOST|MD(2|4|5)|S(HA(1|2(24|56)|384|512)|NEFRU256)|HAVAL(1(28|92|60)|2(24|56))|CRC32(B)?|TIGER(1(28|60))?|WHIRLPOOL|ADLER32)|YSQL(_(BOTH|NUM|CLIENT_(SSL|COMPRESS|I(GNORE_SPACE|NTERACTIVE))|ASSOC)|I_(RE(PORT_(STRICT|INDEX|OFF|ERROR|ALL)|FRESH_(GRANT|MASTER|BACKUP_LOG|S(TATUS|LAVE)|HOSTS|T(HREADS|ABLES)|LOG)|AD_DEFAULT_(GROUP|FILE))|GROUP_FLAG|MULTIPLE_KEY_FLAG|B(INARY_FLAG|OTH|LOB_FLAG)|S(T(MT_ATTR_(CURSOR_TYPE|UPDATE_MAX_LENGTH|PREFETCH_ROWS)|ORE_RESULT)|E(RVER_QUERY_(NO_(GOOD_INDEX_USED|INDEX_USED)|WAS_SLOW)|T_(CHARSET_NAME|FLAG)))|N(O(_D(EFAULT_VALUE_FLAG|ATA)|T_NULL_FLAG)|UM(_FLAG)?)|C(URSOR_TYPE_(READ_ONLY|SCROLLABLE|NO_CURSOR|FOR_UPDATE)|LIENT_(SSL|NO_SCHEMA|COMPRESS|I(GNORE_SPACE|NTERACTIVE)|FOUND_ROWS))|T(YPE_(GEOMETRY|MEDIUM_BLOB|B(IT|LOB)|S(HORT|TRING|ET)|YEAR|N(ULL|EWD(ECIMAL|ATE))|CHAR|TI(ME(STAMP)?|NY(_BLOB)?)|INT(24|ERVAL)|D(OUBLE|ECIMAL|ATE(TIME)?)|ENUM|VAR_STRING|FLOAT|LONG(_BLOB|LONG)?)|IMESTAMP_FLAG)|INIT_COMMAND|ZEROFILL_FLAG|O(N_UPDATE_NOW_FLAG|PT_(NET_(READ_BUFFER_SIZE|CMD_BUFFER_SIZE)|CONNECT_TIMEOUT|INT_AND_FLOAT_NATIVE|LOCAL_INFILE))|D(EBUG_TRACE_ENABLED|ATA_TRUNCATED)|U(SE_RESULT|N(SIGNED_FLAG|IQUE_KEY_FLAG))|P(RI_KEY_FLAG|ART_KEY_FLAG)|ENUM_FLAG|A(S(SOC|YNC)|UTO_INCREMENT_FLAG)))|CRYPT_(R(C(2|6)|IJNDAEL_(1(28|92)|256)|AND)|GOST|XTEA|M(ODE_(STREAM|NOFB|C(BC|FB)|OFB|ECB)|ARS)|BLOWFISH(_COMPAT)?|S(ERPENT|KIPJACK|AFER(128|PLUS|64))|C(RYPT|AST_(128|256))|T(RIPLEDES|HREEWAY|WOFISH)|IDEA|3DES|DE(S|CRYPT|V_(RANDOM|URANDOM))|PANAMA|EN(CRYPT|IGNA)|WAKE|LOKI97|ARCFOUR(_IV)?))|S(TREAM_(REPORT_ERRORS|M(UST_SEEK|KDIR_RECURSIVE)|BUFFER_(NONE|FULL|LINE)|S(HUT_(RD(WR)?|WR)|OCK_(R(DM|AW)|S(TREAM|EQPACKET)|DGRAM)|ERVER_(BIND|LISTEN))|NOTIFY_(RE(SOLVE|DIRECTED)|MIME_TYPE_IS|SEVERITY_(INFO|ERR|WARN)|CO(MPLETED|NNECT)|PROGRESS|F(ILE_SIZE_IS|AILURE)|AUTH_RE(SULT|QUIRED))|C(RYPTO_METHOD_(SSLv(2(_(SERVER|CLIENT)|3_(SERVER|CLIENT))|3_(SERVER|CLIENT))|TLS_(SERVER|CLIENT))|LIENT_(CONNECT|PERSISTENT|ASYNC_CONNECT)|AST_(FOR_SELECT|AS_STREAM))|I(GNORE_URL|S_URL|PPROTO_(RAW|TCP|I(CMP|P)|UDP))|O(OB|PTION_(READ_(BUFFER|TIMEOUT)|BLOCKING|WRITE_BUFFER))|U(RL_STAT_(QUIET|LINK)|SE_PATH)|P(EEK|F_(INET(6)?|UNIX))|ENFORCE_SAFE_MODE|FILTER_(READ|WRITE|ALL))|UNFUNCS_RET_(STRING|TIMESTAMP|DOUBLE)|QLITE(_(R(OW|EADONLY)|MIS(MATCH|USE)|B(OTH|USY)|SCHEMA|N(O(MEM|T(FOUND|ADB)|LFS)|UM)|C(O(RRUPT|NSTRAINT)|ANTOPEN)|TOOBIG|I(NTER(RUPT|NAL)|OERR)|OK|DONE|P(ROTOCOL|ERM)|E(RROR|MPTY)|F(ORMAT|ULL)|LOCKED|A(BORT|SSOC|UTH))|3_(B(OTH|LOB)|NU(M|LL)|TEXT|INTEGER|OPEN_(READ(ONLY|WRITE)|CREATE)|FLOAT|ASSOC)))|CURL(M(SG_DONE|_(BAD_(HANDLE|EASY_HANDLE)|CALL_MULTI_PERFORM|INTERNAL_ERROR|O(UT_OF_MEMORY|K))|OPT_PUSHFUNCTION)|SSH_AUTH_(HOST|NONE|DEFAULT|P(UBLICKEY|ASSWORD)|KEYBOARD)|CLOSEPOLICY_(SLOWEST|CALLBACK|OLDEST|LEAST_(RECENTLY_USED|TRAFFIC))|_(HTTP_VERSION_(1_(1|0)|NONE)|NETRC_(REQUIRED|IGNORED|OPTIONAL)|TIMECOND_(IF(MODSINCE|UNMODSINCE)|LASTMOD)|IPRESOLVE_(V(4|6)|WHATEVER)|VERSION_(SSL|IPV6|KERBEROS4|LIBZ)|PUSH_(OK|DENY))|INFO_(RE(DIRECT_(COUNT|TIME)|QUEST_SIZE)|S(SL_VERIFYRESULT|TARTTRANSFER_TIME|IZE_(DOWNLOAD|UPLOAD)|PEED_(DOWNLOAD|UPLOAD))|H(TTP_CODE|EADER_(SIZE|OUT))|NAMELOOKUP_TIME|C(ON(NECT_TIME|TENT_(TYPE|LENGTH_(DOWNLOAD|UPLOAD)))|ERTINFO)|TOTAL_TIME|PR(IVATE|ETRANSFER_TIME)|EFFECTIVE_URL|FILETIME)|OPT_(R(E(SUME_FROM|TURNTRANSFER|DIR_PROTOCOLS|FERER|AD(DATA|FUNCTION))|AN(GE|DOM_FILE))|MAX(REDIRS|CONNECTS)|B(INARYTRANSFER|UFFERSIZE)|S(S(H_(HOST_PUBLIC_KEY_MD5|P(RIVATE_KEYFILE|UBLIC_KEYFILE)|AUTH_TYPES)|L(CERT(TYPE|PASSWD)?|_(CIPHER_LIST|VERIFY(HOST|PEER))|ENGINE(_DEFAULT)?|VERSION|KEY(TYPE|PASSWD)?))|TDERR)|H(TTP(GET|HEADER|200ALIASES|_VERSION|PROXYTUNNEL|AUTH)|EADER(FUNCTION)?)|N(O(BODY|SIGNAL|PROGRESS)|ETRC)|C(RLF|O(NNECTTIMEOUT(_MS)?|OKIE(SESSION|JAR|FILE)?)|USTOMREQUEST|ERTINFO|LOSEPOLICY|A(INFO|PATH))|T(RANSFERTEXT|CP_NODELAY|IME(CONDITION|OUT(_MS)?|VALUE))|I(N(TERFACE|FILE(SIZE)?)|PRESOLVE)|DNS_(CACHE_TIMEOUT|USE_GLOBAL_CACHE)|U(RL|SER(PWD|AGENT)|NRESTRICTED_AUTH|PLOAD)|P(R(IVATE|O(GRESSFUNCTION|XY(TYPE|USERPWD|PORT|AUTH)?|TOCOLS))|O(RT|ST(REDIR|QUOTE|FIELDS)?)|UT)|E(GDSOCKET|NCODING)|VERBOSE|K(RB4LEVEL|EYPASSWD)|QUOTE|F(RESH_CONNECT|TP(SSLAUTH|_(S(SL|KIP_PASV_IP)|CREATE_MISSING_DIRS|USE_EP(RT|SV)|FILEMETHOD)|PORT|LISTONLY|APPEND)|ILE(TIME)?|O(RBID_REUSE|LLOWLOCATION)|AILONERROR)|WRITE(HEADER|FUNCTION)|LOW_SPEED_(TIME|LIMIT)|AUTOREFERER)|PRO(XY_(SOCKS(4|5)|HTTP)|TO_(S(CP|FTP)|HTTP(S)?|T(ELNET|FTP)|DICT|F(TP(S)?|ILE)|LDAP(S)?|ALL))|E_(RE(CV_ERROR|AD_ERROR)|GOT_NOTHING|MALFORMAT_USER|BAD_(C(ONTENT_ENCODING|ALLING_ORDER)|PASSWORD_ENTERED|FUNCTION_ARGUMENT)|S(S(H|L_(C(IPHER|ONNECT_ERROR|ERTPROBLEM|ACERT)|PEER_CERTIFICATE|ENGINE_(SETFAILED|NOTFOUND)))|HARE_IN_USE|END_ERROR)|HTTP_(RANGE_ERROR|NOT_FOUND|PO(RT_FAILED|ST_ERROR))|COULDNT_(RESOLVE_(HOST|PROXY)|CONNECT)|T(OO_MANY_REDIRECTS|ELNET_OPTION_SYNTAX)|O(BSOLETE|UT_OF_MEMORY|PERATION_TIMEOUTED|K)|U(RL_MALFORMAT(_USER)?|N(SUPPORTED_PROTOCOL|KNOWN_TELNET_OPTION))|PARTIAL_FILE|F(TP_(BAD_DOWNLOAD_RESUME|SSL_FAILED|C(OULDNT_(RETR_FILE|GET_SIZE|S(TOR_FILE|ET_(BINARY|ASCII))|USE_REST)|ANT_(RECONNECT|GET_HOST))|USER_PASSWORD_INCORRECT|PORT_FAILED|QUOTE_ERROR|W(RITE_ERROR|EIRD_(SERVER_REPLY|227_FORMAT|USER_REPLY|PAS(S_REPLY|V_REPLY)))|ACCESS_DENIED)|ILE(SIZE_EXCEEDED|_COULDNT_READ_FILE)|UNCTION_NOT_FOUND|AILED_INIT)|WRITE_ERROR|L(IBRARY_NOT_FOUND|DAP_(SEARCH_FAILED|CANNOT_BIND|INVALID_URL))|ABORTED_BY_CALLBACK)|VERSION_NOW|FTP(METHOD_(MULTICWD|SINGLECWD|NOCWD)|SSL_(NONE|CONTROL|TRY|ALL)|AUTH_(SSL|TLS|DEFAULT))|AUTH_(GSSNEGOTIATE|BASIC|NTLM|DIGEST|ANY(SAFE)?))|I(MAGETYPE_(GIF|XBM|BMP|SW(C|F)|COUNT|TIFF_(MM|II)|I(CO|FF)|UNKNOWN|J(B2|P(X|2|C|EG(2000)?))|P(SD|NG)|WBMP|WEBP)|NPUT_(REQUEST|GET|SE(RVER|SSION)|COOKIE|POST|ENV)|CONV_(MIME_DECODE_(STRICT|CONTINUE_ON_ERROR)|IMPL|VERSION))|D(NS_(MX|S(RV|OA)|HINFO|N(S|APTR)|CNAME|TXT|PTR|A(NY|LL|AAA|6)?)|OM(STRING_SIZE_ERR|_(SYNTAX_ERR|HIERARCHY_REQUEST_ERR|N(O(_(MODIFICATION_ALLOWED_ERR|DATA_ALLOWED_ERR)|T_(SUPPORTED_ERR|FOUND_ERR))|AMESPACE_ERR)|IN(DEX_SIZE_ERR|USE_ATTRIBUTE_ERR|VALID_(MODIFICATION_ERR|STATE_ERR|CHARACTER_ERR|ACCESS_ERR))|PHP_ERR|VALIDATION_ERR|WRONG_DOCUMENT_ERR)))|JSON_(BIGINT_AS_STRING|HEX_(TAG|QUOT|A(MP|POS))|NUMERIC_CHECK|ERROR_(S(YNTAX|TATE_MISMATCH)|NONE|CTRL_CHAR|DEPTH|RECURSION|INF_OR_NAN|UNSUPPORTED_TYPE|INVALID_PROPERTY_NAME|UTF(8|16))|FORCE_OBJECT|UNESCAPED_(LINE_TERMINATORS|SLASHES|UNICODE)|OBJECT_AS_ARRAY|PARTIAL_OUTPUT_ON_ERROR|PRESERVE_ZERO_FRACTION|PRETTY_PRINT|INVALID_UTF8_(IGNORE|SUBSTITUTE)|THROW_ON_ERROR)|P(REG_(RECURSION_LIMIT_ERROR|GREP_INVERT|BA(CKTRACK_LIMIT_ERROR|D_UTF8_(OFFSET_ERROR|ERROR))|S(PLIT_(NO_EMPTY|OFFSET_CAPTURE|DELIM_CAPTURE)|ET_ORDER)|NO_ERROR|INTERNAL_ERROR|JIT_STACKLIMIT_ERROR|OFFSET_CAPTURE|PATTERN_ORDER|UNMATCHED_AS_NULL)|SFS_(PASS_ON|ERR_FATAL|F(EED_ME|LAG_(NORMAL|FLUSH_(CLOSE|INC))))|CRE_VERSION|OSIX_(R_OK|X_OK|S_IF(REG|BLK|SOCK|CHR|IFO)|F_OK|W_OK|RLIMIT_(AS|C(ORE|PU)|DATA|FSIZE|LOCKS|M(EMLOCK|SGQUEUE)|N(ICE|OFILE|PROC)|R(SS|T(PRIO|TIME))|S(IGPENDING|TACK)|INFINITY)))|F(NM_(NOESCAPE|CASEFOLD|P(ERIOD|ATHNAME))|IL(TER_(REQUIRE_(SCALAR|ARRAY)|SANITIZE_(MAGIC_QUOTES|S(TRI(NG|PPED)|PECIAL_CHARS)|NUMBER_(INT|FLOAT)|URL|E(MAIL|NCODED)|FULL_SPECIAL_CHARS)|NULL_ON_FAILURE|CALLBACK|DEFAULT|UNSAFE_RAW|VALIDATE_(REGEXP|BOOLEAN|I(NT|P)|URL|EMAIL|FLOAT)|F(ORCE_ARRAY|LAG_(S(CHEME_REQUIRED|TRIP_(BACKTICK|HIGH|LOW))|HOST_REQUIRED|NO(NE|_(RES_RANGE|PRIV_RANGE|ENCODE_QUOTES))|IPV(4|6)|PATH_REQUIRED|E(MPTY_STRING_NULL|MAIL_UNICODE|NCODE_(HIGH|LOW|AMP))|QUERY_REQUIRED|ALLOW_(SCIENTIFIC|HEX|THOUSAND|OCTAL|FRACTION))))|E(_(BINARY|SKIP_EMPTY_LINES|NO_DEFAULT_CONTEXT|TEXT|IGNORE_NEW_LINES|USE_INCLUDE_PATH|APPEND)|INFO_(RAW|MIME(_(TYPE|ENCODING))?|SYMLINK|NONE|CONTINUE|DEVICES|PRESERVE_ATIME)))|ORCE_(GZIP|DEFLATE))|LIBXML_(XINCLUDE|N(SCLEAN|O(XMLDECL|BLANKS|NET|CDATA|E(RROR|MPTYTAG|NT)|WARNING))|COMPACT|D(TD(VALID|LOAD|ATTR)|OTTED_VERSION)|PARSEHUGE|ERR_(NONE|ERROR|FATAL|WARNING)|VERSION|LOADED_VERSION|BIGLINES))\b

The expanded version:

(?x:\b(?:
  CURL_HTTP_VERSION_1_0
  | CURL_HTTP_VERSION_1_1
  | CURL_HTTP_VERSION_NONE
  # snip
  | XML_TEXT_NODE
)\b)

The sample file looked like this:

// SYNTAX TEST "Packages/PHPTest/PHP Test Compact Capturing.sublime-syntax"

CURL_HTTP_VERSION_1_0
CURL_HTTP_VERSION_1_1
CURL_HTTP_VERSION_NONE
...
XML_TEXT_NODE

It contained each of the 841 identifiers matched by the regexp. The entire list of identifiers was repeated 300 times, for a total of 252,302 lines including the syntax test declaration and subsequent blank line.

For each of the three expressions, I ran one test with set: main in the match rule and one without. I ran each test three times and took the median, rounded to the nearest millisecond. The results were as follows:

Time	Test
167 ms	Compact, Capturing
167 ms	Compact, Noncapturing
169 ms	Compact, Noncapturing
427 ms	Set, Compact, Capturing
170 ms	Set, Compact, Noncapturing
171 ms	Set, Expanded

My conclusions:

Capturing groups are expensive. However, because Sublime ignores unused capture groups for rules that do not modify the stack (https://github.com/SublimeTextIssues/Core/issues/2326), in many cases the difference may be optimized away. (It's worth noting that unless they are optimized away, Sublime will break tokens at capture group boundaries, so unnecessary capture groups will affect hashed color scheme rules.)
The expanded regexp does seem to be slightly but measurably slower. This could be because the cache file is larger (which in turn is presumably because it preserves the original expression text somewhere), or it could be because I did those tests later and my CPU was a bit warmer. In any case, the difference seems to be trivial.
Stack operations are cheap — or, at least, set is cheap when there are no meta scopes or cleared scopes to deal with. Adding the set rule had only a trivial impact on performance — except in the case of the capturing expression, because it stopped Sublime from ignoring the unused capture groups. Presumably the reason for this is so that pushed contexts can refer to the captures with faux-backreferences.

tl;dr: Regexp formatting doesn't matter, but don't use unnecessary capture groups.

If that is true, just use whatever stemming seems natural?

(?x:\b(?:
  CURL_HTTP_VERSION_(?:1_0|1_1|NONE)                                                   # Sure.
  |(?:XML_TEXT|CDATA|ENTITY(?:_REF)?)_NODE                                             # Maybe?
  |E(RROR_(RECURSIVE_ENTITY_REF|MISPLACED_XML_PI|B(INARY_ENTITY_REF|AD_CHAR_REF) ...)  # No!
)\b)

I confirm all your conclusions. This is exactly what I've experienced so far.

Maybe one addition:

Negative lookaheads are more expensive than positive lookaheads with the same functionality by about 5..10%.

Example:

https://github.com/sublimehq/Packages/blob/d1494f4a5a2bf8c6ae2f3e7ad762a05273df97ef/Perl/Perl.sublime-syntax#L1076

Replace the lookahead by (?!\<) and run the syntax tests against a file with 100k lines containing

print <angular quoted text>;

positive lookahead: 373ms. negative lookahead: 401ms. (+7.5%)

That's interesting, because both expressions should produce the same FSM — at least, assuming that lookahead is implemented with AFAs and then turned into an NFA via the powerset construction, which is admittedly mere speculation. Maybe there's some sort of special lookahead implementation.

Somehow the implementation must differ as negative lookaheads work if nested in another one, but positive lhs don't. (?=(?!blabla)) is ok. In (?=bla(?=bla)) the nested one is ignored.

Yeah, I just noticed that when trying to see if I could speed up the JavaScript syntax by rewriting {{identifier_break}} as a positive lookahead. This seems like a rather serious bug.

I dug into this and it's even weirder. See referenced issue.

This seems like a rather serious bug.

A serious bug that more or less no one has run into since… people don't like reading regexes like that? :-D

Having plumbed the depths of the bug in the course of writing #2918, my estimation of its severity has lessened, mostly because it only applies in fairly specific circumstances. If it were really as general as it seemed at first, I imagine that someone would have noticed it before now.

Do you have any insight into how capture groups and lookaheads are implemented in the internal engine? Those are the only supported features that don't have a standard, no-brainer FSM implementation, and the evidence seems to indicate that they're not simply compiled into a slightly-fancier FSM.

I don't know off of the top of my head, but I do believe at least some of those details are "proprietary". 🙂

@jfcherng As the original topic seems to be moot, should this issue be closed?

My major concern about the PHP syntax is most likely be kind of solved in https://github.com/sublimehq/Packages/pull/2134 hence closed.

sublimehq / Packages