sublimehq / Packages

Syntax highlighting files shipped with Sublime Text and Sublime Merge
https://sublimetext.com
Other
2.95k stars 586 forks source link

[POC] Auto generate optimized regex to make .sublime-syntax more maintainable #2044

Closed jfcherng closed 4 years ago

jfcherng commented 5 years ago

Problem

The original discussion is at https://github.com/sublimehq/Packages/pull/1740#issuecomment-449503763.

As a tl;dr, please see this diff. It's really hard to just add new tokens into old rules because those regexes are "optimized" with the cost of readability (maintainability).

Proposed Solution

We store those tokens in a new file, say parser-token.txt which looks like

T_ABSTRACT
T_AND_EQUAL
T_ARRAY
T_ARRAY_CAST
T_AS
T_BAD_CHARACTER
T_BOOLEAN_AND
... (lots of lines)

We write a .sublime-syntax template, say parser-token.tpl.yaml, for it.

%YAML 1.2
---
name: PHP {#tokens_file#}
hidden: true
scope: source.php

contexts:
  main:
    - match: (\\)?\b{#regex_optimized#}\b
      scope: support.constant.parser-token.php
      captures:
        1: punctuation.separator.namespace.php

Then, we write a python script, say compile_templates.py, to do the following jobs.

  1. load tokens from parser-token.txt
  2. load its corresponding template parser-token.tpl.yaml
  3. generate optimized regex by the content of parser-token.txt
    (?:T_(?:A(?:BSTRACT|ND_EQUAL|RRAY(?:_CAST|)|S)|B(?:AD_CHARACTER|OOLEAN_AND))|...)
  4. generate compiled syntax file parser-token.sublime-syntax with the template and the regex

    %YAML 1.2
    ---
    name: PHP parser-token
    hidden: true
    scope: source.php
    
    contexts:
     main:
       - match: (\\)?\b(?:T_(?:A(?:BSTRACT|ND_EQUAL|RRAY(?:_CAST|)|S)|...)\b
         scope: support.constant.parser-token.php
         captures:
           1: punctuation.separator.namespace.php
  5. change PHP Source.sublime-syntax to include the compiled syntax file
    -    - match: (\\)?\bT_(RE(TURN|QUIRE(_ONCE)?)|G(OTO|LOBAL)|...)\b
    -      scope: support.constant.parser-token.php
    -      captures:
    -       1: punctuation.separator.namespace.php
    +    - include: 'Packages/PHP/Tokens/compiled/parser-token.sublime-syntax'

Now, we maintain parser-token.txt rather than the unreadable optimized regex.

Proof of Concept

https://github.com/jfcherng/Packages/tree/poc-regex-optimize/PHP/Tokens

Related Scripts

michaelblyons commented 5 years ago

I might have misunderstood the performance tests Briles did that are linked from #1740, but I thought the outcome was that "compacted" regex keywords were actually slower than human-readable ones.

jfcherng commented 5 years ago

Since that PR was merged, I assume that @wbond agreed to expand those compact regexes. If the performance becomes even better after expanding them, then the only gain for a compact one is probably the cached file size which is mentioned in https://github.com/sublimehq/Packages/pull/1740#issuecomment-449558473 by @deathaxe.

The problem becomes that does the cache file size matter? What stops us from expanding them?

FichteFoll commented 5 years ago

What stops us from expanding them?

I believe the only reason that stops us from doing that currently is that nobody's taken the time to do it (and written a script that automates the process). Note that some human discretion is required to declare a pattern as "human readable" and that human-readable and compacted regexes aren't exclusive.

Thom1729 commented 5 years ago

Has anyone replicated the original performance tests using non-capturing groups?

jfcherng commented 5 years ago

I believe the only reason from doing that currently is that nobody's taken the time to do it (and written a script that automates the process).

https://github.com/asciimoo/exrex can be used to expand those compacted regexes. But beware that if there is a dash in the regex... \b(?:A|A-B)\b matches only A in string A-B. Sort components with the following compare function seems to be good.

# used to prevent the following situation (for example, for CSS syntax)
# if there is a dash in the regex, "\b(?:A|A-B)\b" matches only "A" in string "A-B".
def token_sort(a: str, b: str) -> float:
    def has_punctuation(s: str) -> bool:
        return not re.match(r"^[0-9a-zA-Z_]+$", s)

    if a == b:
        return 0

    if a.startswith(b) and has_punctuation(a):
        return -1

    if b.startswith(a) and has_punctuation(b):
        return 1

    if a > b:
        return 1

    return -1

Note that some human discretion is required to declare a pattern as "human readable" and that human-readable and compacted regexes aren't exclusive.

I think we do just like that PR https://github.com/sublimehq/Packages/pull/349/files. Each token should be plain text (no regex is allowed, not even parentheses).

Has anyone replicated the original performance tests using non-capturing groups?

Interesting. There are indeed quite lots of unused capture groups in the original comparison test.

jfcherng commented 5 years ago

Has anyone replicated the original performance tests using non-capturing groups?

I use the test case privided in https://github.com/sublimehq/Packages/pull/349#issuecomment-216545283.

The average time it takes to parse the test case over 10 runs on my machine:

All cases are quite the same.

In @Briles' benchmark results, the best case (current master) is 348.58 ms and the worst case Max Compacted is 477.68 ms. I cannot reproduce that difference. In my test, the Max Compacted case is no noticeable difference from a Usage Sorted (current master branch's, untouched) one.


Here is the fully max compacted version CSS syntax modified from the current master branch. But, again, no noticeable performance difference, still 29X ms.

The cached file size Sublime Text 3/Cache/CSS/CSS.sublime-syntax.rcache:

Thom1729 commented 5 years ago

I've always been under the impression that Sublime compiled all of the regexps from a context into a single finite state machine. Once this is done, matching is incredibly simple, and performance is guaranteed linear in the number of characters examined:

bool match(
    int initialState,
    int* doesStateAccept,
    int** transitionTable,
    char* string,
    int begin,
    int end,
) {
    int state = initialState;
    for (int i = begin; i < end; i++) {
        state = transitionTable[state][string[i]];
        if (doesStateAccept[state]) return true;
    }
    return false;
}

If two regexps are equivalent (match the same set of strings), then they should produce identical FSMs, no matter how thoroughly golfed or “optimized” the expressions may be. This is in contrast to slower-but-more-powerful backtracking-based engines like Oniguruma, in which the structure of the expression can greatly affect performance. (This is why Oniguruma has atomic groups, but FSM's don't need them.)

The above does not account for captures, because I don't know how Sublime implements them. Clearly, tracking capture groups requires more work than not tracking them requires. And we know that Sublime doesn't optimize away unnecessary capture groups, because that would visibly affect tokenization. In #1614, we concluded that these extra captures do measurably affect performance, although perhaps the difference is not noticeable at smaller scales.


If rearranging a capture-less, non-Oniguruma regexp without changing its meaning has any consistent measurable effect on parsing performance, then this calls my understanding of Sublime's regexp engine into serious question, and I thought I had a pretty solid handle on this. Perhaps @wbond or someone from Sublime HQ might be able to confirm or refute my assumptions. If it would affect the way we should write the core syntaxes, then it would be nice to get an authoritative statement on this.

jfcherng commented 5 years ago

There are quite lots of compacted regex in PHP's syntax.

Now I have a fully expanded PHP syntax file.

Performance

The benchmarked file is syntax_test_php.php.

No noticeable difference.

PHP.sublime-syntax.rcache sizes


...?

I do another test with this file to test what happens if there are LOTS of functions in a PHP file. I collect all functions that are expanded into the single PHP file as a test case.

Hmm... 25% extra cost in this case.

Thom1729 commented 5 years ago

Those two syntaxes don't seem to be equivalent. For instance, the support.constant.std.php expression differs: the expanded list contains __COMPILER_HALT_OFFSET__, whereas the compacted version does not seem to match that identifier.

I ran my own benchmark to zero in on the regexp format question. The test syntaxes are all of the following form:

%YAML 1.2
---
name: PHP Text X
scope: source.php
variables:
  name: # expression
contexts:
  main:
    - match: '{{name}}'
      scope: region.redish
      # set: main

The expression is some representation of the support.constant.ext.php expression from the core PHP syntax: either the compact representation as it is in the syntax, or that version with the capturing groups replaced by noncapturing groups, or @jfcherng's fully expanded version. I removed the punctuation match. The compact version with captures:

\b(GLOB_(MARK|BRACE|NO(SORT|CHECK|ESCAPE)|ONLYDIR|ERR|AVAILABLE_FLAGS)|XML_(SAX_IMPL|HTML_DOCUMENT_NODE|N(OTATION_NODE|AMESPACE_DECL_NODE)|C(OMMENT_NODE|DATA_SECTION_NODE)|TEXT_NODE|OPTION_(SKIP_(TAGSTART|WHITE)|CASE_FOLDING|TARGET_ENCODING)|D(TD_NODE|OCUMENT_(NODE|TYPE_NODE|FRAG_NODE))|PI_NODE|E(RROR_(RECURSIVE_ENTITY_REF|MISPLACED_XML_PI|B(INARY_ENTITY_REF|AD_CHAR_REF)|SYNTAX|NO(NE|_(MEMORY|ELEMENTS))|TAG_MISMATCH|IN(CORRECT_ENCODING|VALID_TOKEN)|DUPLICATE_ATTRIBUTE|UN(CLOSED_(CDATA_SECTION|TOKEN)|DEFINED_ENTITY|KNOWN_ENCODING)|JUNK_AFTER_DOC_ELEMENT|PAR(TIAL_CHAR|AM_ENTITY_REF)|EXTERNAL_ENTITY_HANDLING|A(SYNC_ENTITY|TTRIBUTE_EXTERNAL_ENTITY_REF))|NTITY_(REF_NODE|NODE|DECL_NODE)|LEMENT_(NODE|DECL_NODE))|LOCAL_NAMESPACE|ATTRIBUTE_(N(MTOKEN(S)?|O(TATION|DE))|CDATA|ID(REF(S)?)?|DECL_NODE|EN(TITY|UMERATION)))|M(HASH_(RIPEMD(1(28|60)|256|320)|GOST|MD(2|4|5)|S(HA(1|2(24|56)|384|512)|NEFRU256)|HAVAL(1(28|92|60)|2(24|56))|CRC32(B)?|TIGER(1(28|60))?|WHIRLPOOL|ADLER32)|YSQL(_(BOTH|NUM|CLIENT_(SSL|COMPRESS|I(GNORE_SPACE|NTERACTIVE))|ASSOC)|I_(RE(PORT_(STRICT|INDEX|OFF|ERROR|ALL)|FRESH_(GRANT|MASTER|BACKUP_LOG|S(TATUS|LAVE)|HOSTS|T(HREADS|ABLES)|LOG)|AD_DEFAULT_(GROUP|FILE))|GROUP_FLAG|MULTIPLE_KEY_FLAG|B(INARY_FLAG|OTH|LOB_FLAG)|S(T(MT_ATTR_(CURSOR_TYPE|UPDATE_MAX_LENGTH|PREFETCH_ROWS)|ORE_RESULT)|E(RVER_QUERY_(NO_(GOOD_INDEX_USED|INDEX_USED)|WAS_SLOW)|T_(CHARSET_NAME|FLAG)))|N(O(_D(EFAULT_VALUE_FLAG|ATA)|T_NULL_FLAG)|UM(_FLAG)?)|C(URSOR_TYPE_(READ_ONLY|SCROLLABLE|NO_CURSOR|FOR_UPDATE)|LIENT_(SSL|NO_SCHEMA|COMPRESS|I(GNORE_SPACE|NTERACTIVE)|FOUND_ROWS))|T(YPE_(GEOMETRY|MEDIUM_BLOB|B(IT|LOB)|S(HORT|TRING|ET)|YEAR|N(ULL|EWD(ECIMAL|ATE))|CHAR|TI(ME(STAMP)?|NY(_BLOB)?)|INT(24|ERVAL)|D(OUBLE|ECIMAL|ATE(TIME)?)|ENUM|VAR_STRING|FLOAT|LONG(_BLOB|LONG)?)|IMESTAMP_FLAG)|INIT_COMMAND|ZEROFILL_FLAG|O(N_UPDATE_NOW_FLAG|PT_(NET_(READ_BUFFER_SIZE|CMD_BUFFER_SIZE)|CONNECT_TIMEOUT|INT_AND_FLOAT_NATIVE|LOCAL_INFILE))|D(EBUG_TRACE_ENABLED|ATA_TRUNCATED)|U(SE_RESULT|N(SIGNED_FLAG|IQUE_KEY_FLAG))|P(RI_KEY_FLAG|ART_KEY_FLAG)|ENUM_FLAG|A(S(SOC|YNC)|UTO_INCREMENT_FLAG)))|CRYPT_(R(C(2|6)|IJNDAEL_(1(28|92)|256)|AND)|GOST|XTEA|M(ODE_(STREAM|NOFB|C(BC|FB)|OFB|ECB)|ARS)|BLOWFISH(_COMPAT)?|S(ERPENT|KIPJACK|AFER(128|PLUS|64))|C(RYPT|AST_(128|256))|T(RIPLEDES|HREEWAY|WOFISH)|IDEA|3DES|DE(S|CRYPT|V_(RANDOM|URANDOM))|PANAMA|EN(CRYPT|IGNA)|WAKE|LOKI97|ARCFOUR(_IV)?))|S(TREAM_(REPORT_ERRORS|M(UST_SEEK|KDIR_RECURSIVE)|BUFFER_(NONE|FULL|LINE)|S(HUT_(RD(WR)?|WR)|OCK_(R(DM|AW)|S(TREAM|EQPACKET)|DGRAM)|ERVER_(BIND|LISTEN))|NOTIFY_(RE(SOLVE|DIRECTED)|MIME_TYPE_IS|SEVERITY_(INFO|ERR|WARN)|CO(MPLETED|NNECT)|PROGRESS|F(ILE_SIZE_IS|AILURE)|AUTH_RE(SULT|QUIRED))|C(RYPTO_METHOD_(SSLv(2(_(SERVER|CLIENT)|3_(SERVER|CLIENT))|3_(SERVER|CLIENT))|TLS_(SERVER|CLIENT))|LIENT_(CONNECT|PERSISTENT|ASYNC_CONNECT)|AST_(FOR_SELECT|AS_STREAM))|I(GNORE_URL|S_URL|PPROTO_(RAW|TCP|I(CMP|P)|UDP))|O(OB|PTION_(READ_(BUFFER|TIMEOUT)|BLOCKING|WRITE_BUFFER))|U(RL_STAT_(QUIET|LINK)|SE_PATH)|P(EEK|F_(INET(6)?|UNIX))|ENFORCE_SAFE_MODE|FILTER_(READ|WRITE|ALL))|UNFUNCS_RET_(STRING|TIMESTAMP|DOUBLE)|QLITE(_(R(OW|EADONLY)|MIS(MATCH|USE)|B(OTH|USY)|SCHEMA|N(O(MEM|T(FOUND|ADB)|LFS)|UM)|C(O(RRUPT|NSTRAINT)|ANTOPEN)|TOOBIG|I(NTER(RUPT|NAL)|OERR)|OK|DONE|P(ROTOCOL|ERM)|E(RROR|MPTY)|F(ORMAT|ULL)|LOCKED|A(BORT|SSOC|UTH))|3_(B(OTH|LOB)|NU(M|LL)|TEXT|INTEGER|OPEN_(READ(ONLY|WRITE)|CREATE)|FLOAT|ASSOC)))|CURL(M(SG_DONE|_(BAD_(HANDLE|EASY_HANDLE)|CALL_MULTI_PERFORM|INTERNAL_ERROR|O(UT_OF_MEMORY|K))|OPT_PUSHFUNCTION)|SSH_AUTH_(HOST|NONE|DEFAULT|P(UBLICKEY|ASSWORD)|KEYBOARD)|CLOSEPOLICY_(SLOWEST|CALLBACK|OLDEST|LEAST_(RECENTLY_USED|TRAFFIC))|_(HTTP_VERSION_(1_(1|0)|NONE)|NETRC_(REQUIRED|IGNORED|OPTIONAL)|TIMECOND_(IF(MODSINCE|UNMODSINCE)|LASTMOD)|IPRESOLVE_(V(4|6)|WHATEVER)|VERSION_(SSL|IPV6|KERBEROS4|LIBZ)|PUSH_(OK|DENY))|INFO_(RE(DIRECT_(COUNT|TIME)|QUEST_SIZE)|S(SL_VERIFYRESULT|TARTTRANSFER_TIME|IZE_(DOWNLOAD|UPLOAD)|PEED_(DOWNLOAD|UPLOAD))|H(TTP_CODE|EADER_(SIZE|OUT))|NAMELOOKUP_TIME|C(ON(NECT_TIME|TENT_(TYPE|LENGTH_(DOWNLOAD|UPLOAD)))|ERTINFO)|TOTAL_TIME|PR(IVATE|ETRANSFER_TIME)|EFFECTIVE_URL|FILETIME)|OPT_(R(E(SUME_FROM|TURNTRANSFER|DIR_PROTOCOLS|FERER|AD(DATA|FUNCTION))|AN(GE|DOM_FILE))|MAX(REDIRS|CONNECTS)|B(INARYTRANSFER|UFFERSIZE)|S(S(H_(HOST_PUBLIC_KEY_MD5|P(RIVATE_KEYFILE|UBLIC_KEYFILE)|AUTH_TYPES)|L(CERT(TYPE|PASSWD)?|_(CIPHER_LIST|VERIFY(HOST|PEER))|ENGINE(_DEFAULT)?|VERSION|KEY(TYPE|PASSWD)?))|TDERR)|H(TTP(GET|HEADER|200ALIASES|_VERSION|PROXYTUNNEL|AUTH)|EADER(FUNCTION)?)|N(O(BODY|SIGNAL|PROGRESS)|ETRC)|C(RLF|O(NNECTTIMEOUT(_MS)?|OKIE(SESSION|JAR|FILE)?)|USTOMREQUEST|ERTINFO|LOSEPOLICY|A(INFO|PATH))|T(RANSFERTEXT|CP_NODELAY|IME(CONDITION|OUT(_MS)?|VALUE))|I(N(TERFACE|FILE(SIZE)?)|PRESOLVE)|DNS_(CACHE_TIMEOUT|USE_GLOBAL_CACHE)|U(RL|SER(PWD|AGENT)|NRESTRICTED_AUTH|PLOAD)|P(R(IVATE|O(GRESSFUNCTION|XY(TYPE|USERPWD|PORT|AUTH)?|TOCOLS))|O(RT|ST(REDIR|QUOTE|FIELDS)?)|UT)|E(GDSOCKET|NCODING)|VERBOSE|K(RB4LEVEL|EYPASSWD)|QUOTE|F(RESH_CONNECT|TP(SSLAUTH|_(S(SL|KIP_PASV_IP)|CREATE_MISSING_DIRS|USE_EP(RT|SV)|FILEMETHOD)|PORT|LISTONLY|APPEND)|ILE(TIME)?|O(RBID_REUSE|LLOWLOCATION)|AILONERROR)|WRITE(HEADER|FUNCTION)|LOW_SPEED_(TIME|LIMIT)|AUTOREFERER)|PRO(XY_(SOCKS(4|5)|HTTP)|TO_(S(CP|FTP)|HTTP(S)?|T(ELNET|FTP)|DICT|F(TP(S)?|ILE)|LDAP(S)?|ALL))|E_(RE(CV_ERROR|AD_ERROR)|GOT_NOTHING|MALFORMAT_USER|BAD_(C(ONTENT_ENCODING|ALLING_ORDER)|PASSWORD_ENTERED|FUNCTION_ARGUMENT)|S(S(H|L_(C(IPHER|ONNECT_ERROR|ERTPROBLEM|ACERT)|PEER_CERTIFICATE|ENGINE_(SETFAILED|NOTFOUND)))|HARE_IN_USE|END_ERROR)|HTTP_(RANGE_ERROR|NOT_FOUND|PO(RT_FAILED|ST_ERROR))|COULDNT_(RESOLVE_(HOST|PROXY)|CONNECT)|T(OO_MANY_REDIRECTS|ELNET_OPTION_SYNTAX)|O(BSOLETE|UT_OF_MEMORY|PERATION_TIMEOUTED|K)|U(RL_MALFORMAT(_USER)?|N(SUPPORTED_PROTOCOL|KNOWN_TELNET_OPTION))|PARTIAL_FILE|F(TP_(BAD_DOWNLOAD_RESUME|SSL_FAILED|C(OULDNT_(RETR_FILE|GET_SIZE|S(TOR_FILE|ET_(BINARY|ASCII))|USE_REST)|ANT_(RECONNECT|GET_HOST))|USER_PASSWORD_INCORRECT|PORT_FAILED|QUOTE_ERROR|W(RITE_ERROR|EIRD_(SERVER_REPLY|227_FORMAT|USER_REPLY|PAS(S_REPLY|V_REPLY)))|ACCESS_DENIED)|ILE(SIZE_EXCEEDED|_COULDNT_READ_FILE)|UNCTION_NOT_FOUND|AILED_INIT)|WRITE_ERROR|L(IBRARY_NOT_FOUND|DAP_(SEARCH_FAILED|CANNOT_BIND|INVALID_URL))|ABORTED_BY_CALLBACK)|VERSION_NOW|FTP(METHOD_(MULTICWD|SINGLECWD|NOCWD)|SSL_(NONE|CONTROL|TRY|ALL)|AUTH_(SSL|TLS|DEFAULT))|AUTH_(GSSNEGOTIATE|BASIC|NTLM|DIGEST|ANY(SAFE)?))|I(MAGETYPE_(GIF|XBM|BMP|SW(C|F)|COUNT|TIFF_(MM|II)|I(CO|FF)|UNKNOWN|J(B2|P(X|2|C|EG(2000)?))|P(SD|NG)|WBMP|WEBP)|NPUT_(REQUEST|GET|SE(RVER|SSION)|COOKIE|POST|ENV)|CONV_(MIME_DECODE_(STRICT|CONTINUE_ON_ERROR)|IMPL|VERSION))|D(NS_(MX|S(RV|OA)|HINFO|N(S|APTR)|CNAME|TXT|PTR|A(NY|LL|AAA|6)?)|OM(STRING_SIZE_ERR|_(SYNTAX_ERR|HIERARCHY_REQUEST_ERR|N(O(_(MODIFICATION_ALLOWED_ERR|DATA_ALLOWED_ERR)|T_(SUPPORTED_ERR|FOUND_ERR))|AMESPACE_ERR)|IN(DEX_SIZE_ERR|USE_ATTRIBUTE_ERR|VALID_(MODIFICATION_ERR|STATE_ERR|CHARACTER_ERR|ACCESS_ERR))|PHP_ERR|VALIDATION_ERR|WRONG_DOCUMENT_ERR)))|JSON_(BIGINT_AS_STRING|HEX_(TAG|QUOT|A(MP|POS))|NUMERIC_CHECK|ERROR_(S(YNTAX|TATE_MISMATCH)|NONE|CTRL_CHAR|DEPTH|RECURSION|INF_OR_NAN|UNSUPPORTED_TYPE|INVALID_PROPERTY_NAME|UTF(8|16))|FORCE_OBJECT|UNESCAPED_(LINE_TERMINATORS|SLASHES|UNICODE)|OBJECT_AS_ARRAY|PARTIAL_OUTPUT_ON_ERROR|PRESERVE_ZERO_FRACTION|PRETTY_PRINT|INVALID_UTF8_(IGNORE|SUBSTITUTE)|THROW_ON_ERROR)|P(REG_(RECURSION_LIMIT_ERROR|GREP_INVERT|BA(CKTRACK_LIMIT_ERROR|D_UTF8_(OFFSET_ERROR|ERROR))|S(PLIT_(NO_EMPTY|OFFSET_CAPTURE|DELIM_CAPTURE)|ET_ORDER)|NO_ERROR|INTERNAL_ERROR|JIT_STACKLIMIT_ERROR|OFFSET_CAPTURE|PATTERN_ORDER|UNMATCHED_AS_NULL)|SFS_(PASS_ON|ERR_FATAL|F(EED_ME|LAG_(NORMAL|FLUSH_(CLOSE|INC))))|CRE_VERSION|OSIX_(R_OK|X_OK|S_IF(REG|BLK|SOCK|CHR|IFO)|F_OK|W_OK|RLIMIT_(AS|C(ORE|PU)|DATA|FSIZE|LOCKS|M(EMLOCK|SGQUEUE)|N(ICE|OFILE|PROC)|R(SS|T(PRIO|TIME))|S(IGPENDING|TACK)|INFINITY)))|F(NM_(NOESCAPE|CASEFOLD|P(ERIOD|ATHNAME))|IL(TER_(REQUIRE_(SCALAR|ARRAY)|SANITIZE_(MAGIC_QUOTES|S(TRI(NG|PPED)|PECIAL_CHARS)|NUMBER_(INT|FLOAT)|URL|E(MAIL|NCODED)|FULL_SPECIAL_CHARS)|NULL_ON_FAILURE|CALLBACK|DEFAULT|UNSAFE_RAW|VALIDATE_(REGEXP|BOOLEAN|I(NT|P)|URL|EMAIL|FLOAT)|F(ORCE_ARRAY|LAG_(S(CHEME_REQUIRED|TRIP_(BACKTICK|HIGH|LOW))|HOST_REQUIRED|NO(NE|_(RES_RANGE|PRIV_RANGE|ENCODE_QUOTES))|IPV(4|6)|PATH_REQUIRED|E(MPTY_STRING_NULL|MAIL_UNICODE|NCODE_(HIGH|LOW|AMP))|QUERY_REQUIRED|ALLOW_(SCIENTIFIC|HEX|THOUSAND|OCTAL|FRACTION))))|E(_(BINARY|SKIP_EMPTY_LINES|NO_DEFAULT_CONTEXT|TEXT|IGNORE_NEW_LINES|USE_INCLUDE_PATH|APPEND)|INFO_(RAW|MIME(_(TYPE|ENCODING))?|SYMLINK|NONE|CONTINUE|DEVICES|PRESERVE_ATIME)))|ORCE_(GZIP|DEFLATE))|LIBXML_(XINCLUDE|N(SCLEAN|O(XMLDECL|BLANKS|NET|CDATA|E(RROR|MPTYTAG|NT)|WARNING))|COMPACT|D(TD(VALID|LOAD|ATTR)|OTTED_VERSION)|PARSEHUGE|ERR_(NONE|ERROR|FATAL|WARNING)|VERSION|LOADED_VERSION|BIGLINES))\b

The expanded version:

(?x:\b(?:
  CURL_HTTP_VERSION_1_0
  | CURL_HTTP_VERSION_1_1
  | CURL_HTTP_VERSION_NONE
  # snip
  | XML_TEXT_NODE
)\b)

The sample file looked like this:

// SYNTAX TEST "Packages/PHPTest/PHP Test Compact Capturing.sublime-syntax"

CURL_HTTP_VERSION_1_0
CURL_HTTP_VERSION_1_1
CURL_HTTP_VERSION_NONE
...
XML_TEXT_NODE

It contained each of the 841 identifiers matched by the regexp. The entire list of identifiers was repeated 300 times, for a total of 252,302 lines including the syntax test declaration and subsequent blank line.

For each of the three expressions, I ran one test with set: main in the match rule and one without. I ran each test three times and took the median, rounded to the nearest millisecond. The results were as follows:

Time Test
167 ms Compact, Capturing
167 ms Compact, Noncapturing
169 ms Compact, Noncapturing
427 ms Set, Compact, Capturing
170 ms Set, Compact, Noncapturing
171 ms Set, Expanded

My conclusions:

tl;dr: Regexp formatting doesn't matter, but don't use unnecessary capture groups.

michaelblyons commented 5 years ago

If that is true, just use whatever stemming seems natural?

(?x:\b(?:
  CURL_HTTP_VERSION_(?:1_0|1_1|NONE)                                                   # Sure.
  |(?:XML_TEXT|CDATA|ENTITY(?:_REF)?)_NODE                                             # Maybe?
  |E(RROR_(RECURSIVE_ENTITY_REF|MISPLACED_XML_PI|B(INARY_ENTITY_REF|AD_CHAR_REF) ...)  # No!
)\b)
deathaxe commented 5 years ago

I confirm all your conclusions. This is exactly what I've experienced so far.

Maybe one addition:

Example:

https://github.com/sublimehq/Packages/blob/d1494f4a5a2bf8c6ae2f3e7ad762a05273df97ef/Perl/Perl.sublime-syntax#L1076

Replace the lookahead by (?!\<) and run the syntax tests against a file with 100k lines containing

print <angular quoted text>;

positive lookahead: 373ms. negative lookahead: 401ms. (+7.5%)

Thom1729 commented 5 years ago

That's interesting, because both expressions should produce the same FSM — at least, assuming that lookahead is implemented with AFAs and then turned into an NFA via the powerset construction, which is admittedly mere speculation. Maybe there's some sort of special lookahead implementation.

deathaxe commented 5 years ago

Somehow the implementation must differ as negative lookaheads work if nested in another one, but positive lhs don't. (?=(?!blabla)) is ok. In (?=bla(?=bla)) the nested one is ignored.

Thom1729 commented 5 years ago

Yeah, I just noticed that when trying to see if I could speed up the JavaScript syntax by rewriting {{identifier_break}} as a positive lookahead. This seems like a rather serious bug.

Thom1729 commented 5 years ago

I dug into this and it's even weirder. See referenced issue.

wbond commented 5 years ago

This seems like a rather serious bug.

A serious bug that more or less no one has run into since… people don't like reading regexes like that? :-D

Thom1729 commented 5 years ago

Having plumbed the depths of the bug in the course of writing #2918, my estimation of its severity has lessened, mostly because it only applies in fairly specific circumstances. If it were really as general as it seemed at first, I imagine that someone would have noticed it before now.

Do you have any insight into how capture groups and lookaheads are implemented in the internal engine? Those are the only supported features that don't have a standard, no-brainer FSM implementation, and the evidence seems to indicate that they're not simply compiled into a slightly-fancier FSM.

wbond commented 5 years ago

I don't know off of the top of my head, but I do believe at least some of those details are "proprietary". 🙂

Thom1729 commented 4 years ago

@jfcherng As the original topic seems to be moot, should this issue be closed?

jfcherng commented 4 years ago

My major concern about the PHP syntax is most likely be kind of solved in https://github.com/sublimehq/Packages/pull/2134 hence closed.