Closed jfcherng closed 4 years ago
I might have misunderstood the performance tests Briles did that are linked from #1740, but I thought the outcome was that "compacted" regex keywords were actually slower than human-readable ones.
Since that PR was merged, I assume that @wbond agreed to expand those compact regexes. If the performance becomes even better after expanding them, then the only gain for a compact one is probably the cached file size which is mentioned in https://github.com/sublimehq/Packages/pull/1740#issuecomment-449558473 by @deathaxe.
The problem becomes that does the cache file size matter? What stops us from expanding them?
What stops us from expanding them?
I believe the only reason that stops us from doing that currently is that nobody's taken the time to do it (and written a script that automates the process). Note that some human discretion is required to declare a pattern as "human readable" and that human-readable and compacted regexes aren't exclusive.
Has anyone replicated the original performance tests using non-capturing groups?
I believe the only reason from doing that currently is that nobody's taken the time to do it (and written a script that automates the process).
https://github.com/asciimoo/exrex can be used to expand those compacted regexes.
But beware that if there is a dash in the regex... \b(?:A|A-B)\b
matches only A
in string A-B
.
Sort components with the following compare function seems to be good.
# used to prevent the following situation (for example, for CSS syntax)
# if there is a dash in the regex, "\b(?:A|A-B)\b" matches only "A" in string "A-B".
def token_sort(a: str, b: str) -> float:
def has_punctuation(s: str) -> bool:
return not re.match(r"^[0-9a-zA-Z_]+$", s)
if a == b:
return 0
if a.startswith(b) and has_punctuation(a):
return -1
if b.startswith(a) and has_punctuation(b):
return 1
if a > b:
return 1
return -1
Note that some human discretion is required to declare a pattern as "human readable" and that human-readable and compacted regexes aren't exclusive.
I think we do just like that PR https://github.com/sublimehq/Packages/pull/349/files. Each token should be plain text (no regex is allowed, not even parentheses).
Has anyone replicated the original performance tests using non-capturing groups?
Interesting. There are indeed quite lots of unused capture groups in the original comparison test.
Has anyone replicated the original performance tests using non-capturing groups?
I use the test case privided in https://github.com/sublimehq/Packages/pull/349#issuecomment-216545283.
The average time it takes to parse the test case over 10 runs on my machine:
Current master: 29X ms
Current master + Undo the PR: 29X ms
Current master + Undo the PR + Use non-capture groups (?:)
: 29X ms
Current master + Undo the PR + Max Compacted
: 29X ms
All cases are quite the same.
In @Briles' benchmark results, the best case (current master) is 348.58 ms and the worst case Max Compacted
is 477.68 ms. I cannot reproduce that difference. In my test, the Max Compacted
case is no noticeable difference from a Usage Sorted (current master branch's, untouched)
one.
Here is the fully max compacted
version CSS syntax modified from the current master branch. But, again, no noticeable performance difference, still 29X ms.
The cached file size Sublime Text 3/Cache/CSS/CSS.sublime-syntax.rcache
:
fully max compacted
: 123 KBI've always been under the impression that Sublime compiled all of the regexps from a context into a single finite state machine. Once this is done, matching is incredibly simple, and performance is guaranteed linear in the number of characters examined:
bool match(
int initialState,
int* doesStateAccept,
int** transitionTable,
char* string,
int begin,
int end,
) {
int state = initialState;
for (int i = begin; i < end; i++) {
state = transitionTable[state][string[i]];
if (doesStateAccept[state]) return true;
}
return false;
}
If two regexps are equivalent (match the same set of strings), then they should produce identical FSMs, no matter how thoroughly golfed or “optimized” the expressions may be. This is in contrast to slower-but-more-powerful backtracking-based engines like Oniguruma, in which the structure of the expression can greatly affect performance. (This is why Oniguruma has atomic groups, but FSM's don't need them.)
The above does not account for captures, because I don't know how Sublime implements them. Clearly, tracking capture groups requires more work than not tracking them requires. And we know that Sublime doesn't optimize away unnecessary capture groups, because that would visibly affect tokenization. In #1614, we concluded that these extra captures do measurably affect performance, although perhaps the difference is not noticeable at smaller scales.
If rearranging a capture-less, non-Oniguruma regexp without changing its meaning has any consistent measurable effect on parsing performance, then this calls my understanding of Sublime's regexp engine into serious question, and I thought I had a pretty solid handle on this. Perhaps @wbond or someone from Sublime HQ might be able to confirm or refute my assumptions. If it would affect the way we should write the core syntaxes, then it would be nice to get an authoritative statement on this.
There are quite lots of compacted regex in PHP's syntax.
Now I have a fully expanded PHP syntax file.
The benchmarked file is syntax_test_php.php
.
No noticeable difference.
PHP.sublime-syntax.rcache
sizesI do another test with this file to test what happens if there are LOTS of functions in a PHP file. I collect all functions that are expanded into the single PHP file as a test case.
Hmm... 25% extra cost in this case.
Those two syntaxes don't seem to be equivalent. For instance, the support.constant.std.php
expression differs: the expanded list contains __COMPILER_HALT_OFFSET__
, whereas the compacted version does not seem to match that identifier.
I ran my own benchmark to zero in on the regexp format question. The test syntaxes are all of the following form:
%YAML 1.2
---
name: PHP Text X
scope: source.php
variables:
name: # expression
contexts:
main:
- match: '{{name}}'
scope: region.redish
# set: main
The expression is some representation of the support.constant.ext.php
expression from the core PHP syntax: either the compact representation as it is in the syntax, or that version with the capturing groups replaced by noncapturing groups, or @jfcherng's fully expanded version. I removed the punctuation match. The compact version with captures:
\b(GLOB_(MARK|BRACE|NO(SORT|CHECK|ESCAPE)|ONLYDIR|ERR|AVAILABLE_FLAGS)|XML_(SAX_IMPL|HTML_DOCUMENT_NODE|N(OTATION_NODE|AMESPACE_DECL_NODE)|C(OMMENT_NODE|DATA_SECTION_NODE)|TEXT_NODE|OPTION_(SKIP_(TAGSTART|WHITE)|CASE_FOLDING|TARGET_ENCODING)|D(TD_NODE|OCUMENT_(NODE|TYPE_NODE|FRAG_NODE))|PI_NODE|E(RROR_(RECURSIVE_ENTITY_REF|MISPLACED_XML_PI|B(INARY_ENTITY_REF|AD_CHAR_REF)|SYNTAX|NO(NE|_(MEMORY|ELEMENTS))|TAG_MISMATCH|IN(CORRECT_ENCODING|VALID_TOKEN)|DUPLICATE_ATTRIBUTE|UN(CLOSED_(CDATA_SECTION|TOKEN)|DEFINED_ENTITY|KNOWN_ENCODING)|JUNK_AFTER_DOC_ELEMENT|PAR(TIAL_CHAR|AM_ENTITY_REF)|EXTERNAL_ENTITY_HANDLING|A(SYNC_ENTITY|TTRIBUTE_EXTERNAL_ENTITY_REF))|NTITY_(REF_NODE|NODE|DECL_NODE)|LEMENT_(NODE|DECL_NODE))|LOCAL_NAMESPACE|ATTRIBUTE_(N(MTOKEN(S)?|O(TATION|DE))|CDATA|ID(REF(S)?)?|DECL_NODE|EN(TITY|UMERATION)))|M(HASH_(RIPEMD(1(28|60)|256|320)|GOST|MD(2|4|5)|S(HA(1|2(24|56)|384|512)|NEFRU256)|HAVAL(1(28|92|60)|2(24|56))|CRC32(B)?|TIGER(1(28|60))?|WHIRLPOOL|ADLER32)|YSQL(_(BOTH|NUM|CLIENT_(SSL|COMPRESS|I(GNORE_SPACE|NTERACTIVE))|ASSOC)|I_(RE(PORT_(STRICT|INDEX|OFF|ERROR|ALL)|FRESH_(GRANT|MASTER|BACKUP_LOG|S(TATUS|LAVE)|HOSTS|T(HREADS|ABLES)|LOG)|AD_DEFAULT_(GROUP|FILE))|GROUP_FLAG|MULTIPLE_KEY_FLAG|B(INARY_FLAG|OTH|LOB_FLAG)|S(T(MT_ATTR_(CURSOR_TYPE|UPDATE_MAX_LENGTH|PREFETCH_ROWS)|ORE_RESULT)|E(RVER_QUERY_(NO_(GOOD_INDEX_USED|INDEX_USED)|WAS_SLOW)|T_(CHARSET_NAME|FLAG)))|N(O(_D(EFAULT_VALUE_FLAG|ATA)|T_NULL_FLAG)|UM(_FLAG)?)|C(URSOR_TYPE_(READ_ONLY|SCROLLABLE|NO_CURSOR|FOR_UPDATE)|LIENT_(SSL|NO_SCHEMA|COMPRESS|I(GNORE_SPACE|NTERACTIVE)|FOUND_ROWS))|T(YPE_(GEOMETRY|MEDIUM_BLOB|B(IT|LOB)|S(HORT|TRING|ET)|YEAR|N(ULL|EWD(ECIMAL|ATE))|CHAR|TI(ME(STAMP)?|NY(_BLOB)?)|INT(24|ERVAL)|D(OUBLE|ECIMAL|ATE(TIME)?)|ENUM|VAR_STRING|FLOAT|LONG(_BLOB|LONG)?)|IMESTAMP_FLAG)|INIT_COMMAND|ZEROFILL_FLAG|O(N_UPDATE_NOW_FLAG|PT_(NET_(READ_BUFFER_SIZE|CMD_BUFFER_SIZE)|CONNECT_TIMEOUT|INT_AND_FLOAT_NATIVE|LOCAL_INFILE))|D(EBUG_TRACE_ENABLED|ATA_TRUNCATED)|U(SE_RESULT|N(SIGNED_FLAG|IQUE_KEY_FLAG))|P(RI_KEY_FLAG|ART_KEY_FLAG)|ENUM_FLAG|A(S(SOC|YNC)|UTO_INCREMENT_FLAG)))|CRYPT_(R(C(2|6)|IJNDAEL_(1(28|92)|256)|AND)|GOST|XTEA|M(ODE_(STREAM|NOFB|C(BC|FB)|OFB|ECB)|ARS)|BLOWFISH(_COMPAT)?|S(ERPENT|KIPJACK|AFER(128|PLUS|64))|C(RYPT|AST_(128|256))|T(RIPLEDES|HREEWAY|WOFISH)|IDEA|3DES|DE(S|CRYPT|V_(RANDOM|URANDOM))|PANAMA|EN(CRYPT|IGNA)|WAKE|LOKI97|ARCFOUR(_IV)?))|S(TREAM_(REPORT_ERRORS|M(UST_SEEK|KDIR_RECURSIVE)|BUFFER_(NONE|FULL|LINE)|S(HUT_(RD(WR)?|WR)|OCK_(R(DM|AW)|S(TREAM|EQPACKET)|DGRAM)|ERVER_(BIND|LISTEN))|NOTIFY_(RE(SOLVE|DIRECTED)|MIME_TYPE_IS|SEVERITY_(INFO|ERR|WARN)|CO(MPLETED|NNECT)|PROGRESS|F(ILE_SIZE_IS|AILURE)|AUTH_RE(SULT|QUIRED))|C(RYPTO_METHOD_(SSLv(2(_(SERVER|CLIENT)|3_(SERVER|CLIENT))|3_(SERVER|CLIENT))|TLS_(SERVER|CLIENT))|LIENT_(CONNECT|PERSISTENT|ASYNC_CONNECT)|AST_(FOR_SELECT|AS_STREAM))|I(GNORE_URL|S_URL|PPROTO_(RAW|TCP|I(CMP|P)|UDP))|O(OB|PTION_(READ_(BUFFER|TIMEOUT)|BLOCKING|WRITE_BUFFER))|U(RL_STAT_(QUIET|LINK)|SE_PATH)|P(EEK|F_(INET(6)?|UNIX))|ENFORCE_SAFE_MODE|FILTER_(READ|WRITE|ALL))|UNFUNCS_RET_(STRING|TIMESTAMP|DOUBLE)|QLITE(_(R(OW|EADONLY)|MIS(MATCH|USE)|B(OTH|USY)|SCHEMA|N(O(MEM|T(FOUND|ADB)|LFS)|UM)|C(O(RRUPT|NSTRAINT)|ANTOPEN)|TOOBIG|I(NTER(RUPT|NAL)|OERR)|OK|DONE|P(ROTOCOL|ERM)|E(RROR|MPTY)|F(ORMAT|ULL)|LOCKED|A(BORT|SSOC|UTH))|3_(B(OTH|LOB)|NU(M|LL)|TEXT|INTEGER|OPEN_(READ(ONLY|WRITE)|CREATE)|FLOAT|ASSOC)))|CURL(M(SG_DONE|_(BAD_(HANDLE|EASY_HANDLE)|CALL_MULTI_PERFORM|INTERNAL_ERROR|O(UT_OF_MEMORY|K))|OPT_PUSHFUNCTION)|SSH_AUTH_(HOST|NONE|DEFAULT|P(UBLICKEY|ASSWORD)|KEYBOARD)|CLOSEPOLICY_(SLOWEST|CALLBACK|OLDEST|LEAST_(RECENTLY_USED|TRAFFIC))|_(HTTP_VERSION_(1_(1|0)|NONE)|NETRC_(REQUIRED|IGNORED|OPTIONAL)|TIMECOND_(IF(MODSINCE|UNMODSINCE)|LASTMOD)|IPRESOLVE_(V(4|6)|WHATEVER)|VERSION_(SSL|IPV6|KERBEROS4|LIBZ)|PUSH_(OK|DENY))|INFO_(RE(DIRECT_(COUNT|TIME)|QUEST_SIZE)|S(SL_VERIFYRESULT|TARTTRANSFER_TIME|IZE_(DOWNLOAD|UPLOAD)|PEED_(DOWNLOAD|UPLOAD))|H(TTP_CODE|EADER_(SIZE|OUT))|NAMELOOKUP_TIME|C(ON(NECT_TIME|TENT_(TYPE|LENGTH_(DOWNLOAD|UPLOAD)))|ERTINFO)|TOTAL_TIME|PR(IVATE|ETRANSFER_TIME)|EFFECTIVE_URL|FILETIME)|OPT_(R(E(SUME_FROM|TURNTRANSFER|DIR_PROTOCOLS|FERER|AD(DATA|FUNCTION))|AN(GE|DOM_FILE))|MAX(REDIRS|CONNECTS)|B(INARYTRANSFER|UFFERSIZE)|S(S(H_(HOST_PUBLIC_KEY_MD5|P(RIVATE_KEYFILE|UBLIC_KEYFILE)|AUTH_TYPES)|L(CERT(TYPE|PASSWD)?|_(CIPHER_LIST|VERIFY(HOST|PEER))|ENGINE(_DEFAULT)?|VERSION|KEY(TYPE|PASSWD)?))|TDERR)|H(TTP(GET|HEADER|200ALIASES|_VERSION|PROXYTUNNEL|AUTH)|EADER(FUNCTION)?)|N(O(BODY|SIGNAL|PROGRESS)|ETRC)|C(RLF|O(NNECTTIMEOUT(_MS)?|OKIE(SESSION|JAR|FILE)?)|USTOMREQUEST|ERTINFO|LOSEPOLICY|A(INFO|PATH))|T(RANSFERTEXT|CP_NODELAY|IME(CONDITION|OUT(_MS)?|VALUE))|I(N(TERFACE|FILE(SIZE)?)|PRESOLVE)|DNS_(CACHE_TIMEOUT|USE_GLOBAL_CACHE)|U(RL|SER(PWD|AGENT)|NRESTRICTED_AUTH|PLOAD)|P(R(IVATE|O(GRESSFUNCTION|XY(TYPE|USERPWD|PORT|AUTH)?|TOCOLS))|O(RT|ST(REDIR|QUOTE|FIELDS)?)|UT)|E(GDSOCKET|NCODING)|VERBOSE|K(RB4LEVEL|EYPASSWD)|QUOTE|F(RESH_CONNECT|TP(SSLAUTH|_(S(SL|KIP_PASV_IP)|CREATE_MISSING_DIRS|USE_EP(RT|SV)|FILEMETHOD)|PORT|LISTONLY|APPEND)|ILE(TIME)?|O(RBID_REUSE|LLOWLOCATION)|AILONERROR)|WRITE(HEADER|FUNCTION)|LOW_SPEED_(TIME|LIMIT)|AUTOREFERER)|PRO(XY_(SOCKS(4|5)|HTTP)|TO_(S(CP|FTP)|HTTP(S)?|T(ELNET|FTP)|DICT|F(TP(S)?|ILE)|LDAP(S)?|ALL))|E_(RE(CV_ERROR|AD_ERROR)|GOT_NOTHING|MALFORMAT_USER|BAD_(C(ONTENT_ENCODING|ALLING_ORDER)|PASSWORD_ENTERED|FUNCTION_ARGUMENT)|S(S(H|L_(C(IPHER|ONNECT_ERROR|ERTPROBLEM|ACERT)|PEER_CERTIFICATE|ENGINE_(SETFAILED|NOTFOUND)))|HARE_IN_USE|END_ERROR)|HTTP_(RANGE_ERROR|NOT_FOUND|PO(RT_FAILED|ST_ERROR))|COULDNT_(RESOLVE_(HOST|PROXY)|CONNECT)|T(OO_MANY_REDIRECTS|ELNET_OPTION_SYNTAX)|O(BSOLETE|UT_OF_MEMORY|PERATION_TIMEOUTED|K)|U(RL_MALFORMAT(_USER)?|N(SUPPORTED_PROTOCOL|KNOWN_TELNET_OPTION))|PARTIAL_FILE|F(TP_(BAD_DOWNLOAD_RESUME|SSL_FAILED|C(OULDNT_(RETR_FILE|GET_SIZE|S(TOR_FILE|ET_(BINARY|ASCII))|USE_REST)|ANT_(RECONNECT|GET_HOST))|USER_PASSWORD_INCORRECT|PORT_FAILED|QUOTE_ERROR|W(RITE_ERROR|EIRD_(SERVER_REPLY|227_FORMAT|USER_REPLY|PAS(S_REPLY|V_REPLY)))|ACCESS_DENIED)|ILE(SIZE_EXCEEDED|_COULDNT_READ_FILE)|UNCTION_NOT_FOUND|AILED_INIT)|WRITE_ERROR|L(IBRARY_NOT_FOUND|DAP_(SEARCH_FAILED|CANNOT_BIND|INVALID_URL))|ABORTED_BY_CALLBACK)|VERSION_NOW|FTP(METHOD_(MULTICWD|SINGLECWD|NOCWD)|SSL_(NONE|CONTROL|TRY|ALL)|AUTH_(SSL|TLS|DEFAULT))|AUTH_(GSSNEGOTIATE|BASIC|NTLM|DIGEST|ANY(SAFE)?))|I(MAGETYPE_(GIF|XBM|BMP|SW(C|F)|COUNT|TIFF_(MM|II)|I(CO|FF)|UNKNOWN|J(B2|P(X|2|C|EG(2000)?))|P(SD|NG)|WBMP|WEBP)|NPUT_(REQUEST|GET|SE(RVER|SSION)|COOKIE|POST|ENV)|CONV_(MIME_DECODE_(STRICT|CONTINUE_ON_ERROR)|IMPL|VERSION))|D(NS_(MX|S(RV|OA)|HINFO|N(S|APTR)|CNAME|TXT|PTR|A(NY|LL|AAA|6)?)|OM(STRING_SIZE_ERR|_(SYNTAX_ERR|HIERARCHY_REQUEST_ERR|N(O(_(MODIFICATION_ALLOWED_ERR|DATA_ALLOWED_ERR)|T_(SUPPORTED_ERR|FOUND_ERR))|AMESPACE_ERR)|IN(DEX_SIZE_ERR|USE_ATTRIBUTE_ERR|VALID_(MODIFICATION_ERR|STATE_ERR|CHARACTER_ERR|ACCESS_ERR))|PHP_ERR|VALIDATION_ERR|WRONG_DOCUMENT_ERR)))|JSON_(BIGINT_AS_STRING|HEX_(TAG|QUOT|A(MP|POS))|NUMERIC_CHECK|ERROR_(S(YNTAX|TATE_MISMATCH)|NONE|CTRL_CHAR|DEPTH|RECURSION|INF_OR_NAN|UNSUPPORTED_TYPE|INVALID_PROPERTY_NAME|UTF(8|16))|FORCE_OBJECT|UNESCAPED_(LINE_TERMINATORS|SLASHES|UNICODE)|OBJECT_AS_ARRAY|PARTIAL_OUTPUT_ON_ERROR|PRESERVE_ZERO_FRACTION|PRETTY_PRINT|INVALID_UTF8_(IGNORE|SUBSTITUTE)|THROW_ON_ERROR)|P(REG_(RECURSION_LIMIT_ERROR|GREP_INVERT|BA(CKTRACK_LIMIT_ERROR|D_UTF8_(OFFSET_ERROR|ERROR))|S(PLIT_(NO_EMPTY|OFFSET_CAPTURE|DELIM_CAPTURE)|ET_ORDER)|NO_ERROR|INTERNAL_ERROR|JIT_STACKLIMIT_ERROR|OFFSET_CAPTURE|PATTERN_ORDER|UNMATCHED_AS_NULL)|SFS_(PASS_ON|ERR_FATAL|F(EED_ME|LAG_(NORMAL|FLUSH_(CLOSE|INC))))|CRE_VERSION|OSIX_(R_OK|X_OK|S_IF(REG|BLK|SOCK|CHR|IFO)|F_OK|W_OK|RLIMIT_(AS|C(ORE|PU)|DATA|FSIZE|LOCKS|M(EMLOCK|SGQUEUE)|N(ICE|OFILE|PROC)|R(SS|T(PRIO|TIME))|S(IGPENDING|TACK)|INFINITY)))|F(NM_(NOESCAPE|CASEFOLD|P(ERIOD|ATHNAME))|IL(TER_(REQUIRE_(SCALAR|ARRAY)|SANITIZE_(MAGIC_QUOTES|S(TRI(NG|PPED)|PECIAL_CHARS)|NUMBER_(INT|FLOAT)|URL|E(MAIL|NCODED)|FULL_SPECIAL_CHARS)|NULL_ON_FAILURE|CALLBACK|DEFAULT|UNSAFE_RAW|VALIDATE_(REGEXP|BOOLEAN|I(NT|P)|URL|EMAIL|FLOAT)|F(ORCE_ARRAY|LAG_(S(CHEME_REQUIRED|TRIP_(BACKTICK|HIGH|LOW))|HOST_REQUIRED|NO(NE|_(RES_RANGE|PRIV_RANGE|ENCODE_QUOTES))|IPV(4|6)|PATH_REQUIRED|E(MPTY_STRING_NULL|MAIL_UNICODE|NCODE_(HIGH|LOW|AMP))|QUERY_REQUIRED|ALLOW_(SCIENTIFIC|HEX|THOUSAND|OCTAL|FRACTION))))|E(_(BINARY|SKIP_EMPTY_LINES|NO_DEFAULT_CONTEXT|TEXT|IGNORE_NEW_LINES|USE_INCLUDE_PATH|APPEND)|INFO_(RAW|MIME(_(TYPE|ENCODING))?|SYMLINK|NONE|CONTINUE|DEVICES|PRESERVE_ATIME)))|ORCE_(GZIP|DEFLATE))|LIBXML_(XINCLUDE|N(SCLEAN|O(XMLDECL|BLANKS|NET|CDATA|E(RROR|MPTYTAG|NT)|WARNING))|COMPACT|D(TD(VALID|LOAD|ATTR)|OTTED_VERSION)|PARSEHUGE|ERR_(NONE|ERROR|FATAL|WARNING)|VERSION|LOADED_VERSION|BIGLINES))\b
The expanded version:
(?x:\b(?:
CURL_HTTP_VERSION_1_0
| CURL_HTTP_VERSION_1_1
| CURL_HTTP_VERSION_NONE
# snip
| XML_TEXT_NODE
)\b)
The sample file looked like this:
// SYNTAX TEST "Packages/PHPTest/PHP Test Compact Capturing.sublime-syntax"
CURL_HTTP_VERSION_1_0
CURL_HTTP_VERSION_1_1
CURL_HTTP_VERSION_NONE
...
XML_TEXT_NODE
It contained each of the 841 identifiers matched by the regexp. The entire list of identifiers was repeated 300 times, for a total of 252,302 lines including the syntax test declaration and subsequent blank line.
For each of the three expressions, I ran one test with set: main
in the match rule and one without. I ran each test three times and took the median, rounded to the nearest millisecond. The results were as follows:
Time | Test |
---|---|
167 ms | Compact, Capturing |
167 ms | Compact, Noncapturing |
169 ms | Compact, Noncapturing |
427 ms | Set, Compact, Capturing |
170 ms | Set, Compact, Noncapturing |
171 ms | Set, Expanded |
My conclusions:
set
is cheap when there are no meta scopes or cleared scopes to deal with. Adding the set
rule had only a trivial impact on performance — except in the case of the capturing expression, because it stopped Sublime from ignoring the unused capture groups. Presumably the reason for this is so that pushed contexts can refer to the captures with faux-backreferences.tl;dr: Regexp formatting doesn't matter, but don't use unnecessary capture groups.
If that is true, just use whatever stemming seems natural?
(?x:\b(?:
CURL_HTTP_VERSION_(?:1_0|1_1|NONE) # Sure.
|(?:XML_TEXT|CDATA|ENTITY(?:_REF)?)_NODE # Maybe?
|E(RROR_(RECURSIVE_ENTITY_REF|MISPLACED_XML_PI|B(INARY_ENTITY_REF|AD_CHAR_REF) ...) # No!
)\b)
I confirm all your conclusions. This is exactly what I've experienced so far.
Maybe one addition:
Example:
Replace the lookahead by (?!\<)
and run the syntax tests against a file with 100k lines containing
print <angular quoted text>;
positive lookahead: 373ms. negative lookahead: 401ms. (+7.5%)
That's interesting, because both expressions should produce the same FSM — at least, assuming that lookahead is implemented with AFAs and then turned into an NFA via the powerset construction, which is admittedly mere speculation. Maybe there's some sort of special lookahead implementation.
Somehow the implementation must differ as negative lookaheads work if nested in another one, but positive lhs don't. (?=(?!blabla))
is ok. In (?=bla(?=bla))
the nested one is ignored.
Yeah, I just noticed that when trying to see if I could speed up the JavaScript syntax by rewriting {{identifier_break}}
as a positive lookahead. This seems like a rather serious bug.
I dug into this and it's even weirder. See referenced issue.
This seems like a rather serious bug.
A serious bug that more or less no one has run into since… people don't like reading regexes like that? :-D
Having plumbed the depths of the bug in the course of writing #2918, my estimation of its severity has lessened, mostly because it only applies in fairly specific circumstances. If it were really as general as it seemed at first, I imagine that someone would have noticed it before now.
Do you have any insight into how capture groups and lookaheads are implemented in the internal engine? Those are the only supported features that don't have a standard, no-brainer FSM implementation, and the evidence seems to indicate that they're not simply compiled into a slightly-fancier FSM.
I don't know off of the top of my head, but I do believe at least some of those details are "proprietary". 🙂
@jfcherng As the original topic seems to be moot, should this issue be closed?
My major concern about the PHP syntax is most likely be kind of solved in https://github.com/sublimehq/Packages/pull/2134 hence closed.
Problem
The original discussion is at https://github.com/sublimehq/Packages/pull/1740#issuecomment-449503763.
As a tl;dr, please see this diff. It's really hard to just add new tokens into old rules because those regexes are "optimized" with the cost of readability (maintainability).
Proposed Solution
We store those tokens in a new file, say
parser-token.txt
which looks likeWe write a
.sublime-syntax
template, sayparser-token.tpl.yaml
, for it.Then, we write a python script, say
compile_templates.py
, to do the following jobs.parser-token.txt
parser-token.tpl.yaml
parser-token.txt
generate compiled syntax file
parser-token.sublime-syntax
with the template and the regexPHP Source.sublime-syntax
to include the compiled syntax fileNow, we maintain
parser-token.txt
rather than the unreadable optimized regex.Proof of Concept
https://github.com/jfcherng/Packages/tree/poc-regex-optimize/PHP/Tokens
Related Scripts