jgm / skylighting

A Haskell syntax highlighting library with tokenizers derived from KDE syntax highlighting descriptions

Regex engine: add support for possessive and lazy quantifiers #109

Closed by jgm 3 years ago

jgm commented 3 years ago

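In a++, the second + causes the first quantifier + to be interpreted as "possessive": it matches as many characters as it can and never gives any back through backtracking.

Thus, a+a will match aaa but a++a will not, because a++ consumes all three a's and leaves nothing for the final a.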
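
The difference can be reproduced in any backtracking engine. The sketch below uses Python's `re` module purely for illustration (it is not skylighting's engine). Because native `a++` is only accepted from Python 3.11 on, the possessive form is emulated with the capturing-lookahead idiom `(?=(a+))\1`, which is an assumption of this example rather than anything proposed in the issue.

```python
import re

# Greedy: a+ first takes all three a's, then backtracks by one
# so that the trailing `a` can still match.
greedy = re.fullmatch(r"a+a", "aaa")

# Possessive a++ emulated as (?=(a+))\1: the lookahead captures the longest
# run of a's, the backreference consumes it, and the engine does not
# re-enter the lookahead to try a shorter run.
possessive = re.fullmatch(r"(?=(a+))\1a", "aaa")

print(greedy is not None)      # True
print(possessive is not None)  # False
```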

In a+?, the ? causes the first quantifier to be interpreted as "lazy," so that it matches as few characters as necessary for the rest of the pattern to succeed.
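
A small illustration of the lazy behaviour, again using Python's `re` module only as a convenient stand-in for a backtracking engine:

```python
import re

# Greedy a+ takes every available "a"; lazy a+? stops after the first one,
# because nothing after it in the pattern requires more.
greedy = re.match(r"a+", "aaa").group()
lazy = re.match(r"a+?", "aaa").group()

# When the rest of the pattern needs more input, the lazy quantifier
# expands one character at a time until the rest succeeds.
lazy_then_b = re.match(r"a+?b", "aaab").group()

print(greedy, lazy, lazy_then_b)  # aaa a aaab
```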

Note also that a+++ is invalid. Currently we allow quantifiers to be applied to arbitrary patterns, but they should not be allowed to stack like this.
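
One way to rule out stacking is to let the parser read, after each atom, at most one quantifier plus at most one modifier (`+` or `?`), and to reject anything quantifier-like that follows. The function below is a hypothetical sketch of that rule in Python; the name `parse_quantifier` and its return shape are made up for illustration and do not correspond to skylighting's parser.

```python
import re

# One quantifier, optionally followed by a single modifier:
# "+" (possessive) or "?" (lazy).
QUANTIFIER = re.compile(r"(\*|\+|\?|\{\d+(?:,\d*)?\})([+?]?)")
MODES = {"": "greedy", "+": "possessive", "?": "lazy"}

def parse_quantifier(suffix):
    """Parse the quantifier at the start of `suffix` (the text after an atom).

    Returns (quantifier, mode, rest), or (None, None, suffix) if there is
    no quantifier. Raises ValueError if another quantifier is stacked on top.
    """
    m = QUANTIFIER.match(suffix)
    if m is None:
        return None, None, suffix
    rest = suffix[m.end():]
    if QUANTIFIER.match(rest):
        raise ValueError("stacked quantifier in: " + suffix)
    return m.group(1), MODES[m.group(2)], rest
```

With this rule, `+`, `++` and `+?` parse as greedy, possessive and lazy respectively, while `+++` raises an error.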

jgm commented 3 years ago

Lazy quantifiers seem to be used only in:

doxygen.xml

5:    <!ENTITY sl_word ".*?(?=&wordsep;)">
6:    <!ENTITY ml_word ".*?(?=&wordsep;|\*/)">

doxygen-lua.xml

5:    <!ENTITY sl_word ".*?(?=&wordsep;)">

So failing to support these isn't going to be a huge problem.
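
For a sense of what would change, the sketch below compares `.*?(?=...)` with the same pattern read as a plain greedy `.*(?=...)`, assuming the lazy marker were simply ignored. It uses Python's `re` module, and `WORDSEP` is a hypothetical stand-in, since the real definition of `&wordsep;` is not shown here.

```python
import re

WORDSEP = r"[ ,;]"  # hypothetical stand-in for the &wordsep; entity

lazy = re.compile(r".*?(?=" + WORDSEP + r")")
greedy = re.compile(r".*(?=" + WORDSEP + r")")

line = "brief description, more text"

# Lazy stops before the first separator; greedy runs to the last one.
lazy_word = lazy.match(line).group()
greedy_word = greedy.match(line).group()

print(repr(lazy_word))    # 'brief'
print(repr(greedy_word))  # 'brief description, more'
```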

Possessive quantifiers are more widely used.

c.xml

4:    <!ENTITY int "(?:[0-9]++)">
5:    <!ENTITY hex_int "(?:[0-9A-Fa-f]++)">
236:        <RegExpr attribute="Binary" context="IntSuffix" String="0[Bb][01]++" />
237:        <RegExpr attribute="Octal" context="IntSuffix" String="0[0-7]++" />
238:        <RegExpr attribute="Decimal" context="IntSuffix" String="0(?![xXbB0-9])|[1-9][0-9]*+" />
239:        <RegExpr attribute="Error" context="#pop" String="[._0-9A-Za-z']++" />

However, these are all at the end of the pattern, where nothing follows that could force backtracking, so the difference between possessive and non-possessive seems irrelevant.
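
A quick spot check of this claim, using the octal pattern `0[0-7]++` from c.xml. This is a sketch in Python's `re` module, with the possessive form emulated through the `(?=(...))\1` idiom. The sample inputs are arbitrary, and agreement on them is only a sanity check, not a proof.

```python
import re

greedy = re.compile(r"0[0-7]+")
# Possessive 0[0-7]++ emulated: the lookahead captures the longest digit
# run and the backreference consumes it without further backtracking.
possessive = re.compile(r"0(?=([0-7]+))\1")

def matched_text(pattern, s):
    """Return the text matched at the start of s, or None if there is no match."""
    m = pattern.match(s)
    return m.group() if m else None

samples = ["0", "07", "0777", "0778", "017u", "08"]
agreement = {s: matched_text(greedy, s) == matched_text(possessive, s)
             for s in samples}

print(all(agreement.values()))  # True
```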

fasm.xml

skylighting-core/xml/fasm.xml
4:    <!ENTITY float "[0-9]++(?:\.[0-9]*+(?:e[-+]?[0-9]*+)?|(?=f\b)|e[-+]?[0-9]++\b)">
6:    <!ENTITY hex_cont "[0-9a-f]*+(?=h)">
7:    <!ENTITY oct_hex_cont "[0-7]*+(?:&hex_cont;|(?=[oqh]))">
10:    <!ENTITY bin_oct_hex "[01]*+(?:&oct_hex_cont;|(?=[byoqh]))">
11:    <!ENTITY baseN "0x[0-9a-f]*+|&bin_oct_hex;|&oct_hex;|&hex;">
13:    <!ENTITY number "[0-9]*+(?:to[0-9]+|(?=d?))">
1613:        <RegExpr attribute="Label" context="Instruction" String="(?:(?:\.|@@|&#37;&#37;)(?:\.@)?)?[A-Za-z_][A-Za-z0-9_.]*+:|\.:|@@:" firstNonSpace="1"/>

(also nasm.xml)

isocpp.xml

4:    <!ENTITY int "(?:[0-9](?:'?[0-9]++)*+)">
5:    <!ENTITY hex_int "(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]++)*+)">
32:    extensions="*.c++;*.cxx;*.cpp;*.cc;*.C;*.h;*.hh;*.H;*.h++;*.hxx;*.hpp;*.hcc;*.moc"
401:        <RegExpr attribute="Binary" context="IntSuffix" String="0[Bb][01](?:'?[01]++)*+" />
402:        <RegExpr attribute="Octal" context="IntSuffix" String="0(?:'?[0-7]++)++" />
403:        <RegExpr attribute="Decimal" context="IntSuffix" String="0(?![xXbB0-9])|[1-9](?:'?[0-9]++)*+" />
404:        <RegExpr attribute="Error" context="#pop" String="[._0-9A-Za-z']++" />

scheme.xml

4:  <!ENTITY xmlattrs "\s+([^&quot;/>]++|&quot;[^&quot;]*+&quot;)*+">
6:  <!ENTITY regex    "(?:[^\\(\[/]++|\\.|\[\^?\]?([^\\\[\]]++|\\.|\[(:[^:]+:\])?)++\]|\((\?R)\))+">
10:  <!ENTITY initial_others "\\x[0-9a-fA-F]++;|(?![\x01-\x7f])[&initial_unicode_set;]">
13:  <!ENTITY symbol "(?:&initial;&subsequent;*+)">
826:        <RegExpr attribute="Float" context="#stay" String="[0-9]*+\.[0-9]++([esfdl][+-]?[0-9]++)?|[0-9]++[esfdl][+-]?[0-9]++"/>
827:        <RegExpr attribute="Decimal" context="#stay" String="[0-9]++"/>
908:        <RegExpr attribute="XML Tag" context="#pop!XMLTag" String="#&lt;[^\s>]++"/>
927:        <RegExpr attribute="XML Attribute" context="XMLAttribute" String="[^\s=/>]++\s*"/>
942:        <RegExpr attribute="XML Tag" context="#pop!XMLTag" String="&lt;[^\s>]++"/>
1050:        <RegExpr attribute="Float" context="#pop" String="[0-9]*+\.[0-9]++([esfdl][+-]?[0-9]++)?|[0-9]++[esfdl][+-]?[0-9]++"/>
1051:        <RegExpr attribute="Decimal" context="#pop" String="[0-9]++"/>
1152:        <RegExpr attribute="Comment" context="#pop!DatumCommentXMLTag" String="[^\s>]++\s*"/>
1177:        <RegExpr attribute="Comment" context="DatumCommentXMLAttribute" String="[^\s=/>]++\s*"/>
1192:        <RegExpr attribute="Comment" context="#pop!DatumCommentXMLTag" String="&lt;[^\s>]++"/>
1241:        <RegExpr attribute="Comment" context="#stay" String="(?:[^\\&quot;]++|\\.)++"/>