pomsky-lang / pomsky

A new, portable, regular expression language
https://pomsky-lang.org
Apache License 2.0
1.28k stars 19 forks source link

Optimize single-character alternatives #58

Open RunDevelopment opened 1 year ago

RunDevelopment commented 1 year ago

Is your feature request related to a problem? Please describe.

BNF grammars (including dialects) commonly denote large character sets like this

letter = "A" | "B" | "C" | "D" | "E" | "F" | "G"
       | "H" | "I" | "J" | "K" | "L" | "M" | "N"
       | "O" | "P" | "Q" | "R" | "S" | "T" | "U"
       | "V" | "W" | "X" | "Y" | "Z" | "a" | "b"
       | "c" | "d" | "e" | "f" | "g" | "h" | "i"
       | "j" | "k" | "l" | "m" | "n" | "o" | "p"
       | "q" | "r" | "s" | "t" | "u" | "v" | "w"
       | "x" | "y" | "z" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
symbol = "[" | "]" | "{" | "}" | "(" | ")" | "<" | ">"
       | "'" | '"' | "=" | "|" | "." | "," | ";" ;
character = letter | digit | symbol | "_" ;

(More examples on Wikipedia)

While there are of course better ways to denote character ranges in pomsky, the union of character sets and list of characters (as in character in the above example) is quite common. However, pomsky currently produces quite suboptimal regexes for this pattern.

Example

Describe the solution you'd like

Please merge adjacent single-character alternatives into one character set. E.g. a|b|c -> [abc].

This optimization is particularly useful because it enabled further optimizations within character sets.

Additional context

For a reference implementation of this optimization, checkout the regexp/prefer-character-class rule. Note that this rule also does some interesting analysis to merge non-adjacent single-character alternatives.

Aloso commented 1 year ago

Thanks for the feature request!

black7375 commented 1 year ago

Here are some trasform rules & implements