Open RunDevelopment opened 1 year ago
We need to make a list of all these optimizations. Even if they don't get into the compiler immediately, they can be supported in the IDE as inspections.
@lppedd while an inspection would be nice, it should also exist as an optimization, which works in the following situation:
let x = 'foo' | 'bar' | 'baz';
let y = 'bar' | 'quux';
x | y
The output is foo|bar|baz|bar|quux
, but should be optimized to foo|bar|baz|quux
.
The more general (and perhaps more common) optimization would be prefix merging:
'foobar' | 'foo' -> 'foo' 'bar'?
'boring' | 'boeing' -> 'bo' ('ring' | 'eing')
(The latter could even become bo[er]ing
by looking at the end as well, but this is not necessary)
The tricky part is to do this without changing precedence. It can be simplified by not using ?
and compiling the first one to foo(?:bar|)
, so we don't have to take laziness into account.
Alternatives can sometimes be reordered, but not always:
'foobar' | ['d'-'f'] | 'foo'
Here it's not allowed, since the character set includes the f
. It could be turned into the following:
['de'] | 'f' ('oobar' | '' | 'oo')
by removing the f
from the character set, but I won't implement that, since the benefit is unclear.
Note that compiling it to [de]|f(?:oo(bar)?)?
would be wrong because it has different precedence, so the foobar
and foo
alternatives can't be merged.
Is your feature request related to a problem? Please describe.
Duplicate alternatives can happen when combining different variables, or simply by accident.
Example: A subset of C keyword with
while
appearing twice:Describe the solution you'd like
Assuming no side effects (e.g. capturing groups), duplicate alternatives can be removed without affecting the behavior of the pattern (negatively).
More specifically, given an alternation, let
index(x) -> int
be a function that returns the index of a given alternative of the alternation in its list of alternatives. If there exist two alternatives a and b such that index(a) < index(b), then b can be removed without effecting the pattern.This optimization might even prevent very simple cases of exponential backtracking. E.g.
( "a" | "a" )+ "b"
.Describe alternatives you've considered
A more complete solution would be to implement a more complete DFA-based solution to determine whether a given alternative is a subset of the union of all previous alternatives in the alternation. However, this is significantly more computationally expensive and does not support irregular features such as assertions.
Additional context
The
regexp/no-dupe-disjunctions
rule is an implementation of this optimization. This rule uses the DFA approach along if a simpler regex-comparison approach to determine duplicate alternatives.