Optimization: Deduplicate alternatives

pomsky-lang / pomsky

A new, portable, regular expression language

https://pomsky-lang.org

Apache License 2.0

1.29k stars 19 forks source link

Optimization: Deduplicate alternatives #60

Open RunDevelopment opened 1 year ago

RunDevelopment commented 1 year ago

Is your feature request related to a problem? Please describe.

Duplicate alternatives can happen when combining different variables, or simply by accident.

Example: A subset of C keyword with while appearing twice:

% (
  | "break"
  | "case"
  | "char"
  | "const"
  | "continue"
  | "default"
  | "do"
  | "double"
  | "else"
  | "float"
  | "for"
  | "if"
  | "int"
  | "long"
  | "return"
  | "short"
  | "sizeof"
  | "struct"
  | "switch"
  | "void"
  | "while"
  | "while"
) %

Describe the solution you'd like

Assuming no side effects (e.g. capturing groups), duplicate alternatives can be removed without affecting the behavior of the pattern (negatively).

More specifically, given an alternation, let index(x) -> int be a function that returns the index of a given alternative of the alternation in its list of alternatives. If there exist two alternatives a and b such that index(a) < index(b), then b can be removed without effecting the pattern.

This optimization might even prevent very simple cases of exponential backtracking. E.g. ( "a" | "a" )+ "b".

Describe alternatives you've considered

A more complete solution would be to implement a more complete DFA-based solution to determine whether a given alternative is a subset of the union of all previous alternatives in the alternation. However, this is significantly more computationally expensive and does not support irregular features such as assertions.

Additional context

The regexp/no-dupe-disjunctions rule is an implementation of this optimization. This rule uses the DFA approach along if a simpler regex-comparison approach to determine duplicate alternatives.

lppedd commented 1 year ago

We need to make a list of all these optimizations. Even if they don't get into the compiler immediately, they can be supported in the IDE as inspections.

Aloso commented 1 year ago

@lppedd while an inspection would be nice, it should also exist as an optimization, which works in the following situation:

let x = 'foo' | 'bar' | 'baz';
let y = 'bar' | 'quux';

x | y

The output is foo|bar|baz|bar|quux, but should be optimized to foo|bar|baz|quux.

Aloso commented 6 days ago

The more general (and perhaps more common) optimization would be prefix merging:

'foobar' | 'foo'     -> 'foo' 'bar'?
'boring' | 'boeing'  -> 'bo' ('ring' | 'eing')

(The latter could even become bo[er]ing by looking at the end as well, but this is not necessary)

The tricky part is to do this without changing precedence. It can be simplified by not using ? and compiling the first one to foo(?:bar|), so we don't have to take laziness into account.

Alternatives can sometimes be reordered, but not always:

'foobar' | ['d'-'f'] | 'foo'

Here it's not allowed, since the character set includes the f. It could be turned into the following:

['de'] | 'f' ('oobar' | '' | 'oo')

by removing the f from the character set, but I won't implement that, since the benefit is unclear.

Note that compiling it to [de]|f(?:oo(bar)?)? would be wrong because it has different precedence, so the foobar and foo alternatives can't be merged.