pomsky-lang / pomsky

A new, portable, regular expression language
https://pomsky-lang.org
Apache License 2.0

Optimization: Deduplicate alternatives #60

Open RunDevelopment opened 1 year ago

RunDevelopment commented 1 year ago

Is your feature request related to a problem? Please describe.

Duplicate alternatives can happen when combining different variables, or simply by accident.

Example: A subset of the C keywords, with "while" appearing twice:

% (
  | "break"
  | "case"
  | "char"
  | "const"
  | "continue"
  | "default"
  | "do"
  | "double"
  | "else"
  | "float"
  | "for"
  | "if"
  | "int"
  | "long"
  | "return"
  | "short"
  | "sizeof"
  | "struct"
  | "switch"
  | "void"
  | "while"
  | "while"
) %

Describe the solution you'd like

Assuming no side effects (e.g. capturing groups), duplicate alternatives can be removed without changing what the pattern matches.

More specifically, given an alternation, let index(x) -> int be a function that returns the index of a given alternative in the alternation's list of alternatives. If there exist two alternatives a and b such that a and b are identical and index(a) < index(b), then b can be removed without affecting the pattern.
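A minimal Python sketch of this rule, assuming every alternative is a plain string literal without side effects. `dedupe_alternatives` is a hypothetical helper name for illustration and is not part of Pomsky.

```python
def dedupe_alternatives(alternatives):
    """Drop every alternative that equals an earlier one, preserving order."""
    seen = set()
    result = []
    for alt in alternatives:
        if alt in seen:
            # An identical alternative with a smaller index already exists,
            # so this one can never be the first to match.
            continue
        seen.add(alt)
        result.append(alt)
    return result


keywords = ["switch", "void", "while", "while"]
print(dedupe_alternatives(keywords))  # ['switch', 'void', 'while']
```

Keeping the first occurrence matters because alternatives are tried in order, so removing the later copy leaves the order of the remaining alternatives unchanged.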

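This optimization might even prevent very simple cases of exponential backtracking, e.g. in ( "a" | "a" )+ "b": on an input consisting only of "a"s, a backtracking engine can try both alternatives at every position before it gives up.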
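To illustrate the backtracking cost, here is a toy Python matcher for patterns of the shape ( alt1 | alt2 | ... )+ "b" that counts how many alternatives it tries. It is only a sketch under simplifying assumptions (non-empty string literals, naive backtracking) and does not model any particular regex engine; `count_attempts` is a hypothetical name.

```python
def count_attempts(alternatives, text):
    """Naively match (alt1|alt2|...)+ "b" at the start of `text`.

    Returns (matched, attempts), where `attempts` is the number of
    alternatives that were tried. Alternatives must be non-empty strings.
    """
    attempts = 0

    def match_from(pos, iterations):
        nonlocal attempts
        # Greedy: first try to repeat the group once more.
        for alt in alternatives:
            attempts += 1
            if text.startswith(alt, pos) and match_from(pos + len(alt), iterations + 1):
                return True
        # Otherwise the group must have matched at least once, followed by "b".
        return iterations >= 1 and text.startswith("b", pos)

    matched = match_from(0, 0)
    return matched, attempts


text = "a" * 10  # no trailing "b", so every path fails
print(count_attempts(["a", "a"], text))  # (False, 4094)
print(count_attempts(["a"], text))       # (False, 11)
```

With the duplicate, the number of attempts doubles with every additional "a"; after deduplication it grows linearly.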

Describe alternatives you've considered

A more complete solution would be a DFA-based approach that determines whether a given alternative is a subset of the union of all previous alternatives in the alternation. However, this is significantly more computationally expensive and does not support irregular features such as assertions.
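The subset idea can be sketched without building a DFA if each alternative's language is small enough to enumerate. The Python sketch below swaps the DFA for plain finite sets of strings, so it only illustrates the subset check itself; `remove_subsumed` and the hand-written language sets are assumptions made for the example.

```python
def remove_subsumed(alternatives):
    """Drop alternatives whose language is covered by earlier alternatives.

    `alternatives` is a list of (source, language) pairs, where `language`
    is the finite set of strings that the alternative can match.
    """
    covered = set()  # union of the languages of all kept alternatives
    kept = []
    for source, language in alternatives:
        if language <= covered:
            # Everything this alternative matches is matched earlier.
            continue
        covered |= language
        kept.append(source)
    return kept


alternatives = [
    ("['a'-'c']", {"a", "b", "c"}),
    ("'b'", {"b"}),  # subset of the character set above
    ("'d'", {"d"}),
]
print(remove_subsumed(alternatives))  # ["['a'-'c']", "'d'"]
```

A DFA-based implementation would replace the set comparison with a language-inclusion check, which is what makes it more expensive.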

Additional context

The regexp/no-dupe-disjunctions rule of eslint-plugin-regexp is an implementation of this optimization. This rule uses the DFA approach along with a simpler regex-comparison approach to determine duplicate alternatives.

lppedd commented 1 year ago

We need to make a list of all these optimizations. Even if they don't get into the compiler immediately, they can be supported in the IDE as inspections.

Aloso commented 1 year ago

@lppedd while an inspection would be nice, it should also exist as an optimization, which works in the following situation:

let x = 'foo' | 'bar' | 'baz';
let y = 'bar' | 'quux';

x | y

The output is foo|bar|baz|bar|quux, but should be optimized to foo|bar|baz|quux.
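A rough Python sketch of how this case could be handled: nested alternations that come from variables are flattened first, then duplicates are dropped. It assumes an alternation is modelled as a list and a literal as a string without regex metacharacters (so no escaping is needed); `flatten` and `compile_alternation` are hypothetical names, not Pomsky's actual compiler functions.

```python
def flatten(node):
    """Flatten nested alternations (lists) into a flat list of literals."""
    if isinstance(node, str):
        return [node]
    flat = []
    for child in node:
        flat.extend(flatten(child))
    return flat


def compile_alternation(node):
    """Emit a regex alternation with duplicate literals removed."""
    # dict.fromkeys keeps the first occurrence of each literal, in order.
    unique = list(dict.fromkeys(flatten(node)))
    return "|".join(unique)


x = ["foo", "bar", "baz"]
y = ["bar", "quux"]
print(compile_alternation([x, y]))  # foo|bar|baz|quux
```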