pemistahl / grex

A command-line tool and Rust library with Python bindings for generating regular expressions from user-provided test cases
https://pemistahl.github.io/grex-js/
Apache License 2.0
7.27k stars 173 forks source link

Problems to consider when making anchors optional #31

Closed mathiasbynens closed 3 years ago

mathiasbynens commented 3 years ago

It seems grex inherited this bug from regexgen: https://github.com/devongovett/regexgen/issues/31

Repro:

$ cat input
AGBHD
EIBCD
EGBCD
FBJBF
AGBH
EIBC
EGBC
EBC
FBC
CD
F
C
ABCD
EBCD
FBCD

$ # note the last entry to be matched, i.e. "FBCD"

$ grex --file input
^(?:F(?:BJBF)?|(?:E(?:[GI])?BC|(?:FB)?C)D?|A(?:GBHD?|BCD))$

After removing ^ and $ (see #30), this generated pattern does not match "FBCD" despite it being one of the input strings:

'FBCD'.match(/(?:F(?:BJBF)?|(?:E(?:[GI])?BC|(?:FB)?C)D?|A(?:GBHD?|BCD))/g);
// → ['F', 'CD']

Here’s what I think the bug is: within the generated pattern, it should never happen that something on the left matches a prefix of something that's further on the right, because then the latter can never match.

See https://github.com/devongovett/regexgen/issues/31#issuecomment-801380409 for some more details.

dubzzz commented 3 years ago

I think I have a commit (need to be adapted for grex) that could help to detect those kind of issues earlier: https://github.com/dubzzz/regexgen/commit/b0029972dcb45ca24f82d16dfeede6e942ff4278#diff-a561630bb56b82342bc66697aee2ad96efddcbc9d150665abd6fb7ecb7c0ab2fR140-R157

The test:

For any input passed to regexgen (or grex in that case) I expect the produced regex to be able to match any of the string in input (even if prefixed/suffixed by something else)

It might be a cool addition for the property based test suite 🤔

pemistahl commented 3 years ago

Hi @mathiasbynens and @dubzzz, thank you for making me aware of this "problem". If I decide to make the anchors optional in some later release, I will revisit this issue here again.

pemistahl commented 3 years ago

@mathiasbynens While reviewing pull request #43, I've been dealing with this issue here again. I think this problem can be easily solved by sorting the test cases by length in descending order before doing any further processing. This way, test cases that happen to be prefixes of other test cases appear at the end of the generated regex and not at the start. In my opinion, it is not the DFA minimization algorithm that needs to be adjusted.

Do you have any further thoughts?

mathiasbynens commented 3 years ago

@pemistahl That's exactly how I've been working around the issue in my own code: sorting the list before passing it to grex. It would be amazing to not have to do that anymore :)

pemistahl commented 3 years ago

The test cases are already sorted by grex but in the wrong order. Alright, then I will fix this in release 1.3.0. Thanks for your feedback.

pemistahl commented 3 years ago

@mathiasbynens I have not been able so far to successfully formulate a sorting algorithm that always sorts the alternations in a regex without anchors in the correct order. For the time being, I've disabled DFA minimization for when both start and end anchors are disabled. The minimization algorithm is not the problem, however. It is somewhere in the business logic constructing the AST from the DFA. If I don't find a solution to this problem, I will release the optional anchors feature as it currently is. At least, the resulting regex is guaranteed to be correct.