pemistahl / grex

A command-line tool and Rust library with Python bindings for generating regular expressions from user-provided test cases
https://pemistahl.github.io/grex-js/
Apache License 2.0
7.06k stars 170 forks

Non optimal generated regexp #249

Open Zabrane opened 1 month ago

Zabrane commented 1 month ago

Hi @pemistahl and many thanks for this great piece of software.

I'd like to report a little issue which I'm sure can easily be fixed.

$ grex --version
grex 1.4.5

$ cat bots.txt
baiduspider
bingbot
duckduckgo
googlebot
yandexbot

$ grex --no-anchors -c -i -f bots.txt
(?i)(?:baiduspider|duckduckgo|(?:google|bing)bot|yandexbot)

This is what I was expecting to get:

$ grex --no-anchors -c -i -f bots.txt
(?i)(?:baiduspider|duckduckgo|(?:google|bing|yandex)bot)

yandexbot shares the same suffix bot with googlebot and bingbot.
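To make the comparison concrete, here is a minimal sketch using Python's standard `re` module (not grex itself) that checks that the generated and the expected patterns accept the same five test cases. The names `BOTS`, `ACTUAL`, `EXPECTED` and `matches_all` are illustrative only; the patterns are copied from the outputs above.

```python
import re

# Test cases from bots.txt above.
BOTS = ["baiduspider", "bingbot", "duckduckgo", "googlebot", "yandexbot"]

# Pattern currently produced by grex 1.4.5 (copied from the output above).
ACTUAL = r"(?i)(?:baiduspider|duckduckgo|(?:google|bing)bot|yandexbot)"
# Pattern the report expects, with the shared "bot" suffix factored out once.
EXPECTED = r"(?i)(?:baiduspider|duckduckgo|(?:google|bing|yandex)bot)"


def matches_all(pattern, words):
    """Return True if the pattern fully matches every word (illustrative helper)."""
    compiled = re.compile(pattern)
    return all(compiled.fullmatch(word) is not None for word in words)


# Both patterns accept all test cases, so they differ in length, not in behavior.
print(matches_all(ACTUAL, BOTS), matches_all(EXPECTED, BOTS))
print(len(ACTUAL), len(EXPECTED))
```

Both patterns match every test case; the expected one is simply shorter because `bot` appears only once.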

Interestingly, when testing with a reduced list of bots that all share the same suffix, the suffix bot is found, but a non-optimal regex is still returned:

$ cat bots.txt
bingbot
googlebot
yandexbot

$ grex --no-anchors -c -i -f bots.txt
(?i)(?:(?:google|bing)|yandex)bot

This is what I was expecting to get:

$ grex --no-anchors -c -i -f bots.txt
(?i)((?:google|bing|yandex)bot)
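For illustration only, the kind of suffix factoring I have in mind can be sketched in a few lines of Python. This is a naive approach and not how grex works internally; `factor_common_suffix` is a hypothetical helper, and it only handles the simple case where every word shares one suffix.

```python
import os
import re


def factor_common_suffix(words):
    """Hypothetical helper: pull the longest shared suffix out of an alternation."""
    # The longest common suffix is the longest common prefix of the reversed words.
    suffix = os.path.commonprefix([word[::-1] for word in words])[::-1]
    # Strip the suffix from each word and escape what remains for use in a regex.
    stems = [re.escape(word[: len(word) - len(suffix)]) for word in words]
    return "(?:" + "|".join(stems) + ")" + re.escape(suffix)


pattern = "(?i)" + factor_common_suffix(["bingbot", "googlebot", "yandexbot"])
print(pattern)  # (?i)(?:bing|google|yandex)bot
```

The result is a single flat alternation followed by the suffix, instead of the nested `(?:(?:google|bing)|yandex)bot`.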

Many thanks