rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0
3.52k stars 439 forks source link

Regular expression derivatives for DFA construction #603

Closed windelbouwman closed 5 years ago

windelbouwman commented 5 years ago

Hi all,

I'm not sure what algorithm you used in this regex crate for DFA construction, but you might be interested in this paper: https://www.ccs.neu.edu/home/turon/re-deriv.pdf

The basic idea is to use regular expression derivatives to construct a DFA with a minimum number of states. The algorithm better than the most commonly used ones, but seems to be forgotten about. There is an implementation of this written in python: https://github.com/michaelpaddon/epsilon

Would it be possible to implement several different algorithms next to each other and make it configurable which one to use?

Regards, Windel

BurntSushi commented 5 years ago

I'm well aware of regex derivatives and have read that paper several times.

If you want to use regex derivatives, then I suggest you take on the research task of figuring out how to productionize it. You claim it is "better" than the most commonly used ones, but what practical experience do you have backing that up?

Regex derivatives are a nice theoretical result, and they might even be useful in certain limited circumstances where you absolutely need the full DFA. But regex engines aren't just about reporting whether a match exists or not.

but seems to be forgotten about

Trust me, among folks writing regex engines, it is not forgotten about.

I apologize if my tone appears harsh, but your issue request is basically, "can someone spend months of their time researching and productionizing this alternative regex construction algorithm even though its benefits over the existing approach aren't clear?"

I'm not sure what algorithm you used in this regex crate for DFA construction

It uses a lazy DFA, which protects against exponential state blow up. Something that regex derivatives cannot do (or at least, nobody has shown how).

windelbouwman commented 5 years ago

I apologize if my tone appears harsh, but your issue request is basically, "can someone spend months of their time researching and productionizing this alternative regex construction algorithm even though its benefits over the existing approach aren't clear?"

No problem! My request was not for someone to spend months of their time on this, I just wanted to check if this algorithm was considered / known. It is good to know that it was considered, thank you for your answer!