sbosnick / luther

Luther is an embedded lexer generator for stable Rust.
Apache License 2.0
5 stars 2 forks source link

Re-write the Re-Dfa Implementation #3

Open sbosnick opened 6 years ago

sbosnick commented 6 years ago

luther-derive currently uses the redfa crate, but that crate doesn't support (and can't easily be extended to support) the character sets extension in section 4.2 of Scott Owens et al. Regular Expression Derivatives Reexamined. The character sets extension is necessary for proper Unicode support which in turn is necessary for moving toward feature parity with the regex crate regular expressions.

The change proposed here would implement a luther-redfa crate that is based around the three extensions in section 4 of Owens et al. (maintaining regular expression in the weaker equivalence canonical form, the characters sets mentioned above, and regular vectors).

The core of the luther-redfa will be the dfa construction algorithm in figure 3 of Owens et al.

The implementation plan for this new feature is:

ids1024 commented 5 years ago

is necessary for moving toward feature parity with the regex crate regular expressions

Have you thought of wrapping the regex-automata crate? The developer is also the maintainer of the regex crate, and it should have a decent feature set as far as the regex syntax it handles (such as optional Unicode support).

sbosnick commented 5 years ago

Thank you for drawing my attention to he regex-automata crate. I was not aware of it. That crate hadn't been started when I started to implement the the luther-redfa crate, though Andrew Gallant has progressed faster with the former than I have with the latter.

Both regex-automata and luther-redfa rely on the regex-syntax crate for parsing regular expressions (which is also the basis for regular expression parsing in the regex crate. This common parser will mean that all three crates mostly accept the same regular expressions.

The reason why I can't switch to regex-automata at this point is because that crate doesn't support regex sets (see BurntSushi/regex-automata#4) which is a required feature for luther-derive.

luther-redfa is currently blocked because it has horrendous performance. I had worked on this performance issue for a period of time without getting to a point of sufficient improvement before I moved on to another project. I think I now know how to move the performance needle from horrendous to just being terrible (that is, I know how to get some improvement in the performance but I don't know if it will be enough).

If regex-automata gets support for regex sets and if I am not able to get adequate performance from luther-redfa then I will investigate switching to regex-automata. Even if I do get adequate performance from luther-redfa it may still be worth looking at an implementation that relies on regex-automata because there may be different trade-offs at play (regex-automata and luther-redfa are based on different algorithms for generating a DFA).

sbosnick commented 4 years ago

I have a new blog post that reviews the theory behind the improvements I am working on to the performance of luther-redfa. The post is found here.

I am still working on turning this theory into practice.