Overhaul Parse.rs - Githubissues

rbk2kb commented 10 years ago

Issue for discussion on improvements to and refactoring of the parser. Current associated tasks:

[x] Clean up _parse_recursive() by factoring out parsing into methods
[x] Add support for parsing unicode character classes
[x] Add support for parsing ASCII character classes
[x] Adding parsing support for additional escape sequences (\A, \v, \z, etc.)
[x] Add support for flags in groups

princjef commented 10 years ago

Added basic unicode character class parsing. Single letter general property values and the Cs general property value are not handled yet

rbk2kb commented 10 years ago

_parse_recursive() is much cleaner now. I factored out the parsing of the repetition operators (*, +, and ?) and explicitly bounded repetitions ({a, b}, {a,}, etc.) into there own methods and renamed them to make their purpose more obvious.

princjef commented 10 years ago

Added support for ascii character classes, both on their own and when nested inside of a normal character class. Nested Unicode character classes do not appear to be supported at this time but it seems like a quick fix. Famous last words...

princjef commented 10 years ago

All escape sequences that appear inside of a character class are actually interpreted as escape sequences now.

There is still an issue with propagating negation of a character class down to its internal escape sequences but I'm going to punt on that until Richard finishes implementing the other escape sequences. I'm thinking something along the lines of either passing in a negation bool (see parse_unicode_charclass()) or adding another value to the Expr enum like Negation(~Expr) that would wrap the rest of the instructions at the compile phase and automatically negate them.

Any thoughts?

rbk2kb commented 10 years ago

So, I have two digit hex escapes (e.g. \x3f) implemented, but as it turns out the other hex syntax for six digit escapes (e.g. \x{ffffff}) seems to be specifying a UTF-8 encoding, which greatly complicates things. UTF-8 characters need to follow a specific sequence of bytes or they are invalid. Our favorite series of regular expression articles talks about it here under Stage 3: Compile and gives the finite automata purportedly used by RE2. Parsing these escapes is mostly redundant with unicode, so I might put it off for now and deal with it later.

rbk2kb commented 10 years ago

Nah, I'm just going to work on it now. I am also going to blindly assume that the given DFA is correct.

rbk2kb commented 10 years ago

Never mind! The Rust devs do love me and included a UTF-8 decoder in std::str

rbk2kb commented 10 years ago

All the additional escapes are parsed now except for \C, which is supposed to parse as any single byte no matter the encoding. That escape is very problematic because the VM's main loop keeps one central stack pointer for all the threads currently executing. The stack pointer is incremented by the length of each character and that value can be more than a single byte because strings are UTF-8 encoded in Rust. By allowing a single byte to be matched we create the opportunity for one thread to be out of sync with all of the others. The only solution I see to this is to have each thread store its own stack pointer and increment it depending on the current instruction, which will involve a lot of changes to exec.rs.

princjef commented 10 years ago

Let's punt on that and open up a separate issue for it

rbk2kb commented 10 years ago

Agreed

princjef commented 10 years ago

The case insensitive flag is an issue because of Unicode. Some characters will case fold to two or more characters in the opposite case, which complicates things mightily. The to_uppercase() and to_lowercase() functions in std::char only implement simple case folding (one character maps to one character). I'm inclined to only support this type of case folding until the standard library integrates support for the more complicated form of case folding. I don't think it's worth reinventing the wheel on this one.

Worth noting though that re2 does support full case folding so this would be a departure from the behavior of that engine.

princjef commented 10 years ago

As I scour through the re2 implementations of other things, I can't see any case folding mappings that map one character to multiple characters so as far as I can see, re2 actually doesn't implement full case folding. Definitely not going to worry about that this time around

princjef / regex-rust

Overhaul Parse.rs #2