rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Add a regular expressions library to the distribution #3591

Closed: brson closed this 10 years ago

brson commented 11 years ago

The exact engine we use needs further discussion. Probably we don't want to use yarr because it requires the nitro jit, which is a big dependency and could be undesirable for various reasons. Ideally, we would have a nice syntax extension that precompiles the regexes (in addition to runtime compilation options).

eholk commented 11 years ago

What about writing a regular expression engine in Rust?

brson commented 11 years ago

That could be fine.

erickt commented 11 years ago

For anyone who stumbles across this issue looking for a regex library, we do have a pcre binding in cargo.

killerswan commented 11 years ago

@erickt is that related to @uasi's https://github.com/uasi/rust-pcre ?

erickt commented 11 years ago

@killerswan neat! that seems to be an independent implementation. I'll ping @uasi to merge our code.

graydon commented 11 years ago

nominating for feature-complete

graydon commented 11 years ago

(also, as I mention elsewhere but apparently not in this bug, I prefer re2 for this)

catamorphism commented 11 years ago

(bug triage) Milestone is correct; could be a good medium-sized starter project for somebody.

nickdesaulniers commented 11 years ago

I'm still interested in this. I've been working on a cre2 binding. Not 100% sure how to handle functions that expect C enums.

ktt3ja commented 10 years ago

I would like to claim this issue to do as a project for my university class.

nickdesaulniers commented 10 years ago

@ktt3ja , I haven't been working on this.

glennsl commented 10 years ago

I've been working on this, but it's pretty slow going. See https://github.com/glennsl/rust-re

It has a good set of features already, but it's currently pretty naively implemented, horribly slow, and not very nice to look at. I'm trying to learn Rust as I go, and am currently trying to figure out how to optimize it without wandering into unsafe territory.

Feel free to fork it or implement something completely separate. A different perspective would just be good as far as I'm concerned.

huonw commented 10 years ago

There is a wiki page summarising some research about this: https://github.com/mozilla/rust/wiki/Lib-re

ktt3ja commented 10 years ago

Nvm, I'll be working on something else for my project.

ferristseng commented 10 years ago

I've been working on this for my class. The project is here https://github.com/ferristseng/regex-rust. I've been trying to add the features outlined in the ECMA-262 standard. If anyone has any suggestions or would like to help, let me know (this is my first time working on a regular expression engine).

huonw commented 10 years ago

cc @ssbr

ssbr commented 10 years ago

Evidently there's some duplication of work going on, even if it is early work for everyone. @ferristseng want to join up? I can contribute what I suspect is a more rigorous attempt at a parser (I basically copied the ECMA spec to the letter, with one deviation, so it should be correct -- I did skip a few bits though, and replaced them with empty failing functions), and a large set of tests (150MB worth, I am trying to shrink it down and make them runnable) I've taken out of Firefox. Unfortunately my free time is more limited than I'd like, so I have yet to get around to a good set of runtime implementations. That's the exciting bit I've been trying to get to once all the other pieces are ready.

OTOH I suspect that since this is for school you can't have any outside contributors, which would be disappointing.

ferristseng commented 10 years ago

Yeah I would definitely like to join up. The class is finished now, so any outside contributions would be welcome. I'm not sure of the quality of the code though, but hopefully there is some useful stuff in there.

glennsl commented 10 years ago

Very cool, @ferristseng! As a start, you might be interested in the test suite I've adapted from Python (see https://github.com/glennsl/rust-re/blob/master/tests.rs). It should be pretty easy to adapt to your own code.

I've been quite busy this past month, but I'm at a point now where I believe I have a fairly good understanding of the issues involved, regarding both performance and functionality. And I'm looking forward to having a bit more time during the holidays to work on this. I would really like to discuss our different approaches and ideas first, though, perhaps via e-mail? My e-mail is firstname at lastname dot net, and my full name is Glenn Slotte.

There is one issue that needs to be addressed in a larger context, however: Unicode. The readme on my repository roughly outlines what's needed for the different levels of Unicode TR18 compliance (there's also a third level of compliance, but that's special interest territory). Normalization and case folding, I think, are the two most important points that need to be addressed and implemented separately (#9084 addresses the case folding).

Kimundi commented 10 years ago

@glennsl Note that we already have NFD and NFKD normalization in the stdlib.

ssbr commented 10 years ago

> I've been quite busy this past month, but I'm at a point now where I believe I have a fairly good understanding of the issues involved, regarding both performance and functionality. And I'm looking forward to having a bit more time during the holidays to work on this. I would really like to discuss our different approaches and ideas first, though, perhaps via e-mail? My e-mail is firstname at lastname dot net, and my full name is Glenn Slotte.

Could this discussion be held publicly? I am interested in it.

glennsl commented 10 years ago

@Kimundi Ah, I've missed that, but that should do. Thanks.

@ssbr Yeah I figured you'd want to be involved, and was thinking you, me and @ferristseng could sort out the implementation details and coordinate without bothering everyone else. Sorry for not being very clear on that.

thestinger commented 10 years ago

I don't really like the focus on replicating ECMA-262 because JavaScript has awful Unicode support, especially in regular expressions. It doesn't even have basic Unicode character class support - it's entirely anglo-centric.

eddyb commented 10 years ago

@thestinger the Unicode support is fixed in ES6 (with the u flag), maybe that should be the baseline.

lambda-fairy commented 10 years ago

Are you guys still working on this? I've been hacking on an implementation of my own (https://github.com/lfairy/rose). At this stage, it's mostly a cleaned-up version of @glennsl's library.

Pike VM is good. It isn't terribly fast, but it's simple and predictable, which is what we want in the stdlib.

On Unicode support: we really don't need to handle normalization -- no other popular engine implements that. Input can be normalized outside the library anyway.

In addition, neither Python nor grep perform full case folding (Unicode Level 2). AFAIK, only RE2 and PCRE have this feature, the latter very recently (http://bugs.exim.org/show_bug.cgi?id=1208).

Taking this into account, I think we should aim for Level 1 compliance only.
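For readers who haven't seen the technique: a Pike-style VM fits in a few dozen lines. The sketch below is a hand-compiled toy (no captures, no character classes, no Unicode handling) and is not taken from any of the libraries linked in this thread; the instruction names follow the usual Char/Split/Jmp/Match convention.

```rust
// A toy Pike-style VM: breadth-first simulation of an NFA program,
// advancing every live thread in lock-step, one input char at a time.
#[derive(Clone, Copy)]
enum Inst {
    Char(char),          // consume one specific character
    Split(usize, usize), // fork the current thread into two
    Jmp(usize),          // unconditional jump
    Match,               // accept if still live at end of input
}

// Follow epsilon transitions (Split/Jmp), deduplicating so the active
// set is bounded by the program length -- the key to linear running time.
fn add_thread(prog: &[Inst], pc: usize, list: &mut Vec<usize>, seen: &mut [bool]) {
    if seen[pc] {
        return;
    }
    seen[pc] = true;
    match prog[pc] {
        Inst::Jmp(t) => add_thread(prog, t, list, seen),
        Inst::Split(a, b) => {
            add_thread(prog, a, list, seen);
            add_thread(prog, b, list, seen);
        }
        _ => list.push(pc),
    }
}

fn is_match(prog: &[Inst], input: &str) -> bool {
    let mut clist = Vec::new();
    let mut seen = vec![false; prog.len()];
    add_thread(prog, 0, &mut clist, &mut seen);
    for c in input.chars() {
        let mut nlist = Vec::new();
        let mut nseen = vec![false; prog.len()];
        for &pc in &clist {
            if let Inst::Char(want) = prog[pc] {
                if want == c {
                    add_thread(prog, pc + 1, &mut nlist, &mut nseen);
                }
            }
        }
        clist = nlist;
    }
    clist.iter().any(|&pc| matches!(prog[pc], Inst::Match))
}

fn main() {
    use Inst::*;
    // Hand-compiled program for the regex `a(b|c)*d` (full match):
    let prog = [
        Char('a'),   // 0
        Split(2, 7), // 1: enter the (b|c)* loop, or skip ahead to 'd'
        Split(3, 5), // 2: b | c
        Char('b'),   // 3
        Jmp(6),      // 4
        Char('c'),   // 5
        Jmp(1),      // 6: back to the loop head
        Char('d'),   // 7
        Match,       // 8
    ];
    assert!(is_match(&prog, "ad"));
    assert!(is_match(&prog, "abcbd"));
    assert!(!is_match(&prog, "abx"));
}
```

Because the thread list is deduplicated against the program length, the running time is O(program length × input length) for any regex, which is exactly the "simple and predictable" property argued for above.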

ferristseng commented 10 years ago

I've been working on cleaning up my code, and adding more features. The code is probably more confusing than it should be.

Something that arose in our discussion was whether or not we should support features like backreferences and assertions. It might be a good idea to support them, whether by falling back to a recursive backtracking algorithm or by finding some other way. The downside is that, depending on what we choose, it will certainly add some complexity and overhead to the code.

lambda-fairy commented 10 years ago

@ferristseng I had a quick look over your code -- I agree that it could do with some cleaning up ;)

As a first step, I suggest making CharClass immutable, and moving the optimization logic from negate into the constructor. The API might look like this:

impl CharClass {
    pub fn new(ranges: ~[(char, char)]) -> CharClass;
    pub fn negate(&self) -> CharClass;
    // ...
}

Backreferences obviously require backtracking, but I think you can do forward (lookahead) assertions using Pike VM. Split a separate thread for every assertion; the overall expression matches iff any non-assertion thread matches and all assertion threads match. You can implement this in the VM by adding an AssertionSplit(Position, Position) instruction and keeping separate vectors for assertion and non-assertion threads.

By the way, my email is lambda dot fairy at gmail dot com. I'd love to join in the discussion.

glennsl commented 10 years ago

I have no problem with going for Unicode level 1 to start with, but would like to see level 2 as a long term goal.

Regarding generalized assertions, most implementations have severe limitations on what they support, mostly in lookbehinds, because of the difficulty of implementing them efficiently. It is a problem Russ Cox has pondered for years without finding a solution. I still think it's worthwhile to try to solve, if only because one would learn a lot about the problem. But if an efficient solution can't be found, I vote for not including the feature, at least not by default.

Reliable performance is a more important feature IMO. It seems more in line with the spirit of Rust. But just as Rust has optional GC, I think it would be a good long term goal to be able to opt-in to a feature set with different trade-offs.

I plan on reading through the ES6 draft in the next few days. I don't know of anything else that could serve as a better baseline, but strict compliance was never my intention. I think the expectations that come with being a Rust library should trump the baseline.

thestinger commented 10 years ago

I don't think it makes sense to go beyond what re2 offers. A simple grammar with reliable performance is a far better fit for Rust and is in line with other choices like using SipHash for hash tables.

I'm worried about the focus on ES6 though because JavaScript has bottom of the barrel Unicode support... If you want to know how not to handle strings, JavaScript, Qt and Java are good places to look.

glennsl commented 10 years ago

Yep, the ES6 draft is unfortunately severely lacking in its Unicode support. It seems to add nothing more than translation from code units to code points and simple case folding. Word boundaries and character classes still completely ignore Unicode.

ssbr commented 10 years ago

> Backreferences obviously require backtracking, but I think you can do forward (lookahead) assertions using Pike VM. Split a separate thread for every assertion; the overall expression matches iff any non-assertion thread matches and all assertion threads match. You can implement this in the VM by adding an AssertionSplit(Position, Position) instruction and keeping separate vectors for assertion and non-assertion threads.

It's so deceptively simple-sounding. Assertions break concatenation: AB no longer means "a match for A, followed by a match for B". The Pike VM is built entirely on that principle, so you have to gut it to make it work with assertions. The gutting didn't go so well for me, personally.

For example, the threads of the top level NFA might remain distinct even when the threads of the assertions are merged, with the result that the merged assertion thread only has the correct save state for one of the distinct threads -- if the other thread matches instead, the assertion save state is wrong. The whole point of keeping a thread list is to merge threads, so this is an essential problem with keeping a single thread list for all runs of an assertion.

e.g. consider the behavior of ^.*(?=(.).*)...$ on the string "abcd" in this proposed model. The correct matched position is 'b', but this approach matches 'a'. (I've tried to make this regexp fail on as many variations of the idea as possible, but maybe I misdid it, sorry.)

That doesn't invalidate the idea though. There are at most some finite number of distinct threads, M, so while you can't run 1 thread list for all assertion runs, you can run M thread lists -- one assertion thread list per actual live thread. When you merge threads, you only keep the thread list of the thread whose saves you keep. (Note: assertions can start assertions, so this is recursive and exponential in the size of the regexp, but what else is new?)

The only other problem I've noticed (and remember) is that you can't actually ask "all assertions match". There's the | operator, after all. What threads must match is determined by a formula that's built up as you run the regular expression, since you can have things like ^(((?=.*a)|(?=.*b)).)*$ -- here we have basically an arbitrarily large CNF formula of constraints on the string that has as many clauses ANDed together as there are characters in the string. You can definitely keep the formula to a bounded size because you can still merge threads, so there's only finitely many variables (a bit hand wavy, since I have never done this), but... at this point there's a lot of work and a lot of overhead. It doesn't seem attractive anymore.

Other than those problems, maybe it's fine and works. I am not confident I've picked up everything. It's no longer an intuitive algorithm, and for me, when I ran into trouble, it took a lot of effort to mangle the problem into something that'd fit in my head. I'm skeptical that the Pike VM is a good choice for making assertions work. Regexp derivatives seemed more promising, except that they are inherently incompatible with ordered traversal. I don't know a good alternative.

That said, in my experience using RE2, the lack of assertions is really annoying, so I'd want either crippled assertions that don't require much work to implement (e.g. assertions at the top level of a regexp, a special case of: ), and/or (my preference) a real logical "and" operator, which should probably be easy to implement efficiently.

P.S. I thought my email was public from github, but apparently not. jeanpierreda@gmail.com . I'd still prefer a public forum. :(

lambda-fairy commented 10 years ago

Oh, I see now -- thanks for the explanation. I thought that conjunction and lookahead were the same thing, when obviously they are not.

So, to summarize: the difference between conjunction (easy) and lookahead (hard) is that the former requires both branches to match the same amount of input, whereas with the latter, one thread can race ahead of the other. Because the Pike VM always runs threads in lock-step, only conjunction can be implemented without messing up the algorithm.

Is this correct?
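The lock-step view of conjunction can be made concrete with the classic product construction: run both automata over the same input, one character per step, and accept iff both accept. This is an illustrative sketch over two hand-written DFAs (the example languages are made up), not a proposal for any engine discussed here.

```rust
// A DFA as a transition function plus an accepting-state test.
struct Dfa {
    start: usize,
    step: fn(usize, char) -> usize,
    accept: fn(usize) -> bool,
}

// DFA 1: strings with an even number of 'a's.
fn step_even(s: usize, c: char) -> usize {
    if c == 'a' { 1 - s } else { s }
}
fn acc_even(s: usize) -> bool { s == 0 }

// DFA 2: strings ending in 'b'.
fn step_ends_b(_: usize, c: char) -> usize {
    (c == 'b') as usize
}
fn acc_ends_b(s: usize) -> bool { s == 1 }

// Conjunction: because every matcher consumes exactly one character per
// step, neither can race ahead of the other -- which is why `A&B` is
// tractable in a lock-step VM where lookahead is not.
fn conj_match(a: &Dfa, b: &Dfa, input: &str) -> bool {
    let (mut sa, mut sb) = (a.start, b.start);
    for c in input.chars() {
        sa = (a.step)(sa, c);
        sb = (b.step)(sb, c);
    }
    (a.accept)(sa) && (b.accept)(sb)
}

fn main() {
    let even_a = Dfa { start: 0, step: step_even, accept: acc_even };
    let ends_b = Dfa { start: 0, step: step_ends_b, accept: acc_ends_b };
    assert!(conj_match(&even_a, &ends_b, "aab"));   // two 'a's, ends in 'b'
    assert!(!conj_match(&even_a, &ends_b, "ab"));   // odd number of 'a's
    assert!(!conj_match(&even_a, &ends_b, "aaba")); // doesn't end in 'b'
}
```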


As an aside, is there any easy way to implement character class negation with full case folding?

For example, under full case folding, the expression

[^ß]

will match any single character unless the string starts with SS, Ss, sS, ss, ß, or ẞ.

As far as I know, the only way to implement that involves lookahead assertions, which as @ssbr pointed out, are difficult.

Since the Unicode standard lists this feature as "optional", I think it's best just to throw an error if the user attempts negation and FCF in the same expression.

glennsl commented 10 years ago

@ssbr Could you elaborate on what you mean by 'a real logical "and" operator'?

I've been thinking about how API design might help solve the problems typically addressed by backreferences and generalized assertions. E.g. having a lazy match iterator would allow users to filter matches, with a second regex if needed, which is exactly what generalized assertions do. And backreferences could be implemented similarly by rejecting a match if capture group Y does not equal capture group X, although that's not as intuitive as I'd like it to be. The major downside to this approach is perhaps that performance will suffer significantly when a large number of matches should be rejected, because of all the back and forth.
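The "filter the match stream" idea can be shown without any regex engine at all, using plain substring search as a stand-in matcher. Here the effect of a negative lookahead like `foo(?!bar)` is recovered by filtering a lazy iterator of candidate matches; everything below is illustrative.

```rust
// Recovering `foo(?!bar)` (match "foo" not followed by "bar") by
// filtering a lazy stream of candidate matches. `match_indices` is a
// stand-in for a regex engine's match iterator.
fn main() {
    let text = "foobar foo foobaz";
    let hits: Vec<usize> = text
        .match_indices("foo")
        // Reject candidates where the lookahead would fail:
        .filter(|&(i, m)| !text[i + m.len()..].starts_with("bar"))
        .map(|(i, _)| i)
        .collect();
    // The "foo" at index 0 is followed by "bar" and is filtered out.
    assert_eq!(hits, vec![7, 11]);
}
```

As the comment above notes, the cost of this approach is that every rejected candidate is a full round trip between the engine and user code.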

glennsl commented 10 years ago

@lfairy Wouldn't normalization fix that?

lambda-fairy commented 10 years ago

> Could you elaborate on what you mean by 'a real logical "and" operator'?

A|B matches the union of the strings represented by A and B.

A&B matches the intersection instead.

> Wouldn't normalization fix that?

Even if you case-fold both the regex and the input string, you're still stuck with (?!ss)., i.e. match all strings that don't start with "ss", then consume one character.
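Rust's standard library has no full case folding, but the one-to-many uppercase mapping of 'ß' shows the same expansion that makes a folded [^ß] awkward; a small sketch:

```rust
fn main() {
    // 'ß' has no single-character uppercase form; it expands to "SS".
    let upper: String = 'ß'.to_uppercase().collect();
    assert_eq!(upper, "SS");
    assert_eq!(upper.chars().count(), 2); // one code point became two

    // The simple (1:1) lowercase mapping leaves it alone:
    let lower: String = 'ß'.to_lowercase().collect();
    assert_eq!(lower, "ß");
}
```

A character class over single code points can't express "any one character, unless the next few characters fold to ss", which is exactly why the lookahead rewrite above is needed.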

glennsl commented 10 years ago

> A&B matches the intersection instead.

Sorry, I still don't understand. How does this relate to assertions?

> Even if you case-fold both the regex and the input string, you're still stuck with (?!ss)., i.e. match all strings that don't start with "ss", then consume one character.

Right, of course. That's a tricky one. It seems like the recommendation is to just keep to simple case folding for character classes.

ferristseng commented 10 years ago

Perhaps I'm understanding this incorrectly, but the intersection, A&B adds functionality similar to assertions. For example, the expression abc&.*d is similar to the expression (?=.*d)abc. Running the former expression on the string abcd, would yield a match of abc for the left hand side, and abcd for the right hand side, the intersection being abc. Hopefully my understanding was correct!

ssbr commented 10 years ago

> So, to summarize: the difference between conjunction (easy) and lookahead (hard) is that the former requires both branches to match the same amount of input, whereas with the latter, one thread can race ahead of the other. Because the Pike VM always runs threads in lock-step, only conjunction can be implemented without messing up the algorithm.

I'm not so sure about the "because" part. I think the unifying theme in the two problems I mentioned is that assertions have a lifetime that can overlap with themselves, in a way that is different from just being under a repetition operator.

> As far as I know, the only way to implement that involves lookahead assertions, which as @ssbr pointed out, are difficult.

Is the number of codepoints read in a fully case folded character class always bounded? I think that should be true for normalized strings. If so, I think that assertions with a constant amount of lookahead can be implemented the usual way (just check ahead, without progressing in the string) without affecting correctness or the time/space complexity. And it's pretty fast, too.

> And backreferences could be implemented similarly by rejecting a match if capture group Y does not equal capture group X, although that's not as intuitive as I'd like it to be.

That's coding in a mechanism for backtracking search, though. At that point, why not introduce backreferences proper? (Potentially with an alternate VM.)

> Perhaps I'm understanding this incorrectly, but the intersection, A&B adds functionality similar to assertions. For example, the expression abc&.*d is similar to the expression (?=.*d)abc. Running the former expression on the string abcd, would yield a match of abc for the left hand side, and abcd for the right hand side, the intersection being abc. Hopefully my understanding was correct!

Nope :( A&B matches if both A and B match the same string. In this case, the left hand side matches "abc", and the right hand side matches anything ending with "d", so they can never match anything at the same time.

The overlap is in a regexp like (?=.*lol)(?=.*wtf), which could also be written as .*lol.*&.*wtf.* (i.e. match any string with both the substrings "lol" and "wtf"), if you disregard the span of the match.

IME at least, a common use of assertions is at the top-level of a regexp, which is often equivalent to a conjunction of regexps. The whole idea of an assertion is, like conjunction, that you want to check if two different regexps match. Each has things they can do that the other can't, though. (This should answer @glennsl too.)

lambda-fairy commented 10 years ago

> Is the number of codepoints read in a fully case folded character class always bounded?

It took a bit of searching, but the answer is "yes":

> The Case_Folding property value is limited so that no string when case folded expands to more than 3× in length (measured in code units).
>
> -- http://www.unicode.org/policies/stability_policy.html
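That bound is easy to sanity-check against the uppercase mapping Rust ships, used here only as a stand-in since the case-folding tables themselves aren't in std:

```rust
fn main() {
    // No scalar value's uppercase mapping exceeds 3 code points,
    // consistent with the Unicode stability policy quoted above.
    let max = (0u32..=0x10FFFF)
        .filter_map(char::from_u32) // skips the surrogate range
        .map(|c| c.to_uppercase().count())
        .max()
        .unwrap();
    assert!(max <= 3);

    // The bound is tight: U+FB03 (ffi ligature) uppercases to "FFI".
    assert_eq!('\u{FB03}'.to_uppercase().count(), 3);
}
```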

glennsl commented 10 years ago

> That's coding in a mechanism for backtracking search, though. At that point, why not introduce backreferences proper? (Potentially with an alternate VM.)

IMO it's simpler, more flexible, lets regexes be proper regular expressions and makes the cost explicit. Not that there aren't benefits to having backreferences and generalized assertions embedded, but not enough I think.

nickdesaulniers commented 10 years ago

Mmm nah nah nah nah: https://github.com/nickdesaulniers/rust-re2

lambda-fairy commented 10 years ago

@nickdesaulniers may be joking, but he has a point. Why don't we just bundle re2?

It's well-written and -maintained, and fits our requirements (predictable, decent Unicode support, fast). It isn't a heavy dependency either -- IIRC it uses a few tricks to keep the library small.

I'm not asking @glennsl, @ferristseng et al. to stop what they're doing -- a pure Rust engine would be great, for reasons we all know already. But this issue has been open for more than a year without progress. We need something that does regexes, and at least for now, re2 can be that something.

thestinger commented 10 years ago

RE2 will pull in a dependency on C++ again.

Statically checked/compiled regular expressions able to be optimized as part of the rest of the Rust code via syntax extensions seem like a much better default. It's a selling point of D, and dynamically compiled regexp isn't going to compete in the majority of cases where it's a static expression.

erickt commented 10 years ago

re2 is great, but do we need it (or pcre or any other regex engine) in rust proper? The main advantage I see of having a regex engine in rust proper would be for statically compiling the expression, but assuming #11151 lands, even that won't be necessary.

pnkfelix commented 10 years ago

assigning P-low.

thestinger commented 10 years ago

It's no longer necessary for this to be in the standard library to support static compilation of regular expressions. As this is obviously something with many different approaches, could we close this issue?

alexcrichton commented 10 years ago

This issue seems to be properly categorized, and I think that this is a good issue for tracking official support of a particular regular expression library. For now I think this is fine to leave open, although the title will need to change as libextra is probably no longer the target for such a library (but rather as its own crate).

mneumann commented 10 years ago

@thestinger: I think every good language needs to ship with a standard regexp engine, and so should Rust. It's also a strong argument for folks coming from scripting languages like Ruby. If they see the language has no out-of-the-box support for regexps, they will probably be very disappointed. Regexes are just too useful for Rust not to ship with them.

huonw commented 10 years ago

(Changed the title to reflect extra no longer being the target.)

Valloric commented 10 years ago

For Rust to be taken seriously by outsiders, it needs to have regex support in the standard library. Every modern language has it. Even C++ has it now in C++11. Regexes are a core building block for many programs... not having this in std is more than a bit embarrassing, IMO.