rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0

Provide a macro for easier matching #111

Closed mdinger closed 8 years ago

mdinger commented 9 years ago

It'd be cool if Regex provided some type of macro so these two are equivalent:

use regex::Regex;

let text = "I categorically deny having triskaidekaphobia.";
// Each branch compiles its own regex and scans `text` separately.
let chars = if Regex::new(r"\b\w{13}\b").unwrap().is_match(text) { 13 }
    else if Regex::new(r"\b\w{4}\b").unwrap().is_match(text) { 4 }
    else { 0 };

let chars = regex_match!("I categorically deny having triskaidekaphobia." {
    r"\b\w{13}\b"? => 13,
    //           ^ Could this be an explicit unwrap?
    r"\b\w{4}\b"? => 4,
    _ => 0,
});

A fancier version could even make the captures from the pattern on the left available in the expression on the right, but I think that might be much more complicated. I don't know that the ? (standing in for unwrap()) needs to be there, but I put it in in case someone felt that leaving it out was too implicit.
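To make the proposal concrete, here is a minimal sketch of how such a macro might expand into an if/else chain. Everything in it is an assumption for illustration: the macro name `first_match!`, the extra matcher argument, and the `substring_matcher` stand-in (plain substring search, used so the example needs no external crate) are hypothetical and not part of regex. The explicit `?` marker from the proposal is left out to keep the macro grammar simple.

```rust
/// Stand-in matcher (an assumption, to avoid external crates): treats the
/// "pattern" as a plain substring. With the regex crate, a function wrapping
/// `Regex::new(pattern).unwrap().is_match(text)` could be passed instead.
fn substring_matcher(pattern: &str, text: &str) -> bool {
    text.contains(pattern)
}

/// Hypothetical macro: tries each arm from top to bottom and evaluates to the
/// value of the first arm whose pattern matches, or to the `_` arm otherwise.
macro_rules! first_match {
    ($matcher:expr, $text:expr, {
        $($pat:literal => $val:expr,)*
        _ => $default:expr $(,)?
    }) => {{
        let matcher = $matcher;
        let text: &str = $text;
        // Expands to: if .. { v1 } else if .. { v2 } else { default }
        $( if matcher($pat, text) { $val } else )* { $default }
    }};
}

/// Mirrors the example from the issue, with substrings in place of regexes.
fn classify(text: &str) -> u32 {
    first_match!(substring_matcher, text, {
        "triskaidekaphobia" => 13,
        "deny" => 4,
        _ => 0,
    })
}
```

Like the hand-written if/else version, this checks one pattern at a time, so the input may be scanned once per arm.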

BurntSushi commented 9 years ago

I admit that is kind of cool. My initial instinct is: go ahead! Define one and put it up on crates.io.

I do have a couple concerns:

  1. The implicit unwrap is unfortunate, although I don't think it's totally damning. Most uses of dynamic regexes are Regex::new("...").unwrap() because it is generally considered a programmer error to write an invalid regex.
  2. My most pressing concern: this encourages inefficient matching. Instead of constructing one regex to match on the input, you're building two and scanning the input twice. I believe there should exist an API for unioning an arbitrary number of regexes and constructing a machine to match any of them over some input in one scan.

(I suppose it's possible that the macro could be implemented with (2), but I couldn't be sure until the API exists.)

mdinger commented 9 years ago

@BurntSushi You are of course correct about the inefficiency. I would note though that a large regex quickly becomes very unwieldy because of the literal whitespace matching. For example, consider the regex rules for Rust highlighting in ace. They would be very problematic as a single regex even if it worked, and even more so if an 80 or 100 character line limit were enforced somewhere, as it is in the Rust compiler source.

BurntSushi commented 9 years ago

Sorry---I wasn't clear. Sometimes I don't give enough context, so understandably, you had no idea what I was talking about! :P

What I meant was something like this:

let re = Regex::new("abc").unwrap();
let re2 = re.union(Regex::new("def").unwrap());
let re3 = re2.union(Regex::new("ghi").unwrap());

let m = re3.find("xyz def"); // outputs `1` for the second regex?

Something like that, I mean. This is a really important API, e.g., for efficiently doing string replacements across multiple regexes in a single scan.

Your proposed macro could live on top of this API in theory.
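As a rough illustration of the single-scan idea, the sketch below unions several patterns and reports which one matched while walking the input only once. It is an assumption-heavy toy: the `LiteralUnion` type and its methods are hypothetical, it supports literal strings only, and it tracks partial matches as a list of threads rather than compiling one automaton as a real regex union would. The regex crate's `RegexSet` type covers a similar multi-pattern use case; it is not used here so the example stays dependency-free.

```rust
/// Hypothetical union of literal patterns, matched in a single pass.
/// Literal strings only (an assumption to keep the sketch short).
struct LiteralUnion {
    patterns: Vec<Vec<char>>,
}

impl LiteralUnion {
    fn new(pattern: &str) -> Self {
        LiteralUnion { patterns: vec![pattern.chars().collect()] }
    }

    /// Takes `self` by value and returns a new, larger union.
    fn union(mut self, pattern: &str) -> Self {
        self.patterns.push(pattern.chars().collect());
        self
    }

    /// Scans `text` once and returns the index of the pattern whose match
    /// completes first, or `None` if nothing matches. Empty patterns never match.
    fn find(&self, text: &str) -> Option<usize> {
        // Each thread is (pattern index, number of chars matched so far).
        let mut threads: Vec<(usize, usize)> = Vec::new();
        for c in text.chars() {
            // A match may start at any position, so add a fresh thread per pattern.
            threads.extend((0..self.patterns.len()).map(|i| (i, 0)));
            let mut next = Vec::with_capacity(threads.len());
            for &(i, matched) in &threads {
                if self.patterns[i].get(matched) == Some(&c) {
                    if matched + 1 == self.patterns[i].len() {
                        return Some(i); // pattern `i` just completed a match
                    }
                    next.push((i, matched + 1));
                }
            }
            threads = next;
        }
        None
    }
}

/// Mirrors the snippet above: the second pattern (index 1) matches "xyz def".
fn demo() -> Option<usize> {
    let set = LiteralUnion::new("abc").union("def").union("ghi");
    set.find("xyz def")
}
```

A matching macro built on something like this could dispatch on the returned index instead of testing each pattern separately.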

> I would note though that a large regex quickly becomes very unwieldy because of the literal whitespace matching.

Thankfully, this is no longer true. :-) An x flag was added a couple months ago. See the second block here for an example: http://doc.rust-lang.org/regex/regex/index.html#example:-replacement-with-named-capture-groups


I think I would be much more comfortable with seeing this macro defined in an external crate first before exposing it in this crate. That said, I do kind of like the idea.

mdinger commented 9 years ago

Cool! Maybe they'll come together in the future. Having said that, regexes are complicated and confusing enough as it is. I'm not sure whether I like them being modified after the fact. It would also depend on how hard issues are to diagnose. Looks cool though.

The x flag is such a great idea!

BurntSushi commented 9 years ago

They wouldn't be modified. They'd probably be taken by value or cloned and used to construct a new machine. RE2 has a similar API for this, although I hope to design something less clunky.

BurntSushi commented 8 years ago

I think it's possible that some matching macros could be good, but I think a reasonably high standard has to be met before they're part of regex proper. (For example, one is published in a separate crate and becomes widely used.) As such, I'm closing this for now.