rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0

implement regex engine for handling regexes like `^(<match at most one codepoint>){m,n}$` #802

Open demurgos opened 3 years ago

demurgos commented 3 years ago

What version of regex are you using?

1.5.4

Describe the bug at a high level.

I use regexes to validate string inputs. Usually the strings are fairly small and there are no issues. Today I wanted to accept any text as long as it is at most 10000 Unicode codepoints long. I expected the following regex to work: `^(?s:.){0,10000}$`.

This triggered a `CompiledTooBig(10485760)` error instead.

What are the steps to reproduce the behavior?

Code:

fn main() {
    let re = regex::Regex::new(r"^(?s:.){0,10000}$").unwrap();
    dbg!(re);
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=bf2062cc98fc1a2c2afe61c88ea0cc86

What is the actual behavior?

Output:

   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 3.51s
     Running `target/debug/playground`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: CompiledTooBig(10485760)', src/main.rs:2:54
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

What is the expected behavior?

The regex should compile and be reasonably small. It looks like the memory requirements grow very large with the maximum string length checked by this regex.

BurntSushi commented 3 years ago

I'm not quite sure why you're expecting something different here. The error message is pretty clear: the compiled regex is too big. You are free to increase the limit though: https://docs.rs/regex/1.5.4/regex/struct.RegexBuilder.html#method.size_limit
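For reference, bumping the limit via the builder looks roughly like this (the 64 MiB value is an arbitrary illustration, not a recommendation, and whether it is actually enough for this exact pattern would need to be checked):

fn main() {
    // Raise the compiled-program size limit from the default 10 MiB to an
    // arbitrary 64 MiB. Whether this is enough for a given pattern has to
    // be verified empirically, so handle the error instead of unwrapping.
    let result = regex::RegexBuilder::new(r"^(?s:.){0,10000}$")
        .size_limit(64 * (1 << 20))
        .build();
    match result {
        Ok(re) => println!("compiled; empty input matches: {}", re.is_match("")),
        Err(e) => println!("still too big: {}", e),
    }
}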

and be reasonably small

But the state machine isn't reasonably small. Firstly, `.` matches any UTF-8 encoding of any Unicode scalar value. So that on its own bloats the state machine itself. Secondly, the `{0,10000}` is saying that any number of Unicode scalar values between 0 and 10000 will reach a match state. Even if you minimize the corresponding DFA (as a proxy for thinking concretely about its size; a full DFA isn't actually compiled by this crate), you're still looking at 80,000+ states. So there is no theoretically better approach using general state machines.

Now in theory, the regex engine could detect this particular variant of a regex and implement something more optimized that doesn't use a state machine. But it's a narrow enough optimization that it would overall be pretty brittle. I don't know whether it's worth doing. Maybe this formulation is common enough that it's worth specializing this and other similarish patterns.

Finally, you could use `(?s-u:.)` to match any arbitrary byte. That should decrease the size required quite a bit (possibly an order of magnitude).
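A rough sketch of that, assuming the byte-oriented `regex::bytes` API (as far as I know, the str-level `Regex` rejects patterns that can match invalid UTF-8, which `(?s-u:.)` can). Note that this also changes the meaning: the pattern now bounds the length in bytes rather than codepoints.

fn main() {
    // Sketch: with Unicode disabled, `.` matches any single byte, so this
    // bounds the length in bytes, not codepoints. The byte-oriented API is
    // used because the str-level `Regex` refuses patterns that could match
    // invalid UTF-8.
    match regex::bytes::Regex::new(r"^(?s-u:.){0,10000}$") {
        Ok(re) => println!("compiled; matches 'hello': {}", re.is_match(b"hello")),
        Err(e) => println!("failed to compile: {}", e),
    }
}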

Today I wanted to accept any text as long as it is shorter than 10000 unicode codepoints.

A regex is really just the wrong tool for this. It would be soooo much faster to just do `s.chars().count() <= 10000`. The only reason anyone should use a regex for something like this is if you have no other choice, e.g., the regex is the only interface you have.
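For completeness, the direct check is trivial, and a variant that bails out early instead of counting a huge string to the end might look like this (a sketch; the helper name is made up):

/// Returns true if `s` contains at most `max` Unicode codepoints.
/// (Hypothetical helper, not part of any library.)
fn at_most_codepoints(s: &str, max: usize) -> bool {
    // `take(max + 1)` stops iterating as soon as the limit is exceeded.
    s.chars().take(max + 1).count() <= max
}

fn main() {
    assert!(at_most_codepoints("hello", 10000));
    assert!(!at_most_codepoints("abc", 2));
}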

demurgos commented 3 years ago

Thank you for your quick reply.

A regex is really just the wrong tool for this. It would be soooo much faster to just do `s.chars().count() <= 10000`. The only reason anyone should use a regex for something like this is if you have no other choice, e.g., the regex is the only interface you have.

First of all, I definitely agree with you here. The issue came up because I only had regexes as built-in checks for strings in the framework I used. It has worked well enough so far, but it will definitely have to change (this is how I will fix my immediate issue).

I still opened the issue because it looks to me like a situation that could be handled more efficiently.

Reading your answer and thinking more about it, it's obvious that expanding `{n,m}` quantifiers will quickly generate a lot of states. I naively thought they could be collapsed into a counter... but yes: then it is no longer a graph-like state machine, and it would only work for specific cases. Figuring out such optimizations in a general way seems like a huge amount of work.

If you feel that the issue still has some merit, you may leave it open; otherwise it's OK for me to close it since it does not seem actionable to me.

BurntSushi commented 3 years ago

I'll leave it open for now, because I do think "regex is the only available/convenient interface" is pretty common, and in that circumstance, it's not too uncommon to want to use that interface to put limits on the lengths of things. I did mark it as a "question" though, because I'm not totally convinced we should do something here. But if there is a simple characterization that covers simple use cases, e.g., of the form `^(<match at most one codepoint>){m,n}$`, then I think we could implement that specific case more efficiently without too much trouble. (Famous last words.)

Bounded repetitions are indeed a difficult area for regex engines to deal with. There are some techniques for handling them and they do indeed involve counters of some sort. Generally speaking, it's done by making the NFA simulation more powerful than a plain state machine. As it stands now, bounded repetitions are really just like macros that do the "obvious" replacement. e.g., `x{2,5}` is identical to `xxx?x?x?`.
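To make that macro-like expansion concrete, here is a small sketch (the bounds are arbitrary) showing that larger upper bounds produce larger compiled programs until the default 10 MiB size limit is exceeded:

fn main() {
    // Each increase in the upper bound adds more copies of the expanded
    // `(?s:.)?` to the compiled program, so large bounds eventually trip
    // the default size limit. The bounds below are arbitrary.
    for &n in &[10, 100, 1_000, 10_000] {
        let pattern = format!(r"^(?s:.){{0,{}}}$", n);
        match regex::Regex::new(&pattern) {
            Ok(_) => println!("{:>6}: compiled within the default size limit", n),
            Err(e) => println!("{:>6}: {}", n, e),
        }
    }
}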