RegexSet does not benefit from perf-literal

What version of regex are you using?

1.5.6

Describe the bug at a high level.

When building a RegexSet, with the perf-literal feature enabled, aho-corasick isn't used to match literals from different input patterns, even when all input patterns are literals.

From the discussion in #822 I expected that a RegexSet would behave identically to a Regex composed of the same patterns joined with |, except for the caveats re: capture groups discussed there.

What are the steps to reproduce the behavior?

fn main() {
    let mut some_literals: Vec<String> =
        (0..32).into_iter().map(|x| (1 << x).to_string()).collect();
    some_literals.append(
        &mut (["literal1", "literal2", "literal3", "asdf"]
            .iter()
            .map(|&s| s.to_owned())
            .collect::<Vec<String>>()),
    );
    let single_pat = some_literals.clone().join("|");
    let re_set = regex::RegexSetBuilder::new(some_literals.clone()).build().unwrap();
    let lone_re = regex::RegexBuilder::new(&single_pat).build().unwrap();
    println!(
        "{:#?}",
        re_set
            .matches("asdf literal1234")
            .iter()
            .collect::<Vec<_>>()
    );
        println!(
        "{:#?}",
        lone_re
            .captures("asdf literal1234")
            .iter()
            .collect::<Vec<_>>()
    );
}

What is the actual behavior?

Inspecting re_set and lone_re in a debugger shows that in the former, ac == None, while in the latter it's Some(...).

Further, while testing this, I learned that including a single non-literal pattern (in my case, \d+) in the some_literals list prevent it from being used at all. While that might be implied by the documentation, it surprised me.

note:

I ended up at this cursed juncture while trying to understand why a RegexSet I'm building in another project was using around 1.2 GB of RAM to match 12k regexes consisting of around 700kb of literal text with about 10 thousand number-matching capture groups interspersed.

Overall, I think you expect way too much of RegexSet, at least at this current point. You can't really expect to throw 12,000 regexes at something and have it just work without using a lot of memory. It should be workable in the case where they are all literal patterns that we could just defer to Aho-Corasick, and that would be a useful optimization to make. Currently, RegexSet does inhibit some optimizations, but these were decisions I made years ago. I don't plan on revisiting them until #656 is in place.

If you do have all literal patterns then I would encourage you to just use the aho-corasick crate directly.

Further, while testing this, I learned that including a single non-literal pattern (in my case, \d+) in the some_literals list prevent it from being used at all. While that might be implied by the documentation, it surprised me.

Which docs? I don't think there are any docs that discuss the Aho-Corasick optimization or give any guarantees to that effect. (I see that the perf-literal docs discuss the aho-corasick crate, but it's otherwise sufficiently vague that I don't think you can draw any conclusion from it other than "it might help me.")

Also, if you do add a \d+ regex to a list of patterns that are otherwise literals, then yeah, that's going to defeat this specific literal optimization anyway. (By the way, \d is Unicode aware! Which makes it even hard to optimize.) Building an optimization for the case where one of the patterns is \d+ and the rest are literals is I believe possible, but is non-trivial and requires a totally different style of regex engine that intertwines Aho-Corasick with the regex NFA/DFA itself.

Now, if all of your patterns aren't literals but they do all have meaningful literal prefixes, then that is another avenue that could be sped up by literal optimizations.

I ended up at this cursed juncture while trying to understand why a RegexSet I'm building in another project was using around 1.2 GB of RAM to match 12k regexes consisting of around 700kb of literal text with about 10 thousand number-matching capture groups interspersed.

This crate just does not have the sophistication to handle this kind of workload. It may handle it a little better some day, but the fully general case is unlikely to improve much. You might instead consider Hyperscan, which is best in class for this domain.

I'll leave this open though because there are some improvements that can be made here (including your suggestion to use Aho-Corasick when you just have a bunch of literals), but I'm blocking it on #656.

Also, FWIW, the case of "lit1|lit2|...|litN" deferring to Aho-Corasick completely was an optimization I added a while back to try and make a common case of ripgrep faster. It was the end result of modifying the Aho-Corasick algorithm to support leftmost-first matching. (Without that, standard Aho-Corasick won't do the same thing as a regex engine, and thus you can't just use it in place of the regex engine so simply.) I did it in a pretty non-robust way, which is why it only works for a single pattern. A lot about this specific optimization could be improved, including applying it to more cases and to RegexSet. (Right now, it specifically prevents itself from applying to RegexSet.) This optimization also doesn't prevent additional compilation of regex engines internally, even though we may not need them at all.

So overall there is a lot to rethink here and that's part of why I'm working on #656. Reasoning about these optimizations is difficult and the current internal infrastructure isn't set up to be composable at all. The idea with #656 is to lower the barriers of composition and hopefully give more room to experiment and implement optimizations like this.

Overall, I think you expect way too much of RegexSet, at least at this current point. You can't really expect to throw 12,000 regexes at something and have it just work without using a lot of memory.

I thought there was a decent chance it wouldn't work going in, so I'm not entirely surprised.

Also, if you do add a \d+ regex to a list of patterns that are otherwise literals, then yeah, that's going to defeat this specific literal optimization anyway. (By the way, \d is Unicode aware! Which makes it even hard to optimize.) Building an optimization for the case where one of the patterns is \d+ and the rest are literals is I believe possible, but is non-trivial and requires a totally different style of regex engine that intertwines Aho-Corasick with the regex NFA/DFA itself.

That sort of intertwining is what I was hoping would happen, I think.

My actual problem involves matching against a set of strings containing format specifiers, to determine which string was used and extract the value(s) formatted into them. Fortunately, the placeholders almost all represent numbers, or else one of a few hundred fixed strings.

What I've ended up doing instead is breaking up my patterns into "literal fragments" (i.e. splitting them where the format specifiers are), feeding them into aho-corasick with MatchKind::LeftmostLongest and keeping track of which ordered combination of fragments map to each original pattern. Then, I can take the gaps between fragment matches and run the appropriate non-literal parser (e.g. \d+) on it to get the varying data.

This seems to work, and the state size is small enough that I can use build_with_size::<u32, _, _> to get a reasonably compact searcher. heap_bytes() returns about 3.5 MB, and the total program RSS is about 50 MB (to give a more direct comparison to the 1.2 GB number above).

It sounds like the original problem here has been worked around. Otherwise, I'm not sure there's anything actionable here for the regex crate.

rust-lang / regex