rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0
3.52k stars 439 forks source link

investigate using boyer moore (for real) on longer query strings #408

Closed BurntSushi closed 6 years ago

BurntSushi commented 7 years ago

See: https://github.com/BurntSushi/ripgrep/issues/617

ethanpailes commented 6 years ago

I'm not sure that this will ever be worth it. I implemented tuned Boyer Moore in a branch on my fork to see if I could produce a speedup that could live behind a heuristic for pattern length, but the only case where I could get my TBM to win consistently was on the random data and large needle case. Whenever the text looked like natural language or the needle was small, TBM lost.

When I used a needle size of 1000 TBM won on both random bytes and lorum ipsum, but it still lost on Sherlock.

I can't thing of any real use case where you are searching in random seeming data (maybe looking for a specific base64 encoded string in a big blob of base64 data). I would say genetic data is a use case, but according to J Moore BM and friends are not the right choice for that either.

Alternatively, I could have missed some optimizations that seem clearer to someone with more experience writing fast string search algorithms.

BurntSushi commented 6 years ago

I can't thing of any real use case where you are searching in random seeming data (maybe looking for a specific base64 encoded string in a big blob of base64 data).

One case I can think of is when you're searching for a hash in a file that contains a list of hashes.

Another case that might be worth benchmarking is a needle that contains only common characters. There might be a win there with TBM that doesn't use memchr at all (because memchr has some amount of overhead, so if you call it frequently enough, it should be self defeating). But to be clear, this is only my intuition, which could be very wrong!

Alternatively, I could have missed some optimizations that seem clearer to someone with more experience writing fast string search algorithms.

Just from glancing over the code, it looks pretty good to me. Very nice work. Do you want to work on trying to get it merged? We could start conservatively by only using BM when the needle is over a certain size. I'm not sure if I'd want it to be a separate crate just yet, but we could certainly work toward that.

ethanpailes commented 6 years ago

Another case that might be worth benchmarking is a needle that contains only common characters.

That's a good point. I will write a benchmark for that to see if it is a win. Another nice thing about that is that it would be pretty easy to have a factory method that looks at the pattern and switches to TBM if it does not have a sufficiently rare char.

Just from glancing over the code, it looks pretty good to me. Very nice work. Do you want to work on trying to get it merged? We could start conservatively by only using BM when the needle is over a certain size.

Thanks! I would certainly love to get it merged, but only if I'm convinced it will provide a real win. Unfortunately, even with a needle size of 100 memchr still won (or at least didn't loose), so I'm not sure that a length based heuristic would be small enough to be worth the maintenance burden.

I'm not sure if I'd want it to be a separate crate just yet, but we could certainly work toward that.

Obviously, I'm fine with the code going wherever you would prefer. It does seem like a nice eventual result to be able to pull in fast string searching from its own crate, but right now people can just make a Regex out of a literal, so there is not a big need.

It will take me a little to write that benchmark you brought up, but I should have new results for you relatively soon.

ethanpailes commented 6 years ago

I wrote up a benchmark to find the turnover point where TBM starts winning because the pattern contains all uncommon chars. I did a binary search by twiddling the RANK var here to zero in on a good heuristic value. My results are here. I used cargo benchcmp memchr_{RANK} bm_{RANK} to view the results. It looks like a RANK of 242 is a good turnover point.

Avoiding some pathological cases seems like a big enough win to be worth trying to merge this. If you want to move forward I can merge the BoyerMooreSearcher back into src/literal.rs, hide it behind a rank heuristic and then open a PR.

ethanpailes commented 6 years ago

It occurs to me that I should probably also add a minimum needle length to the heuristic (memchr is certainly going to be faster at finding "e" than TBM, even though 'e' is pretty common). I need to do a little more work to figure out what a good value for that is.

BurntSushi commented 6 years ago

@ethanpailes Oh just seeing this now. Do you want me to kill the merge to let you work on it more, or would you prefer to do it in another PR?

ethanpailes commented 6 years ago

@BurntSushi, It looks like a bug in the windows build saved me. I'll add the length heurstic along with the windows fix.

ethanpailes commented 6 years ago

@BurntSushi, is this issue good to be close?

BurntSushi commented 6 years ago

Ah yup! Your investigation was exquisite. :)

ethanpailes commented 6 years ago

Thanks! I definitely wouldn't have felt up to it if it wern't for your awesome maintainer-ship.