p-ranav / hypergrep

Recursively search directories for a regex pattern
MIT License

Update benchmark with ugrep 6.1 which is faster than hypergrep #44


genivia-inc commented 10 months ago

Nice project!

I came across it because a ugrep user found a Reddit post about the project. Thanks for including ugrep in the benchmarks; it is always interesting to see how ugrep compares. If there are bottlenecks to its speed, I put them on the TODO list to resolve (see my note below).

The ugrep version used in these benchmarks isn't up to date. Faster search methods and optimizations were implemented in ugrep 3.12 and continued with 4.0. You may want to rerun the benchmarks with the latest ugrep, e.g. v4.0.5. The old version 3.11 had an issue picking the wrong search algorithm in some cases, falling back to the slower default method.

Note that there are also plans to enhance ugrep's search speed further with some tricks other greps use, as well as a new method for pattern cases that ugrep doesn't yet handle well (leading wildcards such as .*foo, which isn't optimized at all but is easy to handle by searching for the literal foo; see the sketch below). These plans were (and some still are) delayed to give priority to new and improved features while ensuring reliability. I won't list everything that was added over time, since this issue is about hypergrep, not ugrep. Getting high performance and lots of new features at the same time is a tricky combination.
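To illustrate that trick: on a single line, every match of .*foo ends at an occurrence of the literal foo, so the searcher can hunt for the literal and treat the leading wildcard as implied. A minimal sketch of the idea (the function name is illustrative; this is not ugrep's actual code):

```cpp
#include <string_view>

// Hypothetical illustration: within one line, ".*foo" matches if and only
// if the literal "foo" occurs somewhere in the line, so a plain substring
// search replaces the regex engine for this pattern class.
bool line_matches_dotstar_foo(std::string_view line) {
    // A real grep would use a SIMD-accelerated memmem-style scan here.
    return line.find("foo") != std::string_view::npos;
}
```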

People always ask for new features or for improvements to them; they care a bit less about speed. Most often, if they really want speed, it's usually to search "cold" file systems. Then qgrep might be a good choice, or perhaps ugrep-indexer.

I am curious: is hypergrep based on Hyperscan? EDIT: sorry, dumb question, I see the answer up front in the README now. I wanted to ask because Hyperscan's "simple grep" example works OK, but isn't stellar.

p-ranav commented 10 months ago

Hello!

Yes, I'd be happy to update my benchmarks using the latest ugrep v4.0.5. I'm excited for your plans to further speed up ugrep! I agree, it's tricky balancing features with performance. New features often add performance penalties.

And yes, hypergrep uses Intel Hyperscan as its regex engine. I suspect their simple example isn't particularly optimized for large-file search, or indeed for git repositories.

genivia-inc commented 10 months ago

Yes. Simple grep is just a demo application. It searches large files very quickly, but it lacks too many features for a comparison to be worthwhile. There is no line-counting overhead whatsoever, so it is definitely fast.
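For context, Hyperscan's simple grep demo essentially reduces to compiling one pattern and scanning a buffer in block mode. A rough sketch of that API usage follows (a sketch only, with error handling trimmed; not the demo's or hypergrep's exact code):

```cpp
#include <hs/hs.h>   // Hyperscan; header path may vary per install
#include <cstdio>

// Match callback: just count matches and continue scanning.
static int on_match(unsigned int id, unsigned long long from,
                    unsigned long long to, unsigned int flags, void *ctx) {
    ++*static_cast<unsigned long long *>(ctx);
    return 0; // returning non-zero would stop the scan
}

// Compile one pattern and scan one in-memory block with it.
unsigned long long count_matches(const char *pattern,
                                 const char *data, unsigned int len) {
    hs_database_t *db = nullptr;
    hs_compile_error_t *err = nullptr;
    if (hs_compile(pattern, HS_FLAG_DOTALL, HS_MODE_BLOCK,
                   nullptr, &db, &err) != HS_SUCCESS) {
        std::fprintf(stderr, "compile error: %s\n", err->message);
        hs_free_compile_error(err);
        return 0;
    }
    hs_scratch_t *scratch = nullptr;
    hs_alloc_scratch(db, &scratch);
    unsigned long long count = 0;
    hs_scan(db, data, len, 0, scratch, on_match, &count);
    hs_free_scratch(scratch);
    hs_free_database(db);
    return count;
}
```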

I don't know if you're interested in any of the following details; I just wanted to convey ongoing efforts and short-term plans.

Motivation: I am very interested in figuring out new ways to speed up pattern matching algorithmically and trying them out in practice. Ugrep was born from this about four years ago (unoptimized at the time), to try something new I came up with, which I call PM for "predict match". I got ugrep going based on a demo grep I wrote for RE/flex. I want to try hypergrep instead of simple grep to see where we stand now and what needs to be tweaked in ugrep, besides comparing it to some other optimizations that are or were planned but haven't been released yet.

Background: I've extensively used simple grep to evaluate and compare the speed of matching/search methods. Some early R&D work I did on a new regex search method focused primarily on classes of patterns that involve short and long strings in growing numbers (e.g. sets of words of various sizes). The bitap method and its derivatives are really poor at this when the window is short, since the window is limited to the shortest word in the set. I spent time looking for better ways to compete with Hyperscan and other grep tools, and eventually found that PM-4 is superior when matching word sets of up to 2k words or so, and it does not need SIMD to run fast. It uses a Bloom filter of sorts for short 1-char to 4-char word patterns simultaneously, with bit banging. This is combined with bitap, depending on a formula that looks at the entropy of the pattern to decide which method should be used. When this combination is used, it is faster than Hyperscan (i.e. simple grep) by predicting at which positions a match may occur.
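As a reference point for the bitap remarks above, the classic shift-and (bitap) scan for a single pattern looks like this; a sketch, not ugrep's code. For a word set, the same machinery is limited to a window no longer than the shortest word, which is exactly the weakness described:

```cpp
#include <cstdint>
#include <string_view>

// Shift-and (bitap) for one pattern of 1..64 bytes: bit i of `state` is
// set when the last i+1 input bytes equal the first i+1 pattern bytes.
// Returns the match start offset, or -1 if there is no match.
long bitap_find(std::string_view text, std::string_view pat) {
    uint64_t mask[256] = {0};
    for (size_t i = 0; i < pat.size(); ++i)
        mask[static_cast<unsigned char>(pat[i])] |= 1ULL << i;
    const uint64_t accept = 1ULL << (pat.size() - 1);
    uint64_t state = 0;
    for (size_t i = 0; i < text.size(); ++i) {
        state = ((state << 1) | 1) & mask[static_cast<unsigned char>(text[i])];
        if (state & accept)
            return static_cast<long>(i - pat.size() + 1);
    }
    return -1;
}
```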

What's next: It might be that other algorithmic combinations implemented in SIMD (AVX and AArch64) are better, or not. I've tried bitap-based methods in pure AVX2 with the 256-byte table held in vectors, so there are zero memory loads except for the input, but it ran in about the same time as the scalar version. The latest ugrep v4 uses a very different way to match patterns, using AVX and AArch64 instructions that the decision logic picks when (roughly) deemed best for matching.

Anyway, I will try Hyperscan in about two to three weeks (I am on a break) and see what happens.

genivia-inc commented 10 months ago

I just wanted to drop a short note that, after a bit of tweaking and the addition of two common grep-like optimizations, the upcoming ugrep v4.1 release is pretty fast on x64 and arm64, and almost always faster than other fast grep tools; see the preview report posted here: https://github.com/Genivia/ugrep/issues/289

The difficulty is sometimes picking the number of threads and the search method that work best.

Finally, it might again be important to point out for your benchmarks that the \w+foo pattern is slow in ugrep, as I've already indicated, but those optimizations have to wait until I'm back in the office.

PS. I hope you have a great week and happy coding!

genivia-inc commented 9 months ago

OK. As promised, we dropped a new ugrep v4.3 with a performance report showing that ugrep is nearly always faster than other greps on x64 and arm64 systems. I have not been able to compare against hypergrep because it doesn't compile (#45).

Another update v4.3.2 will be available soon to speed up recursive search further with improved thread scheduling.

Please note that there is one remaining case of slow regex patterns: patterns like \w+foo that suffer from backtracking. This will be attacked next, in a future release update (it works alright with option -P). Using them specifically in a benchmark makes very little sense at the moment, although \w*foo can be used instead, which doesn't have this pathological backtracking issue but also isn't as fast as it could be. Regex search and matching is done with my regex DFA project RE/flex, which is pretty fast for regex matching, but search is a different beast that "eats" heuristics, approximate matching with hashing/filters/bitap, SIMD, etc.

genivia-inc commented 4 months ago

FYI. ugrep 5.0 was released some time ago, an important milestone.

This release fixes the performance problems with "leading wildcard" patterns such as \w+foo, and with many other patterns like it that caused excessive backtracking in previous ugrep releases. We did more benchmarking to test and compare performance; see the updated performance report.

genivia-inc commented 3 weeks ago

FYI. ugrep 6.1 was released and is faster than hypergrep for several test cases we tried.

AngryLoki commented 1 week ago

@genivia-inc, I checked the latest commit of hypergrep (with stable Vectorscan) against the latest commit of ugrep; ugrep was slower.

When searching in /usr, the number of results differed because hypergrep has a relaxed implementation of https://github.com/p-ranav/hypergrep/blob/master/src/is_binary.cpp (and hypergrep can't search binary files, since that behavior is hardcoded), but hypergrep scanned more files there and was still 3 times faster.

Vectorscan was compiled with BUILD_AVX512VBMI=ON and tested on an AMD 7950X3D. Knowing how Vectorscan works (the same people contributed to simdjson), I can hardly believe that a SIMD-accelerated pattern matching algorithm could be slower than a serial algorithm (though there may well be cases where a regex is inefficient on a particular engine).

I think it is worth comparing RE/flex against Vectorscan. Back in 2019, when the paper https://www.usenix.org/system/files/nsdi19-wang-xiang.pdf was released, Hyperscan was 183.3x faster than PCRE2 on one dataset.

```
time sudo ./ug -r -I -cm1, "(https?|ftp)://[^\s/$.?#].[^\s]*" /usr | wc -l
123906

real    0m1.698s
user    0m0.131s
sys     0m0.166s

time sudo ./hgrep -c "(https?|ftp)://[^\s/$.?#].[^\s]*" /usr | wc -l
124137

real    0m0.517s
user    0m0.036s
sys     0m0.034s

time sudo ./ug -r -I -cm1, "testtest" /usr | wc -l
10

real    0m0.783s
user    0m0.027s
sys     0m0.018s

time sudo ./hgrep -c "testtest" /usr | wc -l
10

real    0m0.495s
user    0m0.026s
sys     0m0.020s

time sudo ./ug -r -I -cm1, "(https?|ftp)://[^\s/$.?#].[^\s]*" /usr/src/linux-6.10-rc2/ | wc -l
9277

real    0m0.298s
user    0m0.051s
sys     0m0.040s

time sudo ./hgrep -c "(https?|ftp)://[^\s/$.?#].[^\s]*" /usr/src/linux-6.10-rc2/ | wc -l
9277

real    0m0.186s
user    0m0.040s
sys     0m0.036s

time sudo ./ug -i -r -I -cm1, "linu[xs]" /usr/src/linux-6.10-rc2/ | wc -l
50981

real    0m0.543s
user    0m0.050s
sys     0m0.118s

time sudo ./hgrep -c -i "linu[xs]" /usr/src/linux-6.10-rc2/ | wc -l
50981

real    0m0.177s
user    0m0.032s
sys     0m0.032s
```
genivia-inc commented 1 week ago

Some observations:

All runs above are recursive searches, and all of them are over large file systems; not what a typical user does frequently. Such trees are also likely to contain a lot of binary files (which are skipped, so performance may depend more on how that is handled than on pattern search). These recursive searches are subject to many more unknown variables in how things actually perform. To avoid this kind of selective benchmarking, I set up a controlled experimental environment in the ugrep benchmark repo, with many different search scenarios typical of grepping, that people can download and try themselves.

For the first test, are you sure hgrep isn't also matching newlines in the pattern, so that the number of matches you get is larger? \s is a Unicode space, which includes \n. But ugrep implicitly excludes \n, to be compatible with GNU grep, unless a pattern explicitly includes \n to match.

In your comparison, there is no way to tell whether files were evicted from the file cache before or during a run. Runs are intended to measure search speed, not IO latency, which is unpredictable; sometimes one gets lucky (or not), and it may even be reproducible. Recursive search performance also depends on the allocation of buffer memory and worker threads, which in turn depends a lot on the type of CPU and machine you're using. Ugrep tries to guess how many workers should be assigned, but this is not perfect. Despite that, recursive searching easily leads to a 2x to 3x difference without anything more specifically optimized. You see this with ripgrep, whose recursive searching is 2x to 3x slower than ugrep in the benchmarks for the Swift repo search but not for the SSL repo search; it's not consistent, and that is fine. I picked the benchmark tests not to benefit ugrep, but to compare.

Search optimization with SIMD is mostly pattern prediction, not regex matching with a DFA. Simple vector operations that predict a possible match with byte comparisons and AND/OR logic are quick, and many times faster than classic regex pattern matching with a DFA and non-SIMD methods.
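To make "prediction" concrete, here is a minimal AVX2 sketch in that spirit: it flags offsets where two chosen needle bytes occur at a fixed distance, and each flagged offset must still be verified by the real matcher. This is an illustration of the technique, not ugrep's code; it assumes AVX2, a GCC/Clang builtin, and ignores the buffer tail for brevity:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Scan 32 bytes per iteration for positions where byte b0 is followed,
// k bytes later, by byte b1. Returns the first candidate offset, or len.
size_t find_candidate(const uint8_t *buf, size_t len,
                      uint8_t b0, uint8_t b1, size_t k) {
    const __m256i v0 = _mm256_set1_epi8(static_cast<char>(b0));
    const __m256i v1 = _mm256_set1_epi8(static_cast<char>(b1));
    for (size_t i = 0; i + 32 + k <= len; i += 32) {
        __m256i a = _mm256_loadu_si256(
            reinterpret_cast<const __m256i *>(buf + i));
        __m256i b = _mm256_loadu_si256(
            reinterpret_cast<const __m256i *>(buf + i + k));
        // AND the two byte-compare masks; any set bit is a candidate.
        uint32_t hits = static_cast<uint32_t>(_mm256_movemask_epi8(
            _mm256_and_si256(_mm256_cmpeq_epi8(a, v0),
                             _mm256_cmpeq_epi8(b, v1))));
        if (hits)
            return i + static_cast<size_t>(__builtin_ctz(hits));
    }
    return len; // no candidate; a full implementation would handle the tail
}
```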

AngryLoki commented 1 week ago

@genivia-inc, thank you for the quick response!

Regarding \n: noted, but it does not affect my regex (I used [^\n], and effectively every match was a single line). But hgrep -c "test\ntest" counts test<newline>test content as 1 match, while egrep counts 2 (and egrep -z counts 1).

Well, with a few more tests (launched a few times for warmup), I confirmed my theory that Vectorscan is fast when searching for needles in a haystack:

```
time ./hgrep -c -F -f /src/ugrep-benchmarks/words/4.txt /src/ugrep-benchmarks/corpi/enwik8
/src/ugrep-benchmarks/corpi/enwik8:13975
real    0m0.112s
user    0m0.128s
sys     0m0.030s

time ./ug  -c -F -f /src/ugrep-benchmarks/words/4.txt /src/ugrep-benchmarks/corpi/enwik8
13975
real    0m0.189s
user    0m0.152s
sys     0m0.040s
```

While the time explodes when the pattern is a lone . matching any character (effectively killing the benefit of SIMD):

```
time ./hgrep -c -e . /src/ugrep-benchmarks/corpi/enwik8
/src/ugrep-benchmarks/corpi/enwik8:919074
real    0m5.491s
user    0m8.723s
sys     0m0.797s

time ./ug  -c -e . /src/ugrep-benchmarks/corpi/enwik8
919074
real    0m0.069s
user    0m0.054s
sys     0m0.017s
```

And Vectorscan fails with "Error compiling pattern: Pattern is too large" for simple expressions like a{1,127}, due to its limitations: https://intel.github.io/hyperscan/dev-reference/compilation.html#supported-constructs ...

genivia-inc commented 2 days ago

Good point. You may have noticed by now that my intent with the change to the title of this issue was to poke a bit, as a reminder to update your README, as you may have had in mind many months ago. The comparison is badly out of date (by almost a year) and includes some known bad regex search cases for ugrep that were addressed long ago.

Ugrep is faster in some cases and not in others. I tested on a machine that supports AVX2. I don't test AVX512BW, which may actually be slower; I know why, and others have pointed it out too, but it is not a priority at this time because that CPU is not as widely used as SSE/AVX2 and AArch64.

> Well, with a few more tests (launched a few times for warmup), I confirmed my theory that Vectorscan is fast when searching for needles in a haystack:

Ugrep also uses SIMD in this case, but with a different approach to finding "needles", one that should give more stable performance across a wide range of regex patterns and words to search (for ugrep there is no difference between the two). In addition, a quick Bloom filter check improves match prediction, which matters even more when some regex pattern matches are short, e.g. one, two or three characters. Large word sets (>1000 words to search) also give stable performance numbers, even though SIMD becomes less useful or no longer applicable there; no sudden slowdowns to minutes instead of a fraction of a second.
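To sketch what a Bloom-filter-style prediction step can look like (an illustration of the general idea only, not ugrep's implementation): hash short byte windows of every pattern into a small bitset up front, then test each text window before invoking the real matcher. A miss definitely rules a position out; a hit still needs verification.

```cpp
#include <cstdint>
#include <string_view>

// Tiny Bloom-filter-like prefilter over a 4096-bit set. Usage sketch:
//   PrefilterSketch pf;  pf.add("foo");          // at pattern-compile time
//   if (pf.maybe(window)) run_real_matcher();    // per text window
struct PrefilterSketch {
    uint64_t bits[64] = {0};

    static uint32_t hash(std::string_view w) {
        uint32_t h = 2166136261u;             // FNV-1a over the window
        for (unsigned char c : w) h = (h ^ c) * 16777619u;
        return h & 4095;                      // fold into 4096 bits
    }
    void add(std::string_view w) {
        uint32_t h = hash(w);
        bits[h >> 6] |= 1ULL << (h & 63);
    }
    bool maybe(std::string_view w) const {
        uint32_t h = hash(w);
        return (bits[h >> 6] >> (h & 63)) & 1;
    }
};
```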

To comment on the last reply: the time difference observed in your first example may not be due to SIMD; it more likely depends on 1) the buffer size, since system time is 40ms higher, and 2) the SIMD match method used. Even the exact same method depends on a heuristic choice of needles to find, which works very well in general, but there is variation because, well, it's a heuristic, so it's never always perfect. Hence, one will find regex patterns that work better for one tool but not for another. Recursive searching is even less dependent on SIMD; directory traversal order and worker thread assignment for load balancing are greater factors. It is very difficult to generically optimize for all OSes and CPUs so as to always run faster.