ggreer / the_silver_searcher

A code-searching tool similar to ack, but faster.
http://geoff.greer.fm/ag/
Apache License 2.0
26.19k stars 1.43k forks source link

How to sponsor fixing bugs exposed when searching large files? #1437

Open gcflymoto opened 3 years ago

gcflymoto commented 3 years ago

I really love "ag" and use it every day for work as a developer, but it has limitations when searching big files. I have put together my own fork with many PRs which haven't made it into this repo, and it gets a lot closer. It includes PCRE2 from #1035 but it still has trouble with big files. Any way I can sponsor somebody and how do I go about it?

Test Greer Greer Regex w/PCRE Gcflymoto Gcflymoto Regex w/PCRE Gcflymoto Regex w/PCRE2
Small File Pass Pass Pass Pass Pass
Small File.gz Fail Fail Pass Pass Pass
Small File.bz2 Fail Fail Fail Fail Fail
Small File.xz Fail Fail Pass Pass Pass
Big File Fail Not Supported Pass Not Supported Fail
Big File.gz Fail Not Supported Pass Not Supported Fail
Big File.bz2 Fail Not Supported Fail Not Supported Fail
Big File.xz Fail Not Supported Pass Not Supported Fail

Much thanks to @aswild @napcode @afredd @mwagnell @jtamboli @thebaptiste for providing the PRs!!

@aswild Add support for libpcre2 (compiles and works) #1035 @aswild fix compile warnings on GCC10 #1398 @aswild search_file: check return code of read for failure #1260 @napcode Fixes overflows for large and compressed files #1221 @afredd Rework is_binary() #204 @mwagnell Segmentation fault when searching an xlsx file with -z option #1349 @jtamboli Include file names & line numbers when searching paths, even if stream is present #1433 @thebaptiste To be able to build with lzma.h missing or zlib.h missing or both mis… #1353

aswild commented 3 years ago

Hey, thanks for the shout-out. One word of warning - most of my open PRs on this repo are pretty old and in most cases I have better versions of them in my own fork. I sorta lost interest in pushing PRs here due to the general trend of only small tweaks being accepted rather than larger features/fixes/refactors.

Anyway, if you want to keep rolling with PCRE2 in your fork, check out https://github.com/aswild/the_silver_searcher/commit/bbd3ee46113e3e7f13a75ba65fa35da06b3fe2d1 where I removed PCRE1 support and got considerable speedup by reducing heap thrashing on every call to pcre2_exec.

As far as searching big files goes, I think that's a fundamental/hard problem in ag. Except when searching stdin in stream mode, the whole file to be searched has to be buffered in memory, either with mmap or by reading it into the heap. There's even a check that the file length isn't greater than INT_MAX due to PCRE limitations*. Naturally, this leads to memory pressure and performance issues on some systems. This is further complicated by the way ag saves matches as pointers into the full file buffer, and since all matches are collected before stdout is mutex-locked, it means the whole buffer has to wait around in memory before any of its matches get printed to the console.

I don't mean to sound discouraging, these aren't insurmountable roadblocks, but they are the product of multiple interlocking facets of ag's design that would require significant effort to refactor.

* I just realized that this limitation isn't needed in PCRE2 and I should probably remove it in my fork.

gcflymoto commented 3 years ago

Hi Allen, wow that's a fantastic repo you have. Yes, I'll attempt to merge in your changes, it will take me a while (vacation tomorrow) but excited to see it perform better.

There's even a check that the file length isn't greater than INT_MAX due to PCRE limitations*I removed that after applying your original PCRE2 patch. but they are the product of multiple interlocking facets of ag's design that would require significant effort to refactorGot it, I'm not sure yet if I would mind if AG had to read the whole file into the heap as long as it free'd it after searching. The majority of my "big" file searches are right over the "big_file" hump and under 20GB. The systems have enough memory to cover putting it in the heap, but I understand if the current design would start triggering limitations when searching large files.

PS. I see good performance gains when generating and consuming profiling results.

 Test

ICC 2019

-O3 -ipo -axCORE-AVX2 -xSSE4.2 -qopt-report=4 -qopt-report-phase ipo ICC 2019 -O3 -ipo -axCORE-AVX2 -xSSE4.2 -ansi-alias -qopt-report=4 -qopt-report-phase ipo -prof-use -prof-dir=$PWD ICC 2019 -O2 -ipo -axCORE-AVX2 -xSSE4.2 -ansi-alias -qopt-report=4 -qopt-report-phase ipo -prof-use -prof-dir=$PWD GCC 9.2-O3 -flto -g0 -march=native -mtune=native GCC 9.2-O2 -flto -g0 -march=native -mtune=native GCC 9.2-O3 -flto -g0 -march=native -mtune=native -fprofile-use

Regular

0:21.49
0:24.00 0:21.30
0:22.14 0:21.77
0:21.81
Small 0:48.44
0:53.45 0:36.53
0:39.70 0:40.29
0:31.08
Big 1:57.91
1:46.18 1:34.93
1:44.23 1:42.66

1:22.76 |

On Wednesday, March 17, 2021, 5:04:26 PM CDT, Allen Wild @.***> wrote:

Hey, thanks for the shout-out. One word of warning - most of my open PRs on this repo are pretty old and in most cases I have better versions of them in my own fork. I sorta lost interest in pushing PRs here due to the general trend of only small tweaks being accepted rather than larger features/fixes/refactors.

Anyway, if you want to keep rolling with PCRE2 in your fork, check out @.*** where I removed PCRE1 support and got considerable speedup by reducing heap thrashing on every call to pcre2_exec.

As far as searching big files goes, I think that's a fundamental/hard problem in ag. Except when searching stdin in stream mode, the whole file to be searched has to be buffered in memory, either with mmap or by reading it into the heap. There's even a check that the file length isn't greater than INT_MAX due to PCRE limitations*. Naturally, this leads to memory pressure and performance issues on some systems. This is further complicated by the way ag saves matches as pointers into the full file buffer, and since all matches are collected before stdout is mutex-locked, it means the whole buffer has to wait around in memory before any of its matches get printed to the console.

I don't mean to sound discouraging, these aren't insurmountable roadblocks, but they are the product of multiple interlocking facets of ag's design that would require significant effort to refactor.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

aswild commented 3 years ago

AFAIK, files are free'd (or munmap'd) when ag is finished with them, i.e. after their results have been printed. But there can be one allocated file per worker thread, and by default there's one worker thread per CPU core. If you have a lot of CPUs and a lot of big files in the directory being searched, things could start adding up. At least with mmap, the kernel can page in/out file data as needed without exploding the heap and potentially thrashing swap.