How to sponsor fixing bugs exposed when searching large files?

gcflymoto commented 3 years ago

I really love "ag" and use it every day for work as a developer, but it has limitations when searching big files. I have put together my own fork with many PRs which haven't made it into this repo, and it gets a lot closer. It includes PCRE2 from #1035 but it still has trouble with big files. Any way I can sponsor somebody and how do I go about it?

Test	Greer	Greer Regex w/PCRE	Gcflymoto	Gcflymoto Regex w/PCRE	Gcflymoto Regex w/PCRE2
Small File	Pass	Pass	Pass	Pass	Pass
Small File.gz	Fail	Fail	Pass	Pass	Pass
Small File.bz2	Fail	Fail	Fail	Fail	Fail
Small File.xz	Fail	Fail	Pass	Pass	Pass
Big File	Fail	Not Supported	Pass	Not Supported	Fail
Big File.gz	Fail	Not Supported	Pass	Not Supported	Fail
Big File.bz2	Fail	Not Supported	Fail	Not Supported	Fail
Big File.xz	Fail	Not Supported	Pass	Not Supported	Fail

HAVE_FOPENCOKKIE disabled because of https://github.com/ggreer/the_silver_searcher/issues/1349#issuecomment-626418179

Much thanks to @aswild @napcode @afredd @mwagnell @jtamboli @thebaptiste for providing the PRs!!

@aswild Add support for libpcre2 (compiles and works) #1035 @aswild fix compile warnings on GCC10 #1398 @aswild search_file: check return code of read for failure #1260 @napcode Fixes overflows for large and compressed files #1221 @afredd Rework is_binary() #204 @mwagnell Segmentation fault when searching an xlsx file with -z option #1349 @jtamboli Include file names & line numbers when searching paths, even if stream is present #1433 @thebaptiste To be able to build with lzma.h missing or zlib.h missing or both mis… #1353

aswild commented 3 years ago

Hey, thanks for the shout-out. One word of warning - most of my open PRs on this repo are pretty old and in most cases I have better versions of them in my own fork. I sorta lost interest in pushing PRs here due to the general trend of only small tweaks being accepted rather than larger features/fixes/refactors.

Anyway, if you want to keep rolling with PCRE2 in your fork, check out https://github.com/aswild/the_silver_searcher/commit/bbd3ee46113e3e7f13a75ba65fa35da06b3fe2d1 where I removed PCRE1 support and got considerable speedup by reducing heap thrashing on every call to pcre2_exec.

As far as searching big files goes, I think that's a fundamental/hard problem in ag. Except when searching stdin in stream mode, the whole file to be searched has to be buffered in memory, either with mmap or by reading it into the heap. There's even a check that the file length isn't greater than INT_MAX due to PCRE limitations*. Naturally, this leads to memory pressure and performance issues on some systems. This is further complicated by the way ag saves matches as pointers into the full file buffer, and since all matches are collected before stdout is mutex-locked, it means the whole buffer has to wait around in memory before any of its matches get printed to the console.

I don't mean to sound discouraging, these aren't insurmountable roadblocks, but they are the product of multiple interlocking facets of ag's design that would require significant effort to refactor.

* I just realized that this limitation isn't needed in PCRE2 and I should probably remove it in my fork.

gcflymoto commented 3 years ago

Hi Allen, wow that's a fantastic repo you have. Yes, I'll attempt to merge in your changes, it will take me a while (vacation tomorrow) but excited to see it perform better.

There's even a check that the file length isn't greater than INT_MAX due to PCRE limitations*I removed that after applying your original PCRE2 patch. but they are the product of multiple interlocking facets of ag's design that would require significant effort to refactorGot it, I'm not sure yet if I would mind if AG had to read the whole file into the heap as long as it free'd it after searching. The majority of my "big" file searches are right over the "big_file" hump and under 20GB. The systems have enough memory to cover putting it in the heap, but I understand if the current design would start triggering limitations when searching large files.

PS. I see good performance gains when generating and consuming profiling results.

Test

ICC 2019

-O3 -ipo -axCORE-AVX2 -xSSE4.2 -qopt-report=4 -qopt-report-phase ipo	ICC 2019 -O3 -ipo -axCORE-AVX2 -xSSE4.2 -ansi-alias -qopt-report=4 -qopt-report-phase ipo -prof-use -prof-dir=$PWD	ICC 2019 -O2 -ipo -axCORE-AVX2 -xSSE4.2 -ansi-alias -qopt-report=4 -qopt-report-phase ipo -prof-use -prof-dir=$PWD	GCC 9.2-O3 -flto -g0 -march=native -mtune=native	GCC 9.2-O2 -flto -g0 -march=native -mtune=native	GCC 9.2-O3 -flto -g0 -march=native -mtune=native -fprofile-use

Regular

0:21.49

0:24.00	0:21.30

0:22.14	0:21.77

0:21.81

Small	0:48.44

0:53.45	0:36.53

0:39.70	0:40.29

0:31.08

Big	1:57.91

1:46.18	1:34.93

1:44.23	1:42.66

1:22.76 |

On Wednesday, March 17, 2021, 5:04:26 PM CDT, Allen Wild @.***> wrote:

Hey, thanks for the shout-out. One word of warning - most of my open PRs on this repo are pretty old and in most cases I have better versions of them in my own fork. I sorta lost interest in pushing PRs here due to the general trend of only small tweaks being accepted rather than larger features/fixes/refactors.

Anyway, if you want to keep rolling with PCRE2 in your fork, check out @.*** where I removed PCRE1 support and got considerable speedup by reducing heap thrashing on every call to pcre2_exec.

As far as searching big files goes, I think that's a fundamental/hard problem in ag. Except when searching stdin in stream mode, the whole file to be searched has to be buffered in memory, either with mmap or by reading it into the heap. There's even a check that the file length isn't greater than INT_MAX due to PCRE limitations*. Naturally, this leads to memory pressure and performance issues on some systems. This is further complicated by the way ag saves matches as pointers into the full file buffer, and since all matches are collected before stdout is mutex-locked, it means the whole buffer has to wait around in memory before any of its matches get printed to the console.

I don't mean to sound discouraging, these aren't insurmountable roadblocks, but they are the product of multiple interlocking facets of ag's design that would require significant effort to refactor.

I just realized that this limitation isn't needed in PCRE2 and I should probably remove it in my fork.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

aswild commented 3 years ago

AFAIK, files are free'd (or munmap'd) when ag is finished with them, i.e. after their results have been printed. But there can be one allocated file per worker thread, and by default there's one worker thread per CPU core. If you have a lot of CPUs and a lot of big files in the directory being searched, things could start adding up. At least with mmap, the kernel can page in/out file data as needed without exploding the heap and potentially thrashing swap.

ggreer / the_silver_searcher

How to sponsor fixing bugs exposed when searching large files? #1437