lizmat / App-Rak

21st century grep / find / ack / ag / rg on steroids
Artistic License 2.0
152 stars 7 forks source link

Performance issue with --pattern-from #32

Open Zer0-Tolerance opened 1 year ago

Zer0-Tolerance commented 1 year ago

Hi Liz,

Another quick fix , when you're using the --pattern-from with regex you're doing a loop with each one of the pattern rather than compiling a single regex with all pattern with | this creates a huge performance penalty because you checking the same file over and over based on the number of line in the input file.
Based on my test on a large CSV 93 MB it takes 294 secs against a 99 regex pattern file vs 84 sec if you convert this as a regex with |. Can you fix this ?

lizmat commented 1 year ago

Just not quite a quick fix, I'm afraid.

OOC, do any of those patterns consist of just looking for a literal string?

lizmat commented 1 year ago

A first stab at merging consecutive regexes into a single regex for multiple patterns

lizmat commented 1 year ago

0.2.14 now in the ecosystem. Please let me know how this affected the performance!

Zer0-Tolerance commented 1 year ago

Great I'll test again and let you know ? All patterns are regex.

Zer0-Tolerance commented 1 year ago

ok with the latest version it's better than before 129 sec now (as opposed to 284 sec) for the list of pattern vs 81 sec with the single regex with |.
This is the cmd I'm using: time rak --ignore-case --patterns-from=/tmp/file.txt /path/file.csv

lizmat commented 1 year ago

ok, that's better. By any chance, are one or more of the regexes literal? Consisting of just / foo / ? Because those are currently converted internally to *.contains("foo"), and thus create multiple regexes, instead of just a single one. Which would explain the remaining difference

Zer0-Tolerance commented 1 year ago

Sorry I forgot to add the type to the cmd so no pattern is interpreted as literal in my list right ? time rak --ignore-case --type=regex --patterns-from=/tmp/file.txt /path/file.csv

lizmat commented 1 year ago

With a literal regex I mean / foo / where "foo" consists of only alphanumeric characters. I doubt that if you don't change the file, and then do a --type=regex that it would produce the right results, or even run at all??

Zer0-Tolerance commented 1 year ago

my patterns consist of IP addresses , filenames.

lizmat commented 1 year ago

Could you give me an example of the IP address and a filename?

Zer0-Tolerance commented 1 year ago

sure "8.8.8.8" and "test.exe" for example.

lizmat commented 1 year ago

When you have 8.8.8.8 in your regex, you actually mean the string "8.8.8.8", right? But "8a8b8c8" you would not want to match, right?

If that is so, why are you using regexes? Why not just the string "8.8.8.8" (which would do a much cheaper '.contains("8.8.8.8")' on the targets.

Zer0-Tolerance commented 1 year ago

yes but I have also tests?.exe in the same list of pattern.

lizmat commented 1 year ago

ok, then I suggest grouping the ones with wildcards together, and see how that affects the performance. As the logic now will group together any consecutive regexes into a single regex

lizmat commented 3 weeks ago

FWIW, I'm keeping this one open until it's time to rewrite the expression parsing using RakuAST, and optimize from there

lizmat commented 6 days ago

This has now happened with 0.3.4 https://raku.land/zef:lizmat/App::Rak/changes?v=0.3.4 . This now builds a single Callable for all of the patterns found using RakuAST, not doing anything special for regexes yet.

Please confirm whether this makes things better or worse in your use case.