lizmat / App-Rak

21st century grep / find / ack / ag / rg on steroids
https://raku.land/zef:lizmat/App::Rak
Artistic License 2.0

Does not complete in time when the needle is large #21

Closed samebchase closed 2 years ago

samebchase commented 2 years ago

Hey,

I've been trying out rak as a replacement for rg etc. and putting it through its paces.

On some folders I have lying around, it is working fast enough for small needles, but when the needle is large it takes too long.

Large needle.

❯ time rak '/ abcdefghijklmnop /'
^C
________________________________________________________
Executed in   69.19 secs    fish           external
   usr time   71.08 secs  126.00 micros   71.08 secs
   sys time    0.50 secs  749.00 micros    0.50 secs

Small needle.

❯ time rak '/ a /'
________________________________________________________
Executed in    1.53 secs    fish           external
   usr time    3.32 secs  110.00 micros    3.32 secs
   sys time    0.23 secs  685.00 micros    0.23 secs

And an unfair comparison with rg just for kicks.

❯ time rg abcdefghijklmnop

________________________________________________________
Executed in   24.45 millis    fish           external
   usr time   22.26 millis   93.00 micros   22.17 millis
   sys time   33.71 millis  618.00 micros   33.09 millis

Unfortunately, I cannot share the data it was being run on. Will try and see if I can replicate it with other data.

I could spend some more time checking this after digging through the code. At this point, I'm not sure if this is an issue with rak or the Raku regex engine. I'd imagine for a simple string the regex engine should go into the usual string search algos. Not sure why it is taking so long.
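To make the "usual string search" idea concrete, here is a minimal Python sketch (not Raku, and not a description of how rak or the Raku regex engine actually behaves). It treats a pattern without regex metacharacters as a literal and only falls back to the regex engine otherwise. The names `is_literal` and `search_lines`, and the metacharacter set, are assumptions made for this example.

```python
import re

# Characters with special meaning in Python's `re` syntax.
# This set is an assumption for the sketch, not taken from rak.
_META = set(r".^$*+?{}[]\|()")

def is_literal(pattern: str) -> bool:
    """Return True if the pattern contains no regex metacharacters."""
    return not any(ch in _META for ch in pattern)

def search_lines(pattern: str, lines):
    """Yield (line_number, line) for every line that matches the pattern."""
    if is_literal(pattern):
        # Fast path: plain substring search, no regex engine involved.
        matches = lambda line: pattern in line
    else:
        compiled = re.compile(pattern)
        matches = lambda line: compiled.search(line) is not None
    for number, line in enumerate(lines, start=1):
        if matches(line):
            yield number, line

lines = ["xx abcdefghijklmnop yy", "nothing here", "abc"]
print(list(search_lines("abcdefghijklmnop", lines)))  # literal fast path
print(list(search_lines(r"abc$", lines)))             # regex fallback
```

Both paths return the same kind of result, so a caller would not need to know which one was taken.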

lizmat commented 2 years ago

Thanks for trying rak!

I think you're comparing apples with oranges here, if I understand the rg syntax correctly as looking for a literal string.

Could you do a time rak abcdefghijklmnop ? That should be a lot better.

Regexes are notoriously slow in Raku, unfortunately. The regex engine does not recognize that you're looking for a literal string. Also, it searches at the grapheme level (which the literal string search in Raku also does, by the way).
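For readers unfamiliar with what a grapheme-level search means, here is a rough Python sketch of the difference from a codepoint-level search. It is an approximation only: the hypothetical `clusters` helper attaches combining marks to the preceding base character and NFC-normalizes each cluster, which covers far fewer cases than real Unicode grapheme segmentation, and it is not a model of Raku's implementation.

```python
import unicodedata

def clusters(text: str):
    """Split text into approximate graphemes: a base character plus any
    combining marks that follow it, each normalized to NFC.
    Real grapheme segmentation has more rules; this is a simplification."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return [unicodedata.normalize("NFC", c) for c in out]

def contains_codepoints(haystack: str, needle: str) -> bool:
    """Codepoint-level search: an ordinary substring test."""
    return needle in haystack

def contains_graphemes(haystack: str, needle: str) -> bool:
    """Grapheme-level search: the needle must line up with whole clusters."""
    h, n = clusters(haystack), clusters(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))

# "café" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT
word = "cafe\u0301"
print(contains_codepoints(word, "e"))       # the bare 'e' codepoint is present
print(contains_graphemes(word, "e"))        # but no whole grapheme equals "e"
print(contains_graphemes(word, "\u00e9"))   # the precomposed form matches
```

The extra segmentation and normalization work is the kind of cost a byte-oriented tool does not have to pay.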

Also: how many cores do you have available there? I note that the small needle used about 2x as much CPU as wallclock, but the large needle did not. I wonder why.
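The observation about CPU versus wallclock can be checked with a few lines of Python, using the numbers from the fish `time` output earlier in this thread. The helper name `cpu_to_wall` is made up for this sketch.

```python
def cpu_to_wall(usr: float, sys: float, wall: float) -> float:
    """Ratio of total CPU time to elapsed time; about 1.0 means one busy core."""
    return (usr + sys) / wall

# Figures copied from the two rak runs reported above (seconds).
small = cpu_to_wall(usr=3.32, sys=0.23, wall=1.53)
large = cpu_to_wall(usr=71.08, sys=0.50, wall=69.19)
print(f"small needle: {small:.2f}x, large needle: {large:.2f}x")
```

A ratio well above 1 suggests work spread over several cores, while a ratio close to 1 suggests the run was effectively single-threaded.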

lizmat commented 2 years ago

Also, if you just want to know the files in which the matches occur, you could also try --per-file. This would remove the overhead of splitting into lines at the expense of needing the whole file in memory always. YMMV.
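The trade-off described here can be sketched in Python as two search functions: one that walks the file line by line and one that reads the whole file into memory and only answers yes or no. This is an illustration of the general idea, not rak's code; `matching_lines` and `file_matches` are hypothetical names.

```python
import os
import tempfile

def matching_lines(path: str, needle: str):
    """Line-oriented search: return (line_number, text) for each match."""
    with open(path, encoding="utf-8") as handle:
        return [(n, line.rstrip("\n"))
                for n, line in enumerate(handle, start=1) if needle in line]

def file_matches(path: str, needle: str) -> bool:
    """Whole-file search: read everything at once, no line splitting."""
    with open(path, encoding="utf-8") as handle:
        return needle in handle.read()

with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("first line\nabcdefghijklmnop here\nlast line\n")
    per_line = matching_lines(sample, "abcdefghijklmnop")
    per_file = file_matches(sample, "abcdefghijklmnop")
    print(per_line, per_file)
```

The whole-file variant does less bookkeeping but holds the entire file in memory, and it cannot report which line matched.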

lizmat commented 2 years ago

I just found a bug in the default file selection logic, which basically caused it to search all files, instead of just the ones with known extensions. This could also be a reason for the difference in performance that you saw.
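As a rough picture of what extension-based file selection looks like, here is a small Python sketch. The extension set and the sample paths are hypothetical; rak's real list of known extensions is not shown in this thread.

```python
from pathlib import PurePath

# Hypothetical set of "known" extensions, chosen only for this example.
KNOWN_EXTENSIONS = {".raku", ".rakumod", ".txt", ".md"}

def select_files(paths, known=KNOWN_EXTENSIONS):
    """Keep only paths whose extension (case-insensitive) is in the known set."""
    return [p for p in paths if PurePath(p).suffix.lower() in known]

candidates = ["lib/Foo.rakumod", "README.md", "data/dump.bin", "core.12345", "notes.TXT"]
print(select_files(candidates))
```

If the filter is skipped, large binary files such as `data/dump.bin` in this example would be searched too, which could plausibly account for a big difference in run time.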

Just uploaded version 0.1.10 with a fix.

samebchase commented 2 years ago

@lizmat oh wow, let me try out the new one and report back.

Zer0-Tolerance commented 2 years ago

Hi, I just did a quick comparison as well, on a 250 MB text file containing only IPv4 addresses:

time rak 111 /ipv4/IPv4-3x-9x.txt => 60 secs
time rg 111 /ipv4/IPv4-3x-9x.txt => 4 secs

lizmat commented 2 years ago

15x slower. On a single file. I'd say that is pretty good considering that rak is doing grapheme based searches.

Could you try with --encoding=latin1?
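Some background on why a latin1 decode can be cheaper, shown in Python rather than Raku: in Latin-1 every byte maps directly to one character, so there are no multi-byte sequences to validate or combine. Whether this is the whole story for Raku's decoder is an assumption; the sketch only demonstrates the encoding property itself.

```python
# Sample data: one ASCII line and one line with a non-ASCII character.
data = "10.0.0.111\ncaf\u00e9\n".encode("utf-8")

as_utf8 = data.decode("utf-8")      # multi-byte sequences are validated and combined
as_latin1 = data.decode("latin-1")  # every byte maps directly to one character

# Byte count, UTF-8 character count, Latin-1 character count.
print(len(data), len(as_utf8), len(as_latin1))

# An ASCII needle is found either way, because ASCII bytes are identical in both.
print("111" in as_utf8, "111" in as_latin1)
```

For pure ASCII input such as a list of IPv4 addresses, both decodings yield the same text, so search results should not change.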

lizmat commented 2 years ago

Also, I'd like to see the full time output, including CPU used :-)

Zer0-Tolerance commented 2 years ago

here you go:

time rak 111 /ipv4/IPv4-3x-9x.txt
________________________________________________________
Executed in   61,59 secs    fish           external
   usr time   59,67 secs  308,00 micros   59,67 secs
   sys time    1,32 secs  870,00 micros    1,32 secs
time rak 111 --encoding=latin1 /ipv4/IPv4-3x-9x.txt
________________________________________________________
Executed in   30,41 secs    fish           external
   usr time   28,43 secs    0,31 millis   28,43 secs
   sys time    1,00 secs    1,36 millis    0,99 secs

lizmat commented 2 years ago

Nice, so in that case only 7x as slow :-)
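The slowdown factors quoted in this thread follow from simple division. The Python sketch below uses the fish timings reported above and the rough 4 second rg figure from the earlier comment, so the results are approximate; `slowdown` is a name made up for the example.

```python
RG_SECS = 4.0  # rough rg figure quoted earlier in the thread, not a precise measurement

def slowdown(rak_secs: float, baseline: float = RG_SECS) -> float:
    """How many times slower a rak run was than the rg baseline."""
    return rak_secs / baseline

utf8_ratio = slowdown(61.59)    # default encoding run
latin1_ratio = slowdown(30.41)  # --encoding=latin1 run
print(f"default: {utf8_ratio:.1f}x, latin1: {latin1_ratio:.1f}x")
```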

I guess if people don't like the NFG semantics, they could add --encoding=latin1 as default.

Zer0-Tolerance commented 2 years ago

rak 111 /ipv4/IPv4-3x-9x.txt  66,50s user 1,48s system 98% cpu 1:08,90 total
rak 111 --encoding=latin1 /ipv4/IPv4-3x-9x.txt  27,53s user 0,86s system 97% cpu 29,143 total

Not sure I used the time command you were expecting; this one was run from zsh on macOS, on a 2019 quad-core MacBook Pro.

lizmat commented 2 years ago

can this be closed now?

Zer0-Tolerance commented 2 years ago

For me, yes, but I don't own this issue.