lizmat / App-Rak

21st century grep / find / ack / ag / rg on steroids
Artistic License 2.0

Does not complete in time when the needle is large #21

Closed samebchase closed 1 year ago

samebchase commented 1 year ago

Hey,

I've been trying out rak as a replacement for rg etc. and putting it through its paces.

On some folders I have lying around, it is working fast enough for small needles, but when the needle is large it takes too long.

Large needle.

❯ time rak '/ abcdefghijklmnop /'
^C
________________________________________________________
Executed in   69.19 secs    fish           external
   usr time   71.08 secs  126.00 micros   71.08 secs
   sys time    0.50 secs  749.00 micros    0.50 secs

Small needle.

❯ time rak '/ a /'
________________________________________________________
Executed in    1.53 secs    fish           external
   usr time    3.32 secs  110.00 micros    3.32 secs
   sys time    0.23 secs  685.00 micros    0.23 secs

And an unfair comparison with rg just for kicks.

❯ time rg abcdefghijklmnop

________________________________________________________
Executed in   24.45 millis    fish           external
   usr time   22.26 millis   93.00 micros   22.17 millis
   sys time   33.71 millis  618.00 micros   33.09 millis

Unfortunately, I cannot share the data it was being run on. Will try and see if I can replicate it with other data.

I could spend some more time checking this after digging through the code. At this point, I'm not sure if this is an issue with rak or the Raku regex engine. I'd imagine for a simple string the regex engine should go into the usual string search algos. Not sure why it is taking so long.

lizmat commented 1 year ago

Thanks for trying rak!

I think you're comparing apples with oranges here, if I understand the rg syntax correctly as looking for a literal string.

Could you try time rak abcdefghijklmnop? That should be a lot faster.

Regexes are notoriously slow in Raku, unfortunately. It does not recognize that you're looking for a literal string. Also, it does searches on a grapheme level (which the literal string search in Raku also does, by the way).
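As a loose analogy for the regex versus literal distinction, here is a minimal sketch using standard grep rather than rak. It only shows that the same pattern text can be handled either as a regex or as a fixed string (grep's -F); it says nothing about how rak or the Raku regex engine work internally, and the sample data is made up for illustration.

```shell
# Hypothetical sample data: one line with the literal text "a.c", one with "abc".
tmp=$(mktemp)
printf 'a.c\nabc\n' > "$tmp"

# Pattern treated as a regex: "." matches any character, so both lines match.
regex_hits=$(grep -c 'a.c' "$tmp")

# Pattern treated as a fixed string (-F): only the line containing "a.c" matches.
literal_hits=$(grep -F -c 'a.c' "$tmp")

echo "regex: $regex_hits, literal: $literal_hits"
rm -f "$tmp"
```

A fixed-string search can skip the regex machinery entirely, which is the effect the bare-string form of the rak command is meant to get.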

Also: how many cores do you have available there? I note that the small needle used about 2x as much CPU as wallclock, but the large needle did not. I wonder why.
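The CPU versus wallclock observation can be checked with a bit of arithmetic. This is a small sketch that divides usr time by elapsed time for the two rak runs reported above; the helper name cpu_ratio is made up for illustration, and the large-needle run was interrupted with ^C, so its figures are partial.

```shell
# usr and wall-clock seconds copied from the two rak runs in the report.
# cpu_ratio is a hypothetical helper: usr time divided by elapsed time.
cpu_ratio() {
  LC_ALL=C awk -v usr="$1" -v wall="$2" 'BEGIN { printf "%.2f", usr / wall }'
}

small=$(cpu_ratio 3.32 1.53)    # small needle: '/ a /'
large=$(cpu_ratio 71.08 69.19)  # large needle: '/ abcdefghijklmnop /' (interrupted)

echo "small needle: ${small}x, large needle: ${large}x"
```

A ratio near 2 suggests roughly two cores were busy on average, while a ratio near 1 suggests the run was effectively single-threaded.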

lizmat commented 1 year ago

Also, if you just want to know the files in which the matches occur, you could also try --per-file. This would remove the overhead of splitting into lines at the expense of needing the whole file in memory always. YMMV.

lizmat commented 1 year ago

I just found a bug in the default file selection logic, which basically caused it to search all files, instead of just the ones with known extensions. This could also be a reason for the difference in performance that you saw.

Just uploaded version 0.1.10 with a fix.

samebchase commented 1 year ago

@lizmat oh wow, let me try out the new one and report back.

Zer0-Tolerance commented 1 year ago

Hi, I just did a quick comparison as well, on a 250 MB text file containing only IPv4 addresses:

time rak 111 /ipv4/IPv4-3x-9x.txt => 60 secs
time rg 111 /ipv4/IPv4-3x-9x.txt => 4 secs

lizmat commented 1 year ago

15x slower. On a single file. I'd say that is pretty good considering that rak is doing grapheme based searches.

Could you try with --encoding=latin1?
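To illustrate why the decoding step can matter, here is a small sketch of a sequence where bytes and graphemes differ. The letter "e" followed by a combining acute accent takes three bytes in UTF-8. Under Raku's NFG semantics that sequence is a single grapheme, whereas a latin1 decode maps each byte to one character with no normalization. The snippet only counts the bytes; the grapheme count is stated from Raku's documented behavior, not computed here.

```shell
# "e" (0x65) followed by U+0301 COMBINING ACUTE ACCENT (0xCC 0x81 in UTF-8),
# written with octal escapes so the script itself stays plain ASCII.
bytes=$(printf '\145\314\201' | wc -c | tr -d ' ')

echo "bytes: $bytes"
```

A file containing only IPv4 addresses is pure ASCII, so both decodings yield the same text there; the difference is in the work done while decoding.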

lizmat commented 1 year ago

Also, I'd like to see the full time output, including CPU used :-)

Zer0-Tolerance commented 1 year ago

here you go:

time rak 111 /ipv4/IPv4-3x-9x.txt
________________________________________________________
Executed in   61,59 secs    fish           external
   usr time   59,67 secs  308,00 micros   59,67 secs
   sys time    1,32 secs  870,00 micros    1,32 secs

time rak 111 --encoding=latin1 /ipv4/IPv4-3x-9x.txt
________________________________________________________
Executed in   30,41 secs    fish           external
   usr time   28,43 secs    0,31 millis   28,43 secs
   sys time    1,00 secs    1,36 millis    0,99 secs

lizmat commented 1 year ago

Nice, so in that case only 7x as slow :-)

I guess if people don't like the NFG semantics, they could add --encoding=latin1 as default.
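The slowdown factors quoted in this exchange can be reproduced from the reported timings. This is a rough sketch: the rg figure is the approximate "4 secs" from the earlier comment, the comma decimals from the fish output are rewritten with dots, and the helper name ratio is made up for illustration.

```shell
# ratio is a hypothetical helper: first argument divided by second, one decimal.
ratio() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", a / b }'
}

utf8=$(ratio 61.59 4)       # default decoding vs rg (rg time is approximate)
latin1=$(ratio 30.41 4)     # --encoding=latin1 vs rg
gain=$(ratio 61.59 30.41)   # speedup from switching the encoding

echo "default: ${utf8}x slower, latin1: ${latin1}x slower, latin1 gain: ${gain}x"
```

Because the rg time is only given to the nearest second, the first two figures are ballpark values.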

Zer0-Tolerance commented 1 year ago

rak 111 /ipv4/IPv4-3x-9x.txt  66,50s user 1,48s system 98% cpu 1:08,90 total
rak 111 --encoding=latin1 /ipv4/IPv4-3x-9x.txt  27,53s user 0,86s system 97% cpu 29,143 total

I'm not sure I used the time command you were expecting. This output is from zsh on macOS, on a 2019 quad-core MacBook Pro.

lizmat commented 1 year ago

can this be closed now?

Zer0-Tolerance commented 1 year ago

For me, yes, but I don't own this issue.