junegunn / fzf

:cherry_blossom: A command-line fuzzy finder
https://junegunn.github.io/fzf/
MIT License
65.44k stars 2.41k forks source link

Slow with Arabic (esp vs fzy) #2345

Open mustafa0x opened 3 years ago

mustafa0x commented 3 years ago

Info

Problem / Steps to reproduce

I appreciate that this is not a common use case however I have noticed that fzf is many times slower than fzy when searching Arabic (and fzy's results are quite good).

Dataset

wget https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-abstract1.xml.gz
gunzip arwiki-latest-abstract1.xml.gz

fzf

$> time cat arwiki-latest-abstract1.xml | fzf -f 'هذه تجربة' | wc -l
    6427
real  0m3.425s
user  0m16.836s
sys 0m0.624s

fzy

$> time cat arwiki-latest-abstract1.xml | fzy -e 'هذه تجربة' | wc -l
    6479
real  0m0.389s
user  0m1.352s
sys 0m0.267s

+x +i --algo=v1 --literal really helps, but there remains a big difference

$> time cat arwiki-latest-abstract1.xml | fzf +x +i --algo=v1 --literal -f 'هذه تجربة' | wc -l
    6427

real    0m1.925s
user    0m2.761s
sys 0m0.494s
junegunn commented 3 years ago

Thanks for the report.

I'm not familiar with the implementation of fzy, so I can't tell where the performance difference stems from, but here are a few observations.


Looks like most of the time is spent loading up the list. cat arwiki-latest-abstract1.xml | fzf and you'll see that it takes a couple of seconds until the loading is complete. On the other hand, fzy finishes loading much faster. I think the major difference comes from the fact that fzf was inherently designed to support streaming input, while fzy blocks until the input command completes.

for i in $(seq 100); do echo $i; sleep 0.1; done | fzf
for i in $(seq 100); do echo $i; sleep 0.1; done | fzy

find / | fzf
find / | fzy

There is much more work involved to support streaming, so I can imagine fzf is significantly slower than fzy in this scenario.


The search algorithm of fzf is particularly optimized for ASCII data.

Maybe we can apply the same optimization to non-ASCII data when --literal is set, but I haven't tried.

mustafa0x commented 3 years ago

Thank you so much for the reply!

I'm sure streaming is playing a role, but it can't be the major cause, since the difference between fzy and fzf on https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract10.xml.gz is much less drastic.

$> time cat enwiki-latest-abstract10.xml | fzf +x +i --algo=v1 --literal -f 'All_article_disambiguation_pages' | wc -l
    7221
real  0m0.637s
user  0m1.036s
sys 0m0.235s

$> time cat enwiki-latest-abstract10.xml | fzy -e 'All_article_disambiguation_pages' | wc -l
    7222
real  0m0.335s
user  0m0.911s
sys 0m0.238s

Note: +x +i --algo=v1 --literal didn't make a substantial difference with English data, while with Arabic data reduced the time taken several times over.

junegunn commented 3 years ago

Nice observation. So while streaming overhead definitely exists, it isn't the major factor in loading performance.

If a line only contains ASCII characters, fzf can load it as a byte array without any processing. But if it contains a non-ASCII character, fzf loads it as an array of 4-byte integers, unpacking Unicode code points from the original byte array. So there is significant memory and computational overhead.

https://github.com/junegunn/fzf/blob/6654239c94667fefb38d76cfc47b6abf5ced8149/src/util/chars.go#L46-L63

Maybe we can somehow avoid the extra processing as much as possible with some tricks. I'll see if there's any room for improvement when I get some time. (FWIW, I mostly use fzf with ASCII input, so this has never been an issue for me.)

mustafa0x commented 3 years ago

Awesome, thank you! Yeah, it's a bit of an untypical use case. However, all the unique features of fzf make it a powerful multi purpose search tool, so this use case might start becoming a bit more common.

kaddkaka commented 2 years ago

Has there been any more insights into this? It sounds interesting as it might affect search speed in Japanese text as well, for example? :)