loilo / Fuse

🔍 Fuzzy search for PHP, ported from Fuse.js
Apache License 2.0
320 stars 30 forks source link

Performance question #18

Closed dsSquirrle closed 5 years ago

dsSquirrle commented 5 years ago

Hello,

I'm not quite sure if this is the right place to ask this, but I have a question about the performance of the search.

I currently have an array of ~81k indexes, with 3 keys each, only searching by one ("Name"). It currently takes ~3.8s to return 0 results (if I search something that doesn't match, and ~4.9s if I search something that does), with these settings: $fuse = new \Fuse\Fuse($locationData, [ "keys" => ["Name"] , "threshold" => 0.2 , "distance" => 10 , "includeScore" => true , "includeMatches" => true ]);

My data (imported as JSON) looks like this: Array ( [0] => Array ( [Name] => ACCOMACK [ListingCount] => 191 [Type] => county ) )

It looks like that all the time simply just comes from the size of the data, as with logging the times, I constantly got ~4s to reach the 9th match in a search. It also looks like that 'Fuse->analyze()' is run twice, as with a simple increment, the 9th match returns that it was the 160663 run of the function. Although, bypassing that (is_string was false, and is_list reactivated analyze as a string) didn't seem to improve the speed at all.

Here's an example of the timings it took to search (only logging when a match was found): http://prntscr.com/n2v14q

Is there a better format I can use for the data, or is this just a limitation of having 81k+ items to search through?

I'm not sure what other information you need, but let me know if I can provide anything else. I don't have any public page set up with this to test it against at the moment, but if you need, I can probably get that set up.

The data that I'm searching through has 5 distinct parts, so I tried also searching through each separately. When logging the time specifically, increased the total time by ~0.3s. When not logging, specifically, reduced the total time by ~0.8s (while also returning 40 more results)

PHP 7.2 Running on a DreamHost server (not sure of specs)

Thanks, Tony.

loilo commented 5 years ago

Hey,

first, thanks for the work you've put into this issue.

I'm afraid I cannot help you directly. Fuse for PHP is a strict 1:1 port of Fuse.js, therefore it inherits all its benefits and limitations. I also don't have any expertise in the computer science behind its search algorithm.

However, since you stated that your data is stored as JSON, you could try to perform the same search with Fuse.js and check on its performance. In my experience, this can be quite a lot faster (because JavaScript VMs are just more advanced than the PHP interpreter I suppose), but it should not be faster by orders of magnitude. If so, this would be most likely a bug in my port.

That said, I could very well imagine that your issues have to do with the size of your data set — fuzzy search over 81k entries sounds like quite a lot of work for the CPU. But again, I'm not an expert on any of this.

So, long story short: I unfortunately cannot help you directly, but I'd advise you to try your search with Fuse.js. If it's a LOT faster, notify me and I'll try to find an according bug in this port. If it's not, feel free to ask for help on performance over at their repository.

If you're not familiar with JavaScript, feel free to hit me up — I'd happily provide you with a binary you can use in the CLI to run a Fuse.js search (essentially a script for Node.js, bundled up with pkg).

dsSquirrle commented 5 years ago

I have tried out the JS version a bit, and the main reason for wanting to try out the PHP version is that it's still a really large dataset to have to send to a user, and this is running on WordPress websites, so no node.js or serverside JS (that I'm aware of).

However, in a quick test, I imported my 81k results into the console on https://fusejs.io/, set up the same options, and did a search. It completed in ~0.284s, so much quicker (and this isn't a good computer) with a 4 character search string. ~0.308s with an 8 character search string, ~0.654s with an 18 character search (2 word) string (with 4 matches).

Here's a sample search from the Fuse.js site: http://prntscr.com/n362ks. Here's the same search in PHP: http://prntscr.com/n364ry (note that ~0.450s is used just as the AJAX call, and ~0.77s to download the content, so still ~14s to search this longer term).

Maybe (as you mentioned) JS is just that much faster than PHP, but I figured you'd be a better source to ask about it than the original Fuse.js team since it does seem more PHP specific in general.

loilo commented 5 years ago

That's in fact a huge discrepancy. However, I could very well imagine this being a JS vs. PHP performance thing, maybe even just in some specific areas relevant to search. At least I couldn't find any hints for a bug on my side. 🤔

dsSquirrle commented 5 years ago

Thanks for looking into it some more, I'm also still looking into it a bit, and looking at just creating my own "fuzzy" search using soundex and levenshtein. I just need to figure out how to score it, but even with using both of those, it's able to compare my string against the 81k results in ~0.4s. It certainly does something, but needs some more logic for "good" searches. Here's what I've set up so far: http://prntscr.com/n3tdox

As for Fuse, here's my current (probably bad) timing logging for the different functions: http://prntscr.com/n3t3lh

As you can (maybe, like I said, probably bad logging) see from the times, "search", "innerSearch", and "analyze" are definitely where the speed decrease is, and "analyze" is being doubled up.

I'm still trying to understand a bit of how this actually works, and it's a bit odd from how I understand it. A search is done, and it goes -> search -> innerSearch -> analyze -> search -> are we now not in a loop? I realize that this is a better question for the Fuse.js team, and they could probably explain it, but for now I'm just trying to narrow down a bit more of where the time comes from, and look at that specifically. If I were smarter I'm sure I could figure it out, but I'm not a developer, so ¯_(ツ)_/¯, we'll see what happens.

I'll update if I find something that I think is wrong, or causing a problem.

loilo commented 5 years ago

For a dataset of 81k records (generated a random one with faker.js), the results are even more extreme for me.

The PHP search takes 13.05 seconds where the Node search is about 0.5.

I'll try to profile the PHP script, but I'm not sure how well it will go, I've got literally zero experience with that.

loilo commented 5 years ago

Okay, so take everything below with a grain of salt — as I said, I'm completely unexperienced on the topic. I think the performance differences here are actually due to performance optimizations Node.js (or rather the V8 engine) does to the code while the PHP runtime does not.

The PHP performance profile indicates that a lot of time has been spent in the score function which is in itself very, very simple and not computation-heavy at all. However, during the search it's called a whopping 914k times:

Screenshot of Fuse search profile

The call count seems to be the crucial factor here. I'll demonstrate this with a small example.

I've written a PHP script that defines a simple function and calls it 100k times.

function inc($num) {
  return $num + 1;
}

$count = 0;
for ($i = 0; $i < 100000; $i++) {
  $count = inc($count);
}

echo $count;

On my machine, this takes 0.11 seconds to run. The equivalent JS running through Node.js takes 0.09 seconds. Nothing fancy so far.

If I increase the call count just by the factor 10 though — from 100,000 to 1,000,000 — the numbers are vastly different: It feels pretty natural that PHP takes roughly 10x the time to run, it's now at 0.84 seconds. The Node.js script however still takes just 0.09s. That means PHP execution times scales linearly with the number of function calls while the execution time for Node.js stays constant.

(Funny enough, this optimization seems to bail out after a certain amount of loop runs. From 10,000,000 calls onward, Node.js scales linearly with the number of calls es well.)

Conclusion: The immense differences we're experiencing with Fuse.js vs Fuse.php actually seem to be owed to the optimization work the JavaScript runtime is able to do while the PHP runtime is not. Given the sheer time and engineering power that is put into V8 by Google, this shouldn't be surprising, however I didn't expect just how extreme this skews the performance in JavaScript's favor.

dsSquirrle commented 5 years ago

Wow, thanks for that, much better information than I was getting haha. I don't know anything about profiling PHP, so I'll look into that as well.

But yeah, I suppose it is mostly just their speed differences, though I feel like it's very odd that Bitap/score.php is called 914k times! I still don't know exactly how Fuse works though, and it does work, so I'm sure there's a reason.

Thanks again for looking into it, I'll just have to try out my own fuzzy search, or I was also looking at lz-string which has a PHP counterpart to compress my 81k results and send it to the browser to use with Fuse.js (which will be a much better experience than having to ajax call back to the site to use something in PHP).

I haven't tried the PHP part yet, but with just the js lz-string, it's able to compress the ~4.6MB JSON to a ~19KB txt file (in ~1.8s). That compression is crazy (~99.6%?), but I can definitely deal with sending that to the browser to parse instead.

I'll close this, and again I appreciate you looking into it.

loilo commented 5 years ago

My pleasure. 🙂

19KB over the wire should be fine. The bigger problem could be weaker CPUs (i.e. cheap smartphones, if your site is supposed to run on those) which have to unpack and index and search all this data again.

That said, I've just cobbled together a very simple levenshtein-based fuzzy search that performs the search on the 81k dataset in 0.37 seconds (as opposed to the 13 seconds for Fuse). It literally has no features beyond fuzzy matching (no preference for exact matches etc.) but it's kinda fast, even in PHP. 😉

function searchByKey($data, $query, $key) {
    return array_values(array_filter($data, function ($item) use ($query, $key) {
        return levenshtein($item[$key], substr($query, 0, strlen($item['Name']))) < 3;
    }));
}

$result = searchByKey($locationData, 'some location, 'Name');

Maybe it can be useful. 🙂

dsSquirrle commented 5 years ago

Thanks, I'll check it out. We'll have to see how it goes with the logic for scoring, and sorting based on quality of a match :p

Also, I think that I'm still going to run into an issue with PHP for the LZCompression... Fatal error: Maximum execution time of 30 seconds exceeded in /LZCompressor/LZUtil.php on line 100

Makes sense, if something in PHP takes 13s, and 0.5s in JS, then something in JS that takes 1.5s probably is going to take 30s+ in PHP (very generally speaking).

ahren-condos-ca commented 4 years ago

Having the same issue but significantly multiplied. Doing very similar name lookup but on nearly 600k users... and lookups are taking +/- 1min 25s each... and i have to process a few thousand rows so even with making these into Jobs... running overnight still only processed 1/3 of my batch (1200ish rows).

Wondering if a work around would be to send me 60M array of usernames to browser and initialize the Fuse library there and then make individual AJAX requests for each row to my laravel backend once name matching is done. Problem with this is if it hit a browser timeout I'm kinda screwed (unless i try implement some "re-try a.k.a continue from last completed row? doesn't seem ideal).

I like the tools results... speed is just killiing me though. Advice?

loilo commented 4 years ago

A data set of 600k entries definitely should be subject to a proper database software. Fuse doesn't even create a search index for the data it holds.

Unfortunately I don't have any proper advice there though. I've even ported Fuse in the first place because existing fuzzy search solutions I've found in database software weren't as sophisticated as I'd have liked them. (Which is pretty natural — the "fuzzier" the search, the harder it is to create a proper search index for it.)

If you're using MySQL and your dataset exclusively contains natural words from the English language, you may try the SOUNDEX approach. My use cases didn't meet these constraints though, therefore I haven't tried this approach and can't tell how well it works.

loilo commented 2 years ago

Coming back to this, just for the record:

I gave the 81k records approach another try, and the situation significantly improved – the search performed in about 550ms. 🥳

This certainly has multiple reasons – I'm on a newer development machine now, I'm running PHP 7.4 – but the reason I tried again is that I just finished the port of Fuse.js v6 which brought great performance improvements that this project benefits from as well. 🙂