Search hangs when searching a repeated substring with wildcard

olivernn / lunr.js

A bit like Solr, but much smaller and not as bright

http://lunrjs.com

MIT License

8.96k stars 548 forks source link

Search hangs when searching a repeated substring with wildcard #270

Closed meltuhamy closed 6 years ago

meltuhamy commented 7 years ago

See the jsFiddle here: https://jsfiddle.net/bwe0f44g/

var idx = lunr(function () {
  this.ref('name')
  this.field('text')
  this.add({
  "name": "long",
  "text": "Zlah_testZlah_Zlah_testZlah_testZlah_testZlah_testZlah_testZlah_testZlah_testZlah_testZlah"
});
})

window.search = function search(){
    var t0 = performance.now();
  var result = idx.search('*Zlah*');
  var t1 = performance.now();
  alert('Found '+result.length+' results in ' + (t1-t0) + 'ms');
}

meltuhamy commented 7 years ago

It seems that if the search includes a * at the beginning of the query, the search hangs.

idx.search('*Zlah*') // hangs
idx.search('Zlah*') // ok

olivernn commented 7 years ago

On my machine that search eventually completes after 12 seconds, something is clearly not right though.

The test data looks fairly pathological. Out of interest, how did you discover it?

My guess is that the leading wildcard is causing many iterations through the graph that represents the token, I'll need to take a closer look with a debugger to see where its getting stuck though.

meltuhamy commented 7 years ago

I'm using wildcards to allow substring searches. If there are a lot of repeated items in the result, it will take longer and longer to perform the search. For now, I'm using my own pipeline function to expand all tokens into their substrings so the user never needs to use a wildcard character (I'm using lunr in the context of a loose auto complete search scenario).

olivernn commented 7 years ago

Searches with leading wildcards are considerably more expensive, that said, I wouldn't expect it to take this long. I'll have a look through the relevant code but its a bit involved and I wrote it over a year ago now.

Perhaps there is a better way of implementing what you are trying to achieve though, can you give an example of the kind of documents you are searching within, and the kind of query you are trying? Is string in the document multiple tokens merged together into a string? If I understood what you're trying to achieve better I might be able to suggest a more performant way of implement it.

meltuhamy commented 7 years ago

I'm doing a search auto complete feature where it finds all documents which have a substring of the search term. e.g.

the words

hello world
blah
word
nice
dice

and the search term d

should get the results

hello world
word
dice

because d is a substring of the two reuslts.

To begin with, I was able to achieve this using the wildcard method (i.e. search for *d*) but found the issue above and performance problems.

I'm open to ideas as to what's a better way of doing it, but for now I have a working solution which is a pipeline function to expand all terms to their subsets.

olivernn commented 7 years ago

If you have a solution that is working then thats great. I'll still spend a bit of time trying to understand the particular case you originally posted, perhaps there is some optimisation or bug causing the long run time.

meltuhamy commented 7 years ago

Yes it seems that is quite a nasty bug that is preventing me from using wildcards. Thanks for your help!

nerumo commented 7 years ago

@meltuhamy how exactly did you do the workaround with the pipeline function? I'm afraid that expanding every keyword to subsets would make my index burst.

I currently had to disable leading wildcards because of this issue (>20s with leading wildcard compared to <400ms without).

olivernn commented 6 years ago

I've pushed 2.3.3 which should resolve the long search times with some wildcard searches, let me know if it solves the issue, thanks.

meltuhamy commented 6 years ago

Confirmed this fixes it. Thanks @olivernn !