non-bitap search returning false positives?

danielfdickinson commented 4 years ago

I'm not sure the test I added located at https://github.com/krisk/Fuse/blob/31fc5b7c963149d568599270b0783cd03db7e7b1/test/fuzzy-search.test.js#L708 is actually quite right.

Should a non-bitap search be returning three matches for this, even with a threshold that is less than the score for the closest match (the apple pie item) but more than the score for the other two matches? Because that is what happens at the moment.

What is the expected behaviour?

krisk commented 4 years ago

The matches are pretty horrid, but the score checks out, so your test is correct:

[
  {
    item: 'Apple pie is a tasty treat that is always best made by mom! But we love store bought too.',
    refIndex: 0,
    matches: [],
    score: 0.784
  },
  {
    item: 'Orange sorbet is just a strange yet satisfying snack.  Chocolate seems to be more of a favourite though.',
    refIndex: 2,
    matches: [],
    score: 0.949685534591195
  },
  {
    item: 'Banana splits are what you want from DQ on a hot day.  But a parfait is even better.',
    refIndex: 1,
    matches: [],
    score: 0.9574468085106383
  }
]

Now that I think about it, I realize that for long patterns, I don't take into consideration the threshold. If I include the threshold, the result will be reduced to the first entry, since it's the only one whose score is less than 0.8 (the threshold in the test case).

I'll add it.

Note that for long patterns, Fuse.js does an ngram-search (as opposed to bitap-search), since if the user is going to type that much, then it means that we can be less forgiving with the searching (i.e, it doesn't have to be as fuzzy). This also helps with performance.

Thanks for flagging this!

krisk commented 4 years ago

Also:

[
  'Apple pie is a tasty treat that is always best made by mom! But we love store bought too.',
  'Banana splits are what you want from DQ on a hot day.  But a parfait is even better.',
  'Orange sorbet is just a strange yet satisfying snack.  Chocolate seems to be more of a favourite though.'
]

Well done 🤣

danielfdickinson commented 4 years ago

Thanks! It's trickier than one might think to come up with text that 'looks natural' but has all the desired characteristics.

Also, thank you for all your hard work on Fuse. Your quick responses and active updates are one reason I quickly abandoned my repo https://github.com/cshore-history/plainfuse based on Fuse only rewritten as ES5 with no webpack etc. The other reason I stopped is that I did some more reading and learning about webpack and the modern ES6 ecosystem and decided that the ES5 rewrite was unnecessary, especially once i figured out how to create unminified Fuse so I could at least do a skim-level audit (call me paranoid if you wish) of the output.

If you want to take a peek at a näive Byteap (vs. Bitap) implementation you can look it https://github.com/cshore-history/plainfuse/tree/experimental-byteap. I suspect it's rather terrible on memory consumption, but if Javascript is one those interpreted languages that takes a performance hit on bitwise operations it could be (made) faster than bitap. (Bitap looks like an algorithm originally written for C and converted to Javascript, but I could be wrong).

Anyway if you're curious what experienced non-Javascript programmer came up with for an ES5 algorithm that sheds bitap's machine word limitation, when doing a quick hack, you're welcome to take a look. I figure the NGram search is probably better (it's got a Jaccard distance; how can you go wrong :wink: ).

Cheers!

krisk commented 4 years ago

Will definitely take a look! Thanks @cshoredaniel!

krisk / Fuse

non-bitap search returning false positives? #362