seamusabshere / fuzzy_match

Find a needle (a document or record) in a haystack using string similarity and (optionally) regular expression rules. Uses Dice's Coefficient (aka Pair Similarity) and Levenshtein Distance internally.
MIT License

Unexpected results when specifying stop words as Regexps #20

Open rob99 opened 8 years ago

rob99 commented 8 years ago
FuzzyMatch.new(['AAI Limited', 'LITED'], :stop_words=>['limited']).find('AAI Limited')
=> "AAI Limited"  # good

FuzzyMatch.new(['AAI Limited', 'LITED'], :stop_words=>[/limited/i]).find('AAI Limited')
=> "LITED"  # bad

I would expect the same result in either case, given the absence of special characters in the regexp.

rob99 commented 8 years ago

Also found:

FuzzyMatch.new(['AAI Limited', 'LITED'], :stop_words=>[/limited/]).find('AAI Limited')
=> "AAI Limited"

So the case-insensitive modifier seems to be having an undesirable impact...

rlue commented 7 years ago

TL;DR: This was fixed as of 4f914f2 (7/20/2015), but appears not to have been updated on rubygems. To use the most recent version, try the following line in your Gemfile:

gem 'fuzzy_match', :git => 'https://github.com/seamusabshere/fuzzy_match.git'

To be clear, the issue is not with the /i Regexp flag. Rather, [/limited/i] is the only version of the stop word that works! You can try it yourself:

pry(main)> FuzzyMatch.new(['AAI Limited', 'LITED'], stop_words: [/limited/i])
=> #<FuzzyMatch:0x007fd394393388
 ... @haystack=[w("AAI"), w("LITED")], ... >

vs

pry(main)> FuzzyMatch.new(['AAI Limited', 'LITED'], stop_words: [/limited/])
=> #<FuzzyMatch:0x007fd394393388
 ... @haystack=[w("AAI Limited"), w("LITED")], ... >

The problem is that when you call #find('AAI Limited'), the old version (the one on rubygems) filters the stop word out of the ‘haystack’ only, and not out of the ‘needle’. Thus, applying the stop word makes it search for ‘AAI Limited’ in ‘AAI’ / ‘LITED’, when it should be searching for just ‘AAI’.
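
To make the mismatch concrete, here is a rough sketch in plain Ruby that does not use the gem at all. The helpers strip_stop_words, bigrams, dice and find are hypothetical names made up for this illustration, and the scoring is a bare Dice's coefficient over character pairs; fuzzy_match's real scoring also involves Levenshtein distance and other rules, so treat this only as an approximation of why the unstripped needle drifts toward 'LITED'.

```ruby
# Hypothetical helpers for illustration only; none of this is fuzzy_match's API.

# Remove every stop-word match, then tidy up leftover whitespace.
def strip_stop_words(str, stop_words)
  stop_words.reduce(str) { |s, re| s.gsub(re, ' ') }.squeeze(' ').strip
end

# Character bigrams of each whitespace-separated word, lowercased.
def bigrams(str)
  str.downcase.split.flat_map { |word| word.chars.each_cons(2).map(&:join) }
end

# Dice's coefficient over bigram multisets: 2 * shared / (|a| + |b|).
def dice(a, b)
  pa = bigrams(a)
  pb = bigrams(b)
  return 0.0 if pa.empty? || pb.empty?

  remaining = pb.dup
  shared = pa.count do |pair|
    idx = remaining.index(pair)
    idx && remaining.delete_at(idx) # consume each matched pair once
  end
  2.0 * shared / (pa.size + pb.size)
end

# Stop words are always stripped from the records (the haystack);
# strip_needle toggles whether the query gets the same treatment.
def find(needle, records, stop_words, strip_needle:)
  query = strip_needle ? strip_stop_words(needle, stop_words) : needle
  records.max_by { |record| dice(query, strip_stop_words(record, stop_words)) }
end

STOP_WORDS = [/limited/i].freeze
RECORDS = ['AAI Limited', 'LITED'].freeze

# Haystack-only stripping (the behaviour described for the rubygems release).
OLD_RESULT = find('AAI Limited', RECORDS, STOP_WORDS, strip_needle: false)
# Stripping both sides (the behaviour described for the fixed version).
NEW_RESULT = find('AAI Limited', RECORDS, STOP_WORDS, strip_needle: true)

puts "haystack only: #{OLD_RESULT}" # => LITED
puts "both sides:    #{NEW_RESULT}" # => AAI Limited
```

In this sketch the unstripped needle 'AAI Limited' shares only two pairs with 'AAI' but four with 'LITED' (li, it, te, ed), so 'LITED' scores higher (about 0.67 vs 0.4). Once the needle is reduced to 'AAI' as well, it matches the stripped record exactly.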