tonytonyjan / jaro_winkler

Ruby & C implementation of the Jaro-Winkler distance algorithm, with support for UTF-8 strings.
MIT License
195 stars, 29 forks

Odd and low distance #13

Closed (mhenrixon closed 6 years ago)

mhenrixon commented 8 years ago

I expected the following to return something higher. Is there anything I can do to adjust this? I would like an adjustment table that lets me ignore accents. I don't mind creating a pull request for it, but I'm not sure whether that is something you would be interested in.

I saw a TODO about custom adjustment tables.

JaroWinkler.distance(
  'Áedán', 'Aedan',
  adj_table: true,
  ignore_case: true,
  weight: 0.2,
  threshold: 0.7
)
# => 0.733333

jogaco commented 8 years ago

You can replace diacritics before computing the distance:

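  # Replaces common accented Latin vowels with their unaccented equivalents.
  # Uppercase accented letters are also lowercased, so pair this with
  # ignore_case: true. Mutates text in place (it raises on a frozen string)
  # and returns it; nil is passed through unchanged.
  def self.replace_diacritics!(text)
    if text
      text.gsub!(/[áÁàÀâÂäÄåÅ]/, 'a')
      text.gsub!(/[éÉèÈêÊëË]/, 'e')
      text.gsub!(/[íÍìÌîÎïÏ]/, 'i')
      text.gsub!(/[óÓòÒôÔöÖ]/, 'o')
      text.gsub!(/[úÚùÙûÛüÜ]/, 'u')
    end
    text
  end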
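
A possible alternative to hand-written character classes, as a minimal sketch: Ruby's built-in `String#unicode_normalize` can decompose accented characters so that the combining marks can be stripped. The helper name `strip_diacritics` is hypothetical and not part of the gem. It assumes UTF-8 input, returns a new string instead of mutating its argument, and preserves case.

```ruby
# Hypothetical helper (not part of jaro_winkler): strip accents by
# decomposing characters and dropping the combining marks.
def strip_diacritics(text)
  return text if text.nil?

  # NFD splits "Á" into "A" followed by U+0301 (combining acute accent).
  # \p{Mn} matches nonspacing combining marks, which are then removed.
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

strip_diacritics('Áedán') # => "Aedan"
```

With both inputs passed through such a helper, 'Áedán' and 'Aedan' become identical strings before the distance is computed. Letters that have no canonical decomposition, such as "ø" or "ß", are left unchanged and would still need explicit replacements.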

tonytonyjan commented 7 years ago

Maybe we can make the default adjustment table more complete so that it supports most languages, but I have no idea where to start. Any ideas?
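
One possible starting point, sketched under assumptions: pairs of accented and base characters could be derived from Unicode canonical decomposition rather than listed by hand. The method name `accent_pairs`, the default code point range (Latin-1 Supplement letters through Latin Extended-B), and the `[accented, base]` pair layout are all hypothetical here; this is not the gem's internal adjustment-table format.

```ruby
# Hypothetical sketch: collect [accented, base] pairs for a code point range
# by decomposing each character and dropping its combining marks.
def accent_pairs(range = 0x00C0..0x024F)
  range.each_with_object([]) do |codepoint, pairs|
    char = [codepoint].pack('U') # UTF-8 string for this code point
    base = char.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
    # Keep only characters that reduce to one different base character.
    pairs << [char, base] if base != char && base.length == 1
  end
end

pairs = accent_pairs
pairs.include?(['Á', 'A']) # => true
pairs.include?(['é', 'e']) # => true
```

Characters without a canonical decomposition (for example "×" or "ø") produce no pair, so a table built this way would cover accents only and would need manual additions for other look-alike letters.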