threedaymonk / text

Collection of text algorithms. gem install text
Other
586 stars 42 forks source link

Soundex implementation broken #19

Closed yorickpeterse closed 10 years ago

yorickpeterse commented 10 years ago

It appears that the Soundex implementation is broken due to two problems:

  1. Whitespace in the input causes Text::Soundex.soundex to always return nil
  2. The length of the returned soundex codes is not correct.

The first problem can be reproduced as following:

require 'text'

p Text::Soundex.soundex('San Francisco') # => nil

The second problem can be reproduced as following:

require 'text'

p Text::Soundex.soundex('SanFranciscoooooooblalalalalal') # => "S516"

If one were to use the SOUNDEX() function provided by MySQL they would get the following instead:

mysql> SELECT SOUNDEX('SanFranciscoooooooblalalalalal');
+-------------------------------------------+
| SOUNDEX('SanFranciscoooooooblalalalalal') |
+-------------------------------------------+
| S5165214                                  |
+-------------------------------------------+
1 row in set (0.00 sec)

Perhaps this is intended behaviour. In that case it should be documented somewhere.

yorickpeterse commented 10 years ago

Derp. Seems MySQL has an extra feature of sorts to return longer strings. In that case I would consider the second "bug" a "nice to have" instead.

threedaymonk commented 10 years ago

Re 1: This is as documented, though the documentation isn't very good and I agree that this behaviour is probably counter-intuitive.

I think it would make more sense if it processed only A-Z in the input, but that could be a breaking change for any applications that rely on the current behaviour (though I'd be surprised if there are any).

Re 2: As you've noted, MySQL is not standard. Soundex is supposed to generate 4-element codes. I'm not sure if that's really nice to have, or just buggy!

If I were to make any significant changes to Soundex, I think I'd want to add support for various dialects: Russell; Knuth; US government; Daitch-Mokotoff. On the other hand, I wonder how useful this is given that Metaphone gives better results for this application.

yorickpeterse commented 10 years ago

For me it's not super critical as I've since discovered the (double) metaphone implementation which worked better for my use case. Documentation wise, I completely missed that :/

threedaymonk commented 10 years ago

I'll close this, then, and make a note to improve the documentation in future.