jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License

Wrong soundex for "Ashcroft"? #83

Closed jmcomets closed 4 years ago

jmcomets commented 7 years ago

TL;DR: I believe soundex('ashcroft') should return 'A261'.

I was implementing my version of soundex here, using jellyfish as my baseline for comparisons, when I stumbled on a surprising difference. Here's what jellyfish returns when computing the soundex for "Ashcroft":

>>> import jellyfish
>>> jellyfish.soundex('ashcroft')
'A226'

Here's some documentation I found concerning this value, taken from Wikipedia:

Using this algorithm, [...] "Ashcraft" and "Ashcroft" both yield "A261" and not "A226" (the chars 's' and 'c' in the name would receive a single number of 2 and not 22 since an 'h' lies in between them).

While this paragraph is unclear on which value is correct, another passage on the same page states the rule explicitly:

two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice

This leads me to believe that the soundex returned should be A261 and not A226, as explained in the previous quote. The issue can likely be solved by patching cjellyfish to skip H and W when removing adjacent soundex digits.
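To make the rule concrete, here is a minimal pure-Python sketch of Soundex with the H/W rule quoted above. It is not jellyfish's code or the actual cjellyfish patch; the name soundex_sketch and the _CODES table are hypothetical names chosen for illustration, and the handling of non-letters and empty input is my own assumption.

```python
# Illustrative sketch only: NOT jellyfish's implementation.
# soundex_sketch and _CODES are hypothetical names.

# Standard Soundex digit groups: 1=BFPV, 2=CGJKQSXZ, 3=DT, 4=L, 5=MN, 6=R.
_CODES = {}
for digit, group in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for letter in group:
        _CODES[letter] = str(digit)


def soundex_sketch(name: str) -> str:
    # Assumption: ignore anything that is not an ASCII letter.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""  # assumption: empty input yields an empty code

    result = letters[0]
    # Code of the most recent letter that was not H or W ("" for vowels and Y).
    prev = _CODES.get(letters[0], "")

    for c in letters[1:]:
        if c in "HW":
            # H and W are skipped without resetting prev, so equal codes
            # on either side of them collapse into a single digit.
            continue
        code = _CODES.get(c, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        # Vowels reset prev to "", so equal codes separated by a vowel
        # are emitted twice.
        prev = code

    return result.ljust(4, "0")


print(soundex_sketch("ashcroft"))  # A261
print(soundex_sketch("Tymczak"))   # T522
```

With this rule, "Ashcraft" and "Ashcroft" both come out as A261, while "Tymczak" still yields T522 because its two 2-coded letters around the vowel are coded separately.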

CrowbarKZ commented 4 years ago

Just ran into this as well. Seems like an old bug.