jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License
2.04k stars 157 forks

Handle spaces correctly in MRA algorithm #158

Closed: juliangilbey closed this 2 years ago

juliangilbey commented 2 years ago

In https://github.com/jamesturk/jellyfish/blob/e1be2f9055c698ba9e89c588b7ac321f8ff540b1/jellyfish/_jellyfish.py#L342-L347 the comment says that we append the character to the codex if it is not a space OR a starting character and a vowel or ..., but the code appends the character if it is (not a space AND a starting character and a vowel) or (...). So at least one of them is wrong. Judging by the Wikipedia page https://en.wikipedia.org/wiki/Match_rating_approach, both the comment and the code are likely to be wrong. What is probably wanted, interpreting the encoding rules given there, is the following:

The test for vowels is somewhat convoluted (and slightly incorrect) in the Wikipedia description; the above description is slightly simpler.

This patch implements the above description; all of the tests still pass. There is a parallel PR for the cjellyfish implementation.
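
To illustrate the mismatch between comment and code described above, here is a minimal sketch of how Python groups an unparenthesized `not ... and ... or ...` condition. The function names and the four boolean flags (`is_space`, `is_start`, `is_vowel`, `other`) are hypothetical stand-ins, not the actual jellyfish code; `other` stands for the elided "(...)" clause.

```python
from itertools import product


def unparenthesized(is_space, is_start, is_vowel, other):
    # Bare expression: `not` binds tightest, then `and`, then `or`.
    return not is_space and is_start and is_vowel or other


def explicit_grouping(is_space, is_start, is_vowel, other):
    # The grouping Python actually applies to the bare expression above.
    return ((not is_space) and is_start and is_vowel) or other


def space_always_excluded(is_space, is_start, is_vowel, other):
    # Hypothetical alternative: a space is never appended, whatever else holds.
    return (not is_space) and ((is_start and is_vowel) or other)


# Enumerate all 16 combinations of the four flags.
cases = list(product([False, True], repeat=4))

# The bare expression and the explicit grouping agree on every input.
same_parse = all(unparenthesized(*c) == explicit_grouping(*c) for c in cases)

# Inputs where the "space check" only guards the first clause.
disagreements = [
    c for c in cases if explicit_grouping(*c) != space_always_excluded(*c)
]

print(same_parse)
for is_space, is_start, is_vowel, other in disagreements:
    print(is_space, is_start, is_vowel, other)
```

Under these assumptions, the two groupings differ only when the character is a space and the `other` clause is true: the space check then guards just the first clause, so a space can still be appended.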