olliebennett / reporrigory

Finding great food/band combinations!
0 stars 0 forks source link

Phonetic Hash Algorithms #1

Open olliebennett opened 10 years ago

olliebennett commented 10 years ago

As described here:

Using a traditional fuzzy match algorithm to compute the closeness of two arbitrary strings is expensive, though, and it isn't appropriate for searching large data sets. A better solution is to compute hash values for entries in the database in advance, and several special hash algorithms have been created for this purpose. These phonetic hash algorithms allow you to compare two words or names based on how they sound, rather than the precise spelling.

Implementing this would be awesome as a new scoring mechanism to accompany the others.

coughee commented 10 years ago

I see you implemented it already, good job : )!

On 18 June 2014 12:07, Ollie Bennett notifications@github.com wrote:

As described here http://www.informit.com/articles/article.aspx?p=1848528:

Using a traditional fuzzy match algorithm to compute the closeness of two arbitrary strings is expensive, though, and it isn't appropriate for searching large data sets. A better solution is to compute hash values for entries in the database in advance, and several special hash algorithms have been created for this purpose. These phonetic hash algorithms allow you to compare two words or names based on how they sound, rather than the precise spelling.

Implementing this would be awesome as a new scoring mechanism to accompany the others.

— Reply to this email directly or view it on GitHub https://github.com/olliebennett/reporrigory/issues/1.

coughee commented 10 years ago

I haven't found a food list that is very satisfactory so I've started work on a script that will create one from http://en.wikipedia.org/wiki/Category:Lists_of_foods

On 19 June 2014 06:38, Jonathan Keelan sock.uk@gmail.com wrote:

I see you implemented it already, good job : )!

On 18 June 2014 12:07, Ollie Bennett notifications@github.com wrote:

As described here http://www.informit.com/articles/article.aspx?p=1848528:

Using a traditional fuzzy match algorithm to compute the closeness of two arbitrary strings is expensive, though, and it isn't appropriate for searching large data sets. A better solution is to compute hash values for entries in the database in advance, and several special hash algorithms have been created for this purpose. These phonetic hash algorithms allow you to compare two words or names based on how they sound, rather than the precise spelling.

Implementing this would be awesome as a new scoring mechanism to accompany the others.

— Reply to this email directly or view it on GitHub https://github.com/olliebennett/reporrigory/issues/1.

olliebennett commented 10 years ago

I've implemented only a basic script to test a metaphone library I found, but have not implemented the "double metaphone" functionality yet, and haven't really found a great way to "score" a pair of words based on the similarity of the generated metaphone "hash". If they're the same, then great - the words sound almost alike, but even accepting a difference of one character (for example taking the Levenshtein distance between the two metaphone hashes) starts to include all kinds of crap matches. We'll have to think about that.