jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License
2.04k stars 157 forks source link

Added QWERTY Euclidean distance (qwerty_levenshtein) #81

Closed RJ-lovering closed 7 years ago

RJ-lovering commented 7 years ago

Also added backtrace and and reader-friendly operations summary for qwerty_levenshtein.

jamesturk commented 7 years ago

I can't find any reference to this algorithm via Google, mind providing a source?

Also it looks like this is failing tests, but we can wait to discuss those until a decision has been made about inclusion

RJ-lovering commented 7 years ago

I'm not sure of its independent algorithmic name, myself, or if it has one. It's just a variation on levenshtein-damerau that calculates substitution cost sensitive to the physical distance between keys on a a QWERTY keyboard and the effort of pressing SHIFT (the z value of the coordinate system). I got the idea from this stack overflow exchange http://stackoverflow.com/questions/29233888/edit-distance-such-as-levenshtein-taking-into-account-proximity-on-keyboard, and developed it a bit further to accommodate some of the data I was working with, in pursuit of finding duplicate entries for names of people and medications in a database I work with.

It has served reasonably well as a component in glomming together typos without bias toward higher-scoring in cases where letters at the beginning of a word are the same (I did not find any research to support the idea that typos are significantly more likely to occur at the end of a word than at the beginning, and it was causing some tedium in the manual review phase of my own project to weight them differently).

If it's not something that jellyfish ought to have, I'm happy to keep it on my branch and keep that branch to myself, but I thought others might find it useful.

On Wed, Apr 12, 2017 at 11:38 AM, James Turk notifications@github.com wrote:

I can't find any reference to this algorithm via Google, mind providing a source?

Also it looks like this is failing tests, but we can wait to discuss those until a decision has been made about inclusion

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/jamesturk/jellyfish/pull/81#issuecomment-293618682, or mute the thread https://github.com/notifications/unsubscribe-auth/AXQ3YrELe2xklh_Rt-Pn4OZ2dZXdNil2ks5rvO_3gaJpZM4M7jR0 .

jamesturk commented 7 years ago

I think it'd be tricky to keep up the maintenance on my end (and in keeping parity w/ the others I'd need to port it to C & Go) - it is a neat idea though, have you thought of rolling your own package?

On Wed, Apr 12, 2017 at 11:52 AM, RJ-lovering notifications@github.com wrote:

I'm not sure of its independent algorithmic name, myself, or if it has one. It's just a variation on levenshtein-damerau that calculates substitution cost sensitive to the physical distance between keys on a a QWERTY keyboard and the effort of pressing SHIFT (the z value of the coordinate system). I got the idea from this stack overflow exchange http://stackoverflow.com/questions/29233888/edit- distance-such-as-levenshtein-taking-into-account-proximity-on-keyboard, and developed it a bit further to accommodate some of the data I was working with, in pursuit of finding duplicate entries for names of people and medications in a database I work with.

It has served reasonably well as a component in glomming together typos without bias toward higher-scoring in cases where letters at the beginning of a word are the same (I did not find any research to support the idea that typos are significantly more likely to occur at the end of a word than at the beginning, and it was causing some tedium in the manual review phase of my own project to weight them differently).

If it's not something that jellyfish ought to have, I'm happy to keep it on my branch and keep that branch to myself, but I thought others might find it useful.

On Wed, Apr 12, 2017 at 11:38 AM, James Turk notifications@github.com wrote:

I can't find any reference to this algorithm via Google, mind providing a source?

Also it looks like this is failing tests, but we can wait to discuss those until a decision has been made about inclusion

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/jamesturk/jellyfish/pull/81#issuecomment-293618682, or mute the thread https://github.com/notifications/unsubscribe-auth/AXQ3YrELe2xklh_Rt- Pn4OZ2dZXdNil2ks5rvO_3gaJpZM4M7jR0 .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/jamesturk/jellyfish/pull/81#issuecomment-293623069, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAfYg33rffPE8MsEt9B0mE_qOx4cCU8ks5rvPMkgaJpZM4M7jR0 .

RJ-lovering commented 7 years ago

I hadn't until you made the suggestion, actually, but since this is one in a suite of tools I'm developing for my department, I might make a go of it once I've figured out what's got the strongest reuse value. Thanks for the suggestion!

I will, in the meantime, keep my branch to myself. Jellyfish is great work, thanks for sharing it and keeping it up.

On Wed, Apr 12, 2017 at 1:43 PM, James Turk notifications@github.com wrote:

I think it'd be tricky to keep up the maintenance on my end (and in keeping parity w/ the others I'd need to port it to C & Go) - it is a neat idea though, have you thought of rolling your own package?

On Wed, Apr 12, 2017 at 11:52 AM, RJ-lovering notifications@github.com wrote:

I'm not sure of its independent algorithmic name, myself, or if it has one. It's just a variation on levenshtein-damerau that calculates substitution cost sensitive to the physical distance between keys on a a QWERTY keyboard and the effort of pressing SHIFT (the z value of the coordinate system). I got the idea from this stack overflow exchange http://stackoverflow.com/questions/29233888/edit- distance-such-as-levenshtein-taking-into-account-proximity-on-keyboard, and developed it a bit further to accommodate some of the data I was working with, in pursuit of finding duplicate entries for names of people and medications in a database I work with.

It has served reasonably well as a component in glomming together typos without bias toward higher-scoring in cases where letters at the beginning of a word are the same (I did not find any research to support the idea that typos are significantly more likely to occur at the end of a word than at the beginning, and it was causing some tedium in the manual review phase of my own project to weight them differently).

If it's not something that jellyfish ought to have, I'm happy to keep it on my branch and keep that branch to myself, but I thought others might find it useful.

On Wed, Apr 12, 2017 at 11:38 AM, James Turk notifications@github.com wrote:

I can't find any reference to this algorithm via Google, mind providing a source?

Also it looks like this is failing tests, but we can wait to discuss those until a decision has been made about inclusion

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <https://github.com/jamesturk/jellyfish/pull/81#issuecomment-293618682 , or mute the thread https://github.com/notifications/unsubscribe-auth/AXQ3YrELe2xklh_Rt- Pn4OZ2dZXdNil2ks5rvO_3gaJpZM4M7jR0 .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/jamesturk/jellyfish/pull/81#issuecomment-293623069, or mute the thread https://github.com/notifications/unsubscribe- auth/AAAfYg33rffPE8MsEt9B0mE_qOx4cCU8ks5rvPMkgaJpZM4M7jR0 .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/jamesturk/jellyfish/pull/81#issuecomment-293654726, or mute the thread https://github.com/notifications/unsubscribe-auth/AXQ3Yvw55x1AiqFzp0mJIrJSuH963JMoks5rvQ1LgaJpZM4M7jR0 .

jamesturk commented 7 years ago

sounds good & thanks for understanding :)