jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License

Adding QWERTY support to DL distance #92

Open DocShahrukh opened 6 years ago

DocShahrukh commented 6 years ago

This adjusts the substitution cost in the Damerau-Levenshtein (DL) distance for QWERTY keyboard mistakes, and it may work for other layouts too. Please take a look when you're free.

    # Pairs of neighbouring keys on a QWERTY keyboard.
    key_pairs = [
        {'q','a'},{'q','w'},{'w','a'},{'w','e'},{'w','s'},{'e','s'},{'e','d'},{'e','r'},
        {'r','d'},{'r','f'},{'r','t'},{'t','g'},{'t','y'},{'y','g'},{'y','h'},{'y','u'},
        {'u','h'},{'u','j'},{'u','i'},{'i','j'},{'i','k'},{'i','o'},{'o','k'},{'o','l'},
        {'o','p'},{'l','k'},{'m','k'},{'m','n'},{'n','j'},{'n','b'},{'b','h'},{'b','v'},
        {'v','g'},{'v','c'},{'c','f'},{'c','x'},{'x','d'},{'x','z'},{'z','s'},
    ]

    def damerau_levenshtein_cost(a, b):
        # Identical characters are free; neighbouring keys are cheap; anything else is a full edit.
        if a == b:
            return 0
        elif {a, b} in key_pairs:
            return 0.25
        return 1

    # Inside the distance loop, used as the substitution cost:
    cost = damerau_levenshtein_cost(s1[i-1], s2[j-1])

I wish this hack finds some stack
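For readers who want to try the idea outside the library, here is a minimal self-contained sketch. It is not jellyfish's implementation: it uses the restricted (optimal string alignment) variant of Damerau-Levenshtein for brevity, and the names `KEY_PAIRS`, `substitution_cost`, and `weighted_osa_distance` are hypothetical. Only a small subset of the key pairs above is included to keep it short.

```python
# Hypothetical subset of neighbouring QWERTY keys, for illustration only.
KEY_PAIRS = {frozenset(p) for p in [("q", "w"), ("w", "e"), ("e", "r"), ("q", "a")]}


def substitution_cost(a, b):
    """0 for identical characters, 0.25 for neighbouring keys, 1 otherwise."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in KEY_PAIRS:
        return 0.25
    return 1.0


def weighted_osa_distance(s1, s2):
    """Restricted Damerau-Levenshtein distance with a weighted substitution cost."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)  # delete every character of s1[:i]
    for j in range(cols):
        d[0][j] = float(j)  # insert every character of s2[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = substitution_cost(s1[i - 1], s2[j - 1])
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (weighted)
            )
            # Transposition of two adjacent characters, kept at unit cost.
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this sketch, typing `qest` for `west` costs 0.25 instead of 1, because `q` and `w` are neighbours; insertions, deletions, and transpositions are left at their usual cost of 1.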

jsfenfen commented 6 years ago

This is really interesting. @DocShahrukh would this approach work for OCR errors too, assuming one came up with a useful weighting? So '1' paired with 'l', etc.

jamesturk commented 6 years ago

I think it’d make sense to make the cost function configurable; that’d let people do this for different layouts or OCR-specific functions. I’d be glad to incorporate such a PR if anyone has time.
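One possible shape for a configurable cost function is sketched below. This is an assumption about what such an interface could look like, not an existing jellyfish API: the `substitution_cost` parameter, the `make_pair_cost` helper, and the OCR pairs other than `'1'`/`'l'` are all hypothetical. Plain Levenshtein is used to keep the example short.

```python
from typing import Callable, Iterable, Tuple

CostFn = Callable[[str, str], float]


def unit_cost(a: str, b: str) -> float:
    """Default behaviour: every substitution costs 1."""
    return 0.0 if a == b else 1.0


def make_pair_cost(pairs: Iterable[Tuple[str, str]], near_cost: float = 0.25) -> CostFn:
    """Build a cost function that discounts the given confusable pairs."""
    lookup = {frozenset(p) for p in pairs}

    def cost(a: str, b: str) -> float:
        if a == b:
            return 0.0
        return near_cost if frozenset((a, b)) in lookup else 1.0

    return cost


def levenshtein(s1: str, s2: str, substitution_cost: CostFn = unit_cost) -> float:
    """Levenshtein distance with a pluggable substitution cost (hypothetical signature)."""
    prev = [float(j) for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, start=1):
        cur = [float(i)]
        for j, b in enumerate(s2, start=1):
            cur.append(min(
                prev[j] + 1,                            # deletion
                cur[j - 1] + 1,                         # insertion
                prev[j - 1] + substitution_cost(a, b),  # substitution
            ))
        prev = cur
    return prev[-1]


# Hypothetical OCR confusions; only '1'/'l' comes from this thread.
ocr_cost = make_pair_cost([("1", "l"), ("0", "O"), ("5", "S")])
```

Callers who pass nothing get the usual unit costs, while `levenshtein("he11o", "hello", ocr_cost)` charges 0.25 for each `1`/`l` confusion. Keeping the weighting in a separate callable means keyboard layouts and OCR tables can share the same distance routine.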

DonaldTsang commented 6 years ago

Don't forget QWERTZ (Germany, Austria and Eastern Europe)!

DimitriPapadopoulos commented 2 years ago

And AZERTY, Dvorak, and other layouts. The cost function really needs to be configurable, perhaps with a few standard costs.
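As a rough sketch of what "a few standard costs" could mean, adjacency can be derived from the letter rows of a layout rather than listed by hand. Everything here is an assumption for illustration: the names `LAYOUT_ROWS`, `adjacent_pairs`, and `make_layout_cost` are hypothetical, only letter keys are covered, and the row stagger is approximated by treating the key in column `c` as touching columns `c - 1` and `c` of the row below.

```python
# Letter rows only; digits and punctuation are omitted in this sketch.
LAYOUT_ROWS = {
    "qwerty": ("qwertyuiop", "asdfghjkl", "zxcvbnm"),
    "qwertz": ("qwertzuiop", "asdfghjkl", "yxcvbnm"),
    "azerty": ("azertyuiop", "qsdfghjklm", "wxcvbn"),
}


def adjacent_pairs(rows):
    """Collect unordered pairs of keys that sit next to each other."""
    pairs = set()
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            # Right-hand neighbour in the same row.
            if c + 1 < len(row):
                pairs.add(frozenset((key, row[c + 1])))
            # Neighbours in the row below (approximate stagger).
            if r + 1 < len(rows):
                below = rows[r + 1]
                for k in (c - 1, c):
                    if 0 <= k < len(below):
                        pairs.add(frozenset((key, below[k])))
    return pairs


def make_layout_cost(layout, near_cost=0.25):
    """Return a substitution-cost function for one of the layouts above."""
    pairs = adjacent_pairs(LAYOUT_ROWS[layout])

    def cost(a, b):
        if a == b:
            return 0.0
        return near_cost if frozenset((a, b)) in pairs else 1.0

    return cost
```

Under these assumptions `t` and `z` are neighbours on QWERTZ but not on QWERTY, so the same typo gets a different cost depending on the layout chosen. Dvorak could be added the same way once its punctuation keys are accounted for.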

> This is really interesting. @DocShahrukh would this approach work for OCR errors too, assuming one came up with a useful weighting? So '1' paired with 'l', etc.

Indeed, see for example: