Open DocShahrukh opened 6 years ago
This is really interesting. @DocShahrukh would this approach work for OCR errors too, assuming one came up with a useful weighting? So '1' paired with 'l', etc.
I think it'd make sense to make the cost function configurable; that'd let people do this for different layouts or OCR-specific functions. I'd be glad to incorporate such a PR if anyone has time.
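To make the idea concrete, here is a rough sketch of what a configurable cost could look like. It is not the library's code or API: `weighted_dl_distance`, `sub_cost`, `unit_cost`, `ocr_cost` and `OCR_PAIRS` are all made-up names, the OCR pairs and the 0.25 weight are placeholders, and the function implements the restricted (optimal string alignment) variant of Damerau-Levenshtein, which may differ from what the library does.

```python
from typing import Callable

def unit_cost(a: str, b: str) -> float:
    """Classic substitution cost: 0 for a match, 1 otherwise."""
    return 0.0 if a == b else 1.0

# Hypothetical OCR confusion pairs; real weights would have to come
# from measured error rates for a given OCR engine.
OCR_PAIRS = {frozenset("1l"), frozenset("0O"), frozenset("5S")}

def ocr_cost(a: str, b: str) -> float:
    """Cheaper substitution for characters that OCR tends to confuse."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in OCR_PAIRS:
        return 0.25
    return 1.0

def weighted_dl_distance(
    s1: str, s2: str, sub_cost: Callable[[str, str], float] = unit_cost
) -> float:
    """Restricted Damerau-Levenshtein distance with a pluggable substitution cost."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)  # delete every character of s1[:i]
    for j in range(cols):
        d[0][j] = float(j)  # insert every character of s2[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = sub_cost(s1[i - 1], s2[j - 1])
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (weighted)
            )
            # Transposition of two neighbouring characters, kept at a fixed cost of 1.
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With the default cost this behaves like the usual unweighted distance; passing `ocr_cost` makes `weighted_dl_distance("he1lo", "hello", ocr_cost)` come out at 0.25 instead of 1. Only substitutions are weighted here; whether insertions, deletions and transpositions should be weighted too is an open design question.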
Don't forget QWERTZ (Germany, Austria and Eastern Europe)!
And AZERTY, Dvorak, and other layouts. Cost function really needs to be configurable, perhaps with a few standard costs.
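One way to get "a few standard costs" without hand-writing a pair list per layout would be to derive the neighbours from the letter rows. This is only a sketch: `LAYOUTS`, `adjacent_pairs` and `make_cost` are hypothetical names, only the letter keys are modelled (no umlauts, digits or punctuation), and the rule "a key touches the key above it and the one to the right of that" is an approximation of the usual row stagger, so the result will not match a hand-made list exactly.

```python
# Letter rows only; umlauts, digits and punctuation are left out for simplicity.
LAYOUTS = {
    "qwerty": ["qwertyuiop", "asdfghjkl", "zxcvbnm"],
    "qwertz": ["qwertzuiop", "asdfghjkl", "yxcvbnm"],
    "azerty": ["azertyuiop", "qsdfghjklm", "wxcvbn"],
}

def adjacent_pairs(rows):
    """Return a set of frozensets, one per pair of neighbouring keys."""
    pairs = set()
    # Horizontal neighbours within each row.
    for row in rows:
        for left, right in zip(row, row[1:]):
            pairs.add(frozenset((left, right)))
    # Vertical neighbours: assume each key touches the key at the same
    # index in the row above and the one after it (approximate stagger).
    for upper, lower in zip(rows, rows[1:]):
        for j, key in enumerate(lower):
            for neighbour in upper[j:j + 2]:
                pairs.add(frozenset((key, neighbour)))
    return pairs

def make_cost(layout, near=0.25):
    """Build a substitution-cost function for the named layout."""
    pairs = adjacent_pairs(LAYOUTS[layout])

    def cost(a, b):
        if a == b:
            return 0
        return near if frozenset((a, b)) in pairs else 1

    return cost
```

For example, `make_cost("qwertz")("t", "z")` gives the reduced cost because those keys sit side by side on QWERTZ, while `make_cost("qwerty")("t", "z")` gives the full cost of 1. Dvorak or any other layout could be added as another entry in the dictionary.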
This is really interesting. @DocShahrukh would this approach work for OCR errors too, assuming one came up with a useful weighting? So '1' paired with 'l', etc.
Indeed, see for example:
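This adjusts the substitution cost in Damerau-Levenshtein (DL) distance for typing mistakes between neighbouring keys on a QWERTY keyboard; the same idea may apply to other layouts too. Please take a look if you have time.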
    # Pairs of keys that sit next to each other on a QWERTY keyboard.
    key_pairs = [
        {'q','a'}, {'q','w'}, {'w','a'}, {'w','e'}, {'w','s'}, {'e','s'}, {'e','d'}, {'e','r'},
        {'r','d'}, {'r','f'}, {'r','t'}, {'t','g'}, {'t','y'}, {'y','g'}, {'y','h'}, {'y','u'},
        {'u','h'}, {'u','j'}, {'u','i'}, {'i','j'}, {'i','k'}, {'i','o'}, {'o','k'}, {'o','l'},
        {'o','p'}, {'l','k'}, {'m','k'}, {'m','n'}, {'n','j'}, {'n','b'}, {'b','h'}, {'b','v'},
        {'v','g'}, {'v','c'}, {'c','f'}, {'c','x'}, {'x','d'}, {'x','z'}, {'z','s'},
    ]

    def damerau_levenshtein_cost(a, b):
        # 0 for identical characters, 0.25 for neighbouring keys, 1 otherwise.
        if a == b:
            return 0
        elif {a, b} in key_pairs:
            return .25
        return 1

    # Used where the distance loop computes the substitution cost:
    cost = damerau_levenshtein_cost(s1[i-1], s2[j-1])
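A small performance note on the lookup: testing membership in a list of sets is a linear scan, and it runs once per cell of the distance matrix. A possible tweak, sketched below, is to store the pairs as frozensets inside a set so that each lookup is a hash lookup. `KEY_PAIR_SET` and the `near` parameter are names introduced here for illustration, and the pair list is cut down to a few entries to keep the example short.

```python
# A small subset of the pairs from the post, for illustration only.
key_pairs = [{'q', 'a'}, {'q', 'w'}, {'w', 'a'}, {'w', 'e'}]

# frozenset is hashable, so the pairs can live in a set for constant-time lookup.
KEY_PAIR_SET = frozenset(frozenset(pair) for pair in key_pairs)

def damerau_levenshtein_cost(a, b, near=0.25):
    """0 for identical characters, `near` for neighbouring keys, 1 otherwise."""
    if a == b:
        return 0
    if frozenset((a, b)) in KEY_PAIR_SET:
        return near
    return 1
```

The results are the same as with the list version; only the lookup cost changes, which may matter when comparing many long strings.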
I hope this hack finds some stack.