k3jph / phonics-in-r

Phonetic Spelling Algorithms in R
https://jameshoward.us/phonics-in-r
Other
28 stars 7 forks source link

Add support for fuzzy Soundex #12

Open howardjp opened 7 years ago

howardjp commented 7 years ago

More information on fuzzy Soundex available from http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf

KyleHaynes commented 5 years ago

Hi @howardjp,

I have an implementation of this fuzzy soundex I'd be happy to go into this package if you want it. I've translated it from the following: https://yomguithereal.github.io/talisman/phonetics/

Currently uses the stringi (with a slightly time benefit), could easily be written in pure base R (base R code included)

I've also read the paper and unless I'm missing something, it aligns with it.


On another note. I've also translated some keyers as well (skeleton and omission): https://yomguithereal.github.io/talisman/keyers

Not sure if you'd see scope for them to be in phonics?

#' @aliases fuzzy_soundex
#'
#'
#' @title Implementation of the "Fuzzy Soundex" algorithm.
#'
#'
#' @description Implementation of the "Fuzzy Soundex" algorithm. Function taking a a vector of names and computing its Fuzzy Soundex code.
#'
#'
#' @param word String or vector of strings to encode.
#'
#'
#' @return Returns a vector of encoded strings.
#'
#'
#' @references Article: Holmes, David and M. Catherine McCabe. "Improving Precision and Recall for Soundex Retrieval." 
#'      \url{http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf} \cr 
#'      Code based on: \url{https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/fuzzy-soundex.js}
#'      Licence: \url{https://github.com/Yomguithereal/talisman/blob/master/LICENSE.txt}.
#'
#'
#' @author Kyle Haynes, \email{kyle@@kylehaynes.com.au}.
#'
#'
#' @examples
#' fuzzy_soundex(c("Holmes", "Hollmes", "David", "Daved", "Catherine", "Kathryn"))
#' 
#' 
#' @export
fuzzy_soundex <-  function(word){

    # ---- Define constants ----
    # The code has been structured in such a way as to follow the following implementation of the algorithm:
        # https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/fuzzy-soundex.js

    # Define 'translation'.
    translation <- c("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 
                     "0193017~07745501769301~7~9")

    # Define sets.
    set1 <- c("CS", "CZ", "TS", "TZ")
    set2 <- c("HR", "WR")
    set3 <- c("KN", "NG")
    set4 <- c("^H|^W|^Y")

    # Define 'rules'.
    rules_from <- c("CA", "CC", "CK", "CE", "CHL", "CL", "CHR", "CR", "CI", "CO", "CU", "CY", "DG", "GH", "MAC", "MC", "NST", "PF", "PH", "SCH", "TIO", "TIA", "TCH")
    rules_to <-   c("KA", "KK", "KK", "SE", "KL",  "KL", "KR",  "KR", "SI", "KO", "KU", "SY", "GG", "HH", "MK", "MK",  "NSS", "FF", "FF", "SSS", "SIO", "SIO", "CHH")

    # ---- Checks ----
    if(!is.vector(word) || is.na(word)){
        stop("Input must be a vector.")
    }

    # ---- Code ----
    # Coerce NA's to "".
    if(any(is.na(word))){
        na_logical <- is.na(word)
        word[is.na(word)] <- ""
    }

    ##   // Deburring the string & dropping any non-alphabetical character
    ##   name = deburr(name)
    ##     .toUpperCase()
    ##     .replace(/[^A-Z]/g, '');
    word <- iconv(word, to = 'ASCII//TRANSLIT')
    word <- toupper(word)
    word <- gsub("[^A-Z]", "", word, perl = TRUE)
    ## if (SET1.has(firstTwoLetters))
    ##     name = 'SS' + rest;
    ## else if (firstTwoLetters === 'GN')
    ##     name = 'NN' + rest;
    ## else if (SET2.has(firstTwoLetters))
    ##     name = 'RR' + rest;
    ## else if (firstTwoLetters === 'HW')
    ##     name = 'WW' + rest;
    ## else if (SET3.has(firstTwoLetters))
    ##     name = 'NN' + rest;
    word <- gsub(paste0("^", paste0(set1, collapse = "|^")), "SS", word, perl = TRUE)
    word <- gsub("^GN", "NN", word, perl = TRUE)
    word <- gsub(paste0("^", paste0(set2, collapse = "|^")), "RR", word, perl = TRUE)
    word <- gsub("^HW", "WW", word, perl = TRUE)
    word <- gsub(paste0("^", paste0(set3, collapse = "|^")), "NN", word, perl = TRUE)

    ##       // Applying some substitutions for endings
    ##   const lastTwoLetters = name.slice(-2),
    ##         initial = name.slice(0, -2);
    ##   if (lastTwoLetters === 'CH')
    ##     name = initial + 'KK';
    ##   else if (lastTwoLetters === 'NT')
    ##     name = initial + 'TT';
    ##   else if (lastTwoLetters === 'RT')
    ##     name = initial + 'RR';
    ##   else if (name.slice(-3) === 'RDT')
    ##     name = name.slice(0, -3) + 'RR';
    word <- gsub("(.{1,})CH$", "\\1KK", word, perl = TRUE)
    word <- gsub("(.{1,})NT$", "\\1TT", word, perl = TRUE)
    word <- gsub("(.{1,})RT$", "\\1RR", word, perl = TRUE)
    word <- gsub("(.{1,})RDT$", "\\1RR", word, perl = TRUE)

    ##   // Applying the rules
    ##   for (let i = 0, l = RULES.length; i < l; i++)
    ##     name = name.replace(...RULES[i]);

    # As gsub is already vectorised, a simple for loop will suffice.
    # for(i in 1:length(rules)){
    #     word <- gsub(rules[[i]][1], rules[[i]][2], word, perl = TRUE)
    # }
    word <- stringi::stri_replace_all_fixed(word, rules_from, rules_to, vectorize_all = FALSE)

    ##   // Caching the first letter
    ##   const firstLetter = name[0];
    first_character <- substring(word, 1, 1)
    ##   // Translating
    ##   let code = '';
    ##   for (let i = 0, l = name.length; i < l; i++)
    ##     code += TRANSLATION[name[i]] || name[i];
    word <- chartr(translation[1], translation[2] , word)
    ##   // Removing hyphens
    ##   code = code.replace(/-/g, '');
    word <- gsub("~", "", word, fixed = TRUE)
    ##   // Squeezing the code
    ##   code = squeeze(code);
    # Code can be found here: https://github.com/Yomguithereal/talisman/blob/master/src/helpers/index.js
        ## /**
        ##  * Function squeezing the given sequence by dropping consecutive duplicates.
        ##  *
        ##  * Note: the name was actually chosen to mimic Ruby's naming since I did not
        ##  * find any equivalent in other standard libraries.
        ##  *
        ##  * @param  {mixed} target - The sequence to squeeze.
        ##  * @return {array}        - The resulting sequence.
        ##  */
        ## export function squeeze(target) {
        ##   const isString = typeof target === 'string',
        ##         sequence = seq(target),
        ##         squeezed = [sequence[0]];
        ## 
        ##   for (let i = 1, l = sequence.length; i < l; i++) {
        ##     if (sequence[i] !== sequence[i - 1])
        ##       squeezed.push(sequence[i]);
        ##   }
        ## 
        ##   return isString ? squeezed.join('') : squeezed;
        ## }        
    word <- gsub("([0-9])\\1+", "\\1", word, perl = TRUE)
    ##   // Dealing with some initials
    ##   if (SET4.has(code[0]))
    ##     code = firstLetter + code;
    ##   else
    ##     code = firstLetter + code.slice(1);
    word[grepl(set4, word)] <- paste0(first_character[grepl(set4, word)], word[grepl(set4, word)])
    word[!grepl(set4, word)] <- paste0(first_character[!grepl(set4, word)], substring(word[!grepl(set4, word)], 2))
    ##   // Dropping vowels
    ##   code = code.replace(/0/g, '');
    word <- gsub("0", "", word, perl = TRUE)

    # Finally, if any NA's were passed, set them back to NA.
    if(exists("na_logical")){
        word[na_logical] <- NA
    }

    # Return hashed word(s).
    return(word)
}
howardjp commented 5 years ago

@KyleHaynes, this seems pretty reasonable. The big question is, do you have test cases?

KyleHaynes commented 5 years ago

@howardjp, good point, let me get back to you (just drowning with work atm).

chrislit commented 5 years ago

If it's any help, here's a set of testcases I use for fuzzy soundex, formatted like your CSV files. fuzzy-soundex.csv.txt

howardjp commented 5 years ago

Awesome sauce!

I am out of town right now, and will look at this later this week or early next!

On Sun, Aug 4, 2019 at 5:51 PM Chris Little notifications@github.com wrote:

If it's any help, here's a set of testcases I use for fuzzy soundex, formatted like your CSV files. fuzzy-soundex.csv.txt https://github.com/howardjp/phonics/files/3465362/fuzzy-soundex.csv.txt

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/howardjp/phonics/issues/12?email_source=notifications&email_token=AABGG2SJFLZKLRWIDACJCQ3QC5FNTA5CNFSM4CZB6N6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3QKNQY#issuecomment-518039235, or mute the thread https://github.com/notifications/unsubscribe-auth/AABGG2T4LJDNDHRKDUM7LBTQC5FNTANCNFSM4CZB6N6A .

KyleHaynes commented 5 years ago

@chrislit - thanks for this

I've run against my function and from the 27 tests strings I get two mismatches (same word)

It's an issue with my function and not the test data.

I'll look into this further

require(data.table)
require(stringi)

csv <- "name,hash
Kristen,K6935
Krissy,K6900
Christen,K6935
peter,P3600
pete,P3000
pedro,P3600
stephen,S3150
steve,S3100
smith,S5300
smythe,S5300
gail,G4000
gayle,G4000
christine,K6935
christina,K6935
kristina,K6935
Wight,W3000
Hardt,H6000
Knight,N3000
Czech,S7000
Tsech,S7000
gnomic,N5900
Wright,R3000
Hrothgar,R3760
Hwaet,W3000
Grant,G6300
Hart,H6000
Hardt,H6000"

dt <- fread(csv)

dt[, "hash_2" := fuzzy_soundex(name)]
# Currently my function doesn't pad trailing 0's. Add this:
dt[, "hash_2" := stri_pad_right(hash_2, 5, 0)]

all.equal(dt$hash, dt$hash_2)

dt[hash != hash_2]
#    name  hash hash_2
# 1: Hardt H6000  R0000
# 2: Hardt H6000  R0000
KyleHaynes commented 5 years ago

Fixed code (Fixed in my original comment).

Re-running previous test now returns no differences.

Re padding out strings, could make this argument driven like the phonics::soundex argument: maxCodeLen = 4L. Could set the length, otherwise, if NULL no padding at all?

I (try) to conform to snake_case, happy to adjust the name of the function to be whatever (to fit in with the rest of the phonics library)

As for the dependency of stringi let me know if you want adjust to base R.

Also, moving forward, I'll create a fork and make changes there.

howardjp commented 5 years ago

Your approach to maxCodeLen is fine...

As for stringi, I am not sure. I mean, I don't mind the dependency. You gotta do what you gotta do, amirite? As a user, I can't imagine it is really objectionable.

On Tue, Aug 6, 2019 at 2:53 AM Kyle Haynes notifications@github.com wrote:

Fixed code (Fixed in my original comment).

Re-running previous test now returns no differences.

Re padding out strings, could make this argument driven like the phonics::soundex argument: maxCodeLen = 4L. Could set the length, otherwise, if NULL no padding at all?

I (try) to conform to snake_case, happy to adjust the name of the function to be whatever (to fit in with the rest of the phonics library)

As for the dependency of stringi let me know if you want adjust to base R.

Also, moving forward, I'll create a fork and make changes there.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/howardjp/phonics/issues/12?email_source=notifications&email_token=AABGG2SGMI3UAW2VKZQAAK3QDENX5A5CNFSM4CZB6N6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3UCYQI#issuecomment-518532161, or mute the thread https://github.com/notifications/unsubscribe-auth/AABGG2WRMO4GCZN5YI64QODQDENX5ANCNFSM4CZB6N6A .