kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Seemingly odd partial matching behavior #67

Closed: bengoehring closed this issue 1 year ago

bengoehring commented 1 year ago

Hi there,

Thank you for writing, and especially maintaining, such a great package. I worry this is going to be a silly question -- and I apologize if that is the case.

I am trying to assign unique ids to a roster of names. Some of the matches, however, seem too inclusive: as in the example below, they group strings that look too distinct from one another to be considered matches. At the bottom I am including a screenshot of this happening on the full dataset, so you can get a better sense of the range of different names that are treated as matches.

I am guessing this behavior can be fixed by tweaking some parameters (even though the threshold matching level is > .9?), but I wanted to bring it up here too in case something else is going awry.

Thanks again, Ben

library(tidyverse)
library(fastLink)

sample_data <- structure(list(fiscal_year = c(2011, 2020, 2021, 2012, 2016, 
2015, 2017, 2017, 2016, 2019, 2019, 2020, 2010, 2014, 2017, 2019, 
2020, 2016, 2015, 2013, 2010, 2010, 2009, 2011, 2009, 2013, 2018, 
2013, 2009, 2017, 2010, 2021, 2021, 2021, 2015, 2014, 2014, 2013, 
2013, 2017, 2013, 2014, 2016, 2012, 2011, 2019, 2016, 2013, 2017, 
2010, 2011, 2011, 2020, 2021, 2015, 2012, 2012, 2009, 2017, 2019, 
2009, 2014, 2021, 2015, 2011, 2021, 2019, 2017, 2012, 2014, 2009, 
2013, 2010, 2021, 2012, 2021, 2013, 2015, 2015, 2015, 2015, 2013, 
2018, 2010, 2011, 2014, 2011, 2015, 2014, 2013, 2016, 2012, 2012, 
2014, 2018, 2016, 2016, 2009, 2014, 2021, 2015, 2010, 2014, 2018, 
2021, 2019, 2010, 2020, 2017, 2009, 2010, 2014, 2018, 2009, 2020, 
2009, 2019, 2019, 2016, 2012, 2018, 2020, 2019, 2014, 2014, 2016, 
2019, 2010, 2015, 2021, 2012, 2013, 2014, 2009, 2015, 2016, 2016, 
2020, 2015, 2012, 2016, 2015, 2011, 2016, 2009, 2019, 2014, 2013, 
2021, 2019, 2020, 2016, 2019, 2010, 2014, 2020, 2021, 2013, 2016, 
2015, 2010, 2018, 2020, 2017, 2016, 2011, 2016, 2017, 2018, 2015, 
2010, 2012, 2019, 2020, 2021, 2016, 2020, 2014, 2016, 2016, 2009, 
2016, 2018, 2016, 2015, 2021, 2017, 2011, 2021, 2018, 2010, 2015, 
2017, 2021, 2012, 2014, 2013, 2010, 2015, 2011, 2015, 2019, 2012, 
2010, 2010, 2020, 2021, 2016, 2012, 2016, 2011, 2014, 2016, 2009, 
2019, 2015, 2017, 2018, 2014, 2021, 2017, 2010, 2013, 2016, 2020, 
2014, 2017, 2013, 2018, 2019, 2013, 2011, 2019, 2011, 2013, 2013, 
2014, 2009, 2018, 2018, 2009, 2021, 2015, 2015, 2018, 2014, 2015, 
2012, 2018, 2014, 2017, 2015, 2010, 2016, 2013, 2019, 2016, 2014, 
2009, 2019, 2009, 2018, 2013, 2011, 2020, 2020, 2009, 2012, 2011, 
2010, 2010, 2017, 2012, 2012, 2009, 2014, 2009, 2016, 2019, 2009, 
2010, 2019, 2014, 2010, 2009, 2018, 2018, 2014, 2016, 2009, 2013, 
2020, 2012, 2019, 2021, 2016, 2021, 2009, 2016, 2014, 2015, 2018, 
2010, 2016, 2016, 2016, 2010, 2011, 2015, 2014, 2009, 2009, 2012, 
2011, 2012, 2018, 2015, 2019, 2018, 2021, 2016, 2019, 2019, 2014, 
2021, 2021, 2018, 2014, 2021, 2010, 2020, 2010, 2014, 2011, 2012, 
2021, 2020, 2009, 2016, 2018, 2011, 2018, 2014, 2019, 2014, 2020, 
2014, 2014, 2019, 2014, 2020, 2021, 2012, 2015, 2010, 2011, 2009, 
2009, 2010, 2016, 2016, 2011, 2019, 2021, 2010, 2020, 2009, 2016, 
2016, 2017, 2015, 2013, 2019, 2012, 2012, 2014, 2011, 2013, 2011, 
2015, 2015, 2009, 2016, 2016, 2021, 2009, 2021, 2009, 2019, 2013, 
2019, 2012, 2019, 2011, 2020, 2012, 2015, 2009, 2012, 2020, 2011, 
2010, 2018, 2010, 2012, 2009, 2015, 2014, 2021, 2019, 2009, 2018, 
2018, 2021, 2019, 2013, 2015, 2010, 2017, 2014, 2019, 2011, 2018, 
2011, 2017, 2012, 2014, 2014, 2014, 2011, 2014, 2016, 2016, 2020, 
2019, 2019, 2017, 2017, 2016, 2011, 2009, 2012, 2012, 2017, 2010, 
2015, 2013, 2016, 2019, 2014, 2019, 2018, 2018, 2016, 2020, 2009, 
2012, 2014, 2015, 2017, 2020, 2013, 2010, 2012, 2009, 2012, 2009, 
2017, 2020, 2017, 2021, 2015, 2020, 2017, 2019, 2018, 2016, 2011, 
2020, 2016, 2019, 2018, 2020, 2014, 2012, 2010, 2016, 2012, 2014, 
2009, 2012, 2015, 2009, 2021, 2013, 2011, 2009, 2014, 2017, 2019
), first_name = c("alice", "sarah", "judith", "mary", "brooke", 
"lisa", "michelle", "marian", "frederick", "ryan", "meaghan", 
"nicolaas", "peter", "tara", "sharon", "neira", "keith", "laura", 
"seth", "daniel", "richard", "david", "linda", "michael", "kristi", 
"timothy", "janet", "amy", "sharon", "erin", "suzanne", "timothy", 
"bradley", "matthew", "cathy", "kathleen", "monica", "john", 
"anita", "kristen", "eddie", "michael", "david", "sean", "gregory", 
"shana", "christopher", "christopher", "elizabeth", "sheila", 
"derick", "susan", "craig", "samuel", "margaret", "stephen", 
"william", "joshua", "tami", "courtney", "hedy", "sara", "francis", 
"kristopher", "cortland", "william", "sharon", "james", "robert", 
"barbara", "michael", "kevin", "leona", "sylvie", "dan", "casey", 
"helen", "keeley", "carla", "laura", "theresa", "isaac", "shelley", 
"kevin", "samantha", "tamatha", "james", "michael", "merideth", 
"carl", "anthony", "david", "susan", "laurie", "donald", "christopher", 
"rhonda", "jennifer", "thomas", "thea", "margaret", "melvin", 
"lucy", "samuel", "ashley", "deett", "naida", "nicole", "aaron", 
"marsha", "william", "nicolaas", "eloise", "mary", "bryan", "kevin", 
"nicole", "david", "karen", "dean", "elmer", "stacy", "sarah", 
"connie", "barbara", "ann", "julie", "harold", "anthony", "kevin", 
"jason", "karen", "donna", "christine", "philip", "richard", 
"alex", "garrett", "brad", "steven", "christopher", "stephen", 
"peter", "samantha", "thomas", "stuart", "daniel", "cassandra", 
"james", "arthur", "manuel", "daniel", "amanda", "robert", "ellen", 
"ryan", "matthew", "alyssa", "alexis", "jeffrey", "gloria", "stephanie", 
"christopher", "shannon", "john", "thomas", "brian", "ahmet", 
"tiffany", "alex", "lucy", "ashley", "hannah", "robert", "stuart", 
"kathleen", "gabriel", "vinny", "brandy", "shelia", "andrew", 
"cindy", "neil", "stephen", "nicole", "elaine", "joshua", "christopher", 
"kelly", "jessica", "kira", "gary", "renee", "scott", "jeremy", 
"brenda", "howard", "w", "winston", "albert", "kris", "calvin", 
"patricia", "ross", "kristin", "ray", "skylar", "caitlin", "amanda", 
"joseph", "brian", "sandra", "ryan", "ben", "anitalouise", "joan", 
"paul", "tammy", "lisa", "brian", "mary", "nicole", "michael", 
"alyssa", "sandra", "kristen", "samantha", "betty", "laurie", 
"richard", "angela", "john", "donald", "neil", "sally", "margery", 
"katelyn", "george", "caroline", "wendy", "april", "daisy", "kristina", 
"lisa", "jessica", "cory", "barbara", "rodney", "elexandra", 
"paul", "robert", "cathy", "phyllis", "richard", "clancy", "sabine", 
"neil", "tammy", "linda", "paul", "megan", "tim", "sara", "kristen", 
"julie", "amanda", "milan", "kathleen", "john", "ronald", "james", 
"scott", "christine", "susan", "brenda", "janet", "richard", 
"stephen", "james", "jay", "scott", "nicole", "amanda", "cecile", 
"michael", "logan", "conrad", "kenneth", "samantha", "steven", 
"barbara", "laura", "jennifer", "jessica", "cierra", "tammie", 
"teodoro", "terrance", "michelle", "nicole", "heather", "adam", 
"david", "matthew", "gregory", "scarlett", "christopher", "elizabeth", 
"linda", "wendy", "maureen", "robert", "keri", "june", "bruce", 
"brett", "kristin", "christopher", "christopher", "rachel", "tammy", 
"sueann", "garold", "samuel", "marian", "kim", "ashley", "robert", 
"brandi", "richard", "james", "brenda", "michael", "craig", "teresa", 
"brian", "andrea", "james", "patricia", "daniel", "john", "clare", 
"judith", "casey", "brock", "robert", "susan", "stephen", "devon", 
"pamela", "marlowe", "lawrence", "mary", "andrew", "denise", 
"evan", "christopher", "brian", "tracy", "kimberly", "jace", 
"julia", "alysha", "robert", "johnathan", "jason", "deborah", 
"jamie", "laurie", "stefan", "robert", "peter", "david", "hollis", 
"cynthia", "samara", "sean", "elizabeth", "kevin", "aaron", "margaret", 
"daniel", "amy", "steven", "dylan", "cynthia", "ellen", "tammy", 
"daniel", "dragica", "julie", "stephen", "nicholas", "heather", 
"johnathan", "arthur", "john", "thomas", "karen", "michael", 
"william", "michael", "sharon", "alena", "monica", "amy", "thomas", 
"lisbeth", "nicole", "april", "lara", "hazel", "jessie", "brigham", 
"hamed", "christopher", "daniel", "kimberli", "carlton", "james", 
"yasin", "tom", "justin", "ean", "joshua", "rizardo", "katie", 
"zachary", "gregory", "jennifer", "charles", "erik", "amanda", 
"chaveli", "emanuel", "erin", "elizabeth", "crystal", "timothy", 
"christopher", "grace", "dylan", "edith", "mark", "beth", "wendy", 
"natalie", "margaret", "jacob", "suzanne", "chandler", "nyima", 
"robert", "bernadette", "katherinlynn", "timothy", "gary", "jessica", 
"andrew", "kristen", "robert", "sandra", "julie", "richard", 
"guy", "tammy", "ernest", "heidi", "ethan", "john", "hera", "scott", 
"cheryle", "brian", "michele", "edward", "thomas", "philip", 
"dawn", "tommy", "eleanor", "sille", "lori", "lucinda", "ashley", 
"david", "john", "karen", "daniel", "phillip", "leslie", "jeffrey", 
"jamie", "wendy", "anjel", "julie", "allison", "heidi", "chad", 
"jennifer"), middle_name_initial = c("m", "j", "w", "e", "a", 
"a", NA, NA, "w iii", "c", "f", "j", "d", NA, NA, NA, "m", "ann", 
"e", "allen", NA, "e", "l", "j", "l", "b", NA, "l", "r", "e", 
NA, "j", NA, "michael", "lee", "m", "l", "p", NA, NA, "p", "d", 
"s", "m", "r", "l", NA, "w", "a", "m", "a", "a", NA, "p", "b", 
"a", "john", "p", NA, "t", "a", "anne", "x", "j", "t", NA, "k", 
"jerrett", "f", "l", "s", "r", "m", "m", "m", "allen", "e", "b", 
"m", "anne", "a", NA, "s", NA, "j", "j", "d", "a", NA, "b", NA, 
"r", "m", NA, "j", "d", "f", NA, "ian", "j", "l", "p", "m", NA, 
"s", NA, "a", NA, "t", "l", "c", "j", NA, "c", NA, "m", "a", 
NA, "e", "william", "j", "lynn", NA, "m", "m", "m", "leann", 
"k", "p", "j", "h", "k", "marie", "h", "b", "s", "p", "m", "c", 
"a", "w", "m", "david", "c", "j", "nils", "h", "s", "b", NA, 
"paul", NA, "jo", "p", "grace", "a", NA, "m", "m", "r", "k", 
"d", NA, "a", "joseph", "e", "gregory", NA, "j", "m", NA, NA, 
"r", "e", "g", "m", NA, "m", "ann", "r", NA, "a", NA, "alan", 
"a", "l", "m", "w", NA, NA, "lindsey", "richard", "l", "robert", 
"k", "j", "a", "b", NA, "m", "a", NA, "a", "l", "charmaine", 
NA, "e", NA, NA, "arthur", "k", "s", "charles", "d", NA, "m", 
NA, "jean", "dulsky", "c", "b", "m", "j", "r", NA, NA, "a", "a", 
"d", NA, "m", "t", "l", "w", "j", "c", NA, NA, "a", "lee", "e", 
NA, NA, "marie", "m", "r", "j", "w", "m", "h", "donald", NA, 
"a", "s", "i", NA, "d", "j", "l", "e", "a", NA, "e", "a", "a", 
"m", NA, "m", "m", "l", "h", "t", "m", "h", "jean", "k", "f", 
"andrew", NA, "e", "t", "m", "r", "j", "scott", "s", "m", "d", 
"j", "p", "l", "v", "a", "martinez", "a", "l", "a", "b", "d", 
NA, "a", NA, "c", "c", "a", "p", NA, "m", NA, "j", NA, "e", "l", 
"b", "l", "a", "m", NA, NA, NA, NA, "christie", "f", "charles", 
"m", "c", "m", "h", "nicole", "milton", "r", NA, "d", "a", NA, 
"e", "elizabeth", "p", "c", "scott", "t", "l", NA, "steven", 
"n", "t", "e", "d", "ms", "j", NA, "p", "ellen", NA, "despina", 
"m", "l", NA, NA, "marie", NA, "a", "n", NA, "f", "c", "l", NA, 
"e", "w", "e", "d", "s", NA, "l", "s", NA, "b", "a", "a", "m", 
"c", "marie kravetz", "j", NA, "l", "e", "m", "d", NA, "m", "w", 
"lars", NA, "james", "j", "m", "j", "a", "t", "j", "l", "dwyer", 
"m", "a", "marie", "j", "a", "r", "j", NA, NA, "laine", NA, NA, 
"p", "f", "a", "w", "e", "m", "i", "r", NA, "a", "a", NA, "edward", 
"p", "l", "j", "m", NA, NA, NA, NA, "s", "mae", "c", NA, "f", 
NA, "l", "t", "b", "a", "c", "susan", NA, NA, "alexander", NA, 
"a", "m", "e", "p", "t", "mary", "a", "m", "lukas", NA, NA, NA, 
"s", "l", "h", "j", NA, "ross", "morgan", "f", "a", "c", "m", 
"g", "e", "s", "m", "j", NA, "lyster", "ann", "a", "m", "t", 
"edward", "a", "c", "j", "g", "l", "a", "l", "brittany", "j", 
"r", "lynn", "lee", NA), last_name_join = c("emmons", "copen", 
"ehrlich", "bizzari", "brittell", "bruce", "hastry", "petrides", 
"ross", "knox", "kelley", "garbacik", "davenport", "lombardi", 
"mallory", "valentic", "gallant", "nicolai", "hisman", "thompson", 
"boulanger", "tremblay", "bates", "wilson", "wheeler", "clear", 
"carpenter", "harrington", "holland", "hodges", "santarcangelo", 
"pricer", "pilette", "conte", "hartshorn", "hill", "light", "pellegrini", 
"chadderton", "vrancken", "earle", "walker  ii", "bailey", "hilpl", 
"schlueter", "blanchard", "chadwick", "olson", "riley", "merchant", 
"lind", "zeller", "digiammarino", "truex", "schwartz", "brooks", 
"orosz  iii", "hulett", "walker", "sanford", "harris", "molino", 
"aumand  iv", "cronin", "corsones", "pendlebury", "batdorff", 
"braid", "gunn  jr", "giffin", "bertrand", "klamm", "goebel", 
"hebert", "fraysier", "laplante", "suntag", "weening", "frappier", 
"conway", "wood", "sponem", "jerman", "aremburg", "baigelman", 
"green", "reed  jr", "carlisle", "plumpton", "mallette  jr", 
"egizi", "lambert", "teske", "dahlin", "einhorn", "billado", 
"sheffield", "royer", "mcmurdo", "schwartz", "burke", "quesnel", 
"boyden", "winship", "godzik", "cross", "beutel", "hersey", "connor", 
"rowell", "deveneau", "garbacik", "harris", "spicer", "scrubb", 
"lacross", "lyford", "hosford", "nelson", "webb", "deforge  iii", 
"carpenter", "trombly", "laplant", "prentice", "gosselin", "gilpin", 
"rock", "manfredi", "mullin", "maxham", "lamorder", "amiot", 
"howe", "scott", "donahey  iii", "emerson", "gonzalez", "james", 
"chadwick", "baird", "riendeau", "dufault", "tullar", "trudeau", 
"johnson", "raddock", "edson", "euber", "blackhawk", "sainz", 
"jarvis", "baslow", "kenney", "livingston", "quenneville", "cetin", 
"mullan", "mclean", "merrell", "naughton", "lanphear", "chadwick", 
"huntington", "savasta", "unkles", "irish", "mujkanovic", "mason", 
"dees", "leriche", "whitehill", "phelps", "ryan", "schurr", "pickens", 
"cameron", "barbiero", "robillard", "martin", "bernier", "laraway", 
"herrick", "dixon", "corrao", "duke", "harless", "boucher", "ireland", 
"hill", "mclenithan", "mcginnis", "spaulding", "carpenter", "thompson", 
"persons", "boutwell", "young", "weston  jr", "metayer", "rowley", 
"kelley", "damery", "nagy", "wackley", "allen", "guidadailey", 
"stanton", "mable", "laporte", "ingalls", "holt", "mcpartland", 
"katz", "thomason", "rock", "badger", "mellish", "watkins", "nowak", 
"munger", "tousignant", "mcardle", "chaffee", "lorette", "rajewski", 
"devenger", "nuovo", "sabens", "spiese", "owen", "maccallum", 
"shaw", "monteith", "adams", "reurink", "homeyer", "buzzell", 
"keller", "dubois", "schwendler", "berbeco", "kiarsis", "vautrain", 
"pomainville", "cheever", "morway", "pratt", "arthers", "erlbaum", 
"gallipo", "cappetta", "girard", "rowden  ii", "desmet", "baldwin", 
"desmond", "kennison", "riddell", "stagner", "grenier", "brick", 
"teachout", "goldstein", "young", "preston", "hladik", "peyerl", 
"becker", "taylor", "greenwood", "crowley", "dorer", "tarshis", 
"colburn", "dunigan", "riviezzo", "marcoux  jr", "smith", "dudley", 
"cookingham", "twamley", "morrison", "manley", "furgat", "riley", 
"cote", "mason", "fontaine", "bullard", "johnston", "lapierre", 
"hart", "berry", "carter", "beauregard", "remolador", "richardson", 
"salvador", "rogers", "hannan", "silverman", "willey", "langham", 
"rainville", "burgess", "barker", "lawrence", "dilena", "edwards", 
"carr", "ryan", "starr", "sundberg", "whipple", "perry", "martin", 
"berg", "adams", "perryhannam", "pidgeon", "clark", "davis", 
"jensen", "crandall", "winslow", "berliner", "finnegan", "crisante", 
"norway  jr", "jones", "caforiaweeber", "osborne", "dusablon", 
"dimas", "turner", "dumas", "deeghan", "halloran", "zorzi", "mandeville", 
"oshaughnessy", "henkin", "leach", "rutter", "letourneau", "meyer", 
"mccarthy", "coleman", "skriletz", "galbraith", "cupoli", "west", 
"willette", "keating", "wojtkowiak", "corbin", "alimena", "harrington", 
"humphrey", "curtis", "paradiso", "kane", "melcher", "croft", 
"deforge", "mercy", "christian", "brown", "strohmaier", "suckert", 
"danles", "laclair", "schwannoble", "boron", "fitzgerald", "krevetski", 
"mcdonald", "pisani", "moore", "jansch", "marcellus", "saunders", 
"yannacone", "lawrence", "davis", "parr doering", "saltus", "roy", 
"gagulic", "arms", "simoes", "olson", "fazekas", "bixby", "hamlin", 
"federico", "donovan  jr", "brooks", "collins", "soule", "ryan", 
"stepp", "farrell", "steel", "mercier", "malinowski", "kokx", 
"marabella", "green", "sobel", "fay", "mackinnon", "reese", "kone", 
"cadorette", "nicasio", "ward", "fuller", "altman", "ibrahim", 
"evslin", "burkewitz", "briere", "larose", "reynoso", "kinter", 
"manchester", "kalinoski", "isham", "keefer", "fischer", "ciecior", 
"miles", "betz", "singer", "white", "zada", "kilby", "barker", 
"winters", "newton", "sullivan", "ferguson", "irvine", "walker", 
"santamore", "becker", "bovee", "bushee", "corbett", "tsamchoe", 
"farr", "vermette", "fox", "hewitt", "nowak", "burrill", "adams", 
"jensen", "wohland", "maisonet", "campbell", "powell  ii", "tapper", 
"wilt", "patnoe", "ibey", "mclaughlin", "woodward", "bosley", 
"williams", "weeden", "donaldson", "betit", "domey", "fields", 
"daye", "fleury", "walz", "smith", "larsen", "potvin", "holtrop", 
"gregory", "minot", "adams  iv", "lafond", "riley", "edgerley", 
"russell", "martin", "sheltra", "houstonanderson", "robins", 
"green", "keelty", "austin", "gibney", "quero"), suffix = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, " ii", NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, " iii", NA, NA, NA, NA, NA, " iv", 
NA, NA, NA, NA, NA, " jr", NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, " jr", NA, NA, " jr", NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, " iii", NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, " iii", NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, " jr", NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, " ii", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, " jr", NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, " jr", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, " jr", NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, " ii", NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, " iv", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA)), row.names = c(NA, -500L), class = c("tbl_df", "tbl", "data.frame"
))

fl_out <- fastLink(
  dfA = sample_data, 
  dfB = sample_data,
  varnames = c("first_name", 
               "middle_name_initial",
               "last_name_join"),
  stringdist.match = c("first_name", 
                       "middle_name_initial",
                       "last_name_join"),
  partial.match = c("first_name", 
                    "middle_name_initial",
                    "last_name_join"),
  threshold.match = .90,
  n.cores = 5)

matches_out <- getMatches(
  dfA = sample_data, 
  dfB = sample_data,
  fl.out = fl_out,
  threshold.match = .90)

matches_out %>% 
  filter(dedupe.ids == 325)

[Screenshot: example of names grouped as matches on the full dataset (Screen Shot 2022-11-07 at 1 59 25 PM)]

aalexandersson commented 1 year ago

Disclaimer: I am a regular fastLink user, not a developer.

I suggest that you need at least one more linkage concept than Gender and Name.

One of several possible recommendations is to add Address or Date_of_birth; the conceptual algorithm is known as ADGN (Ansolabehere and Hersh 2017). As another example, I often find the 9-digit Social Security Number (SSN) to be very useful as a linkage variable.

I do not see that you use fiscal_year for the linkage. It does not seem to be needed to illustrate your issue.

Reference: Ansolabehere, Stephen, and Eitan D. Hersh (2017). ADGN: An Algorithm for Record Linkage Using Address, Date of Birth, Gender, and Name. Statistics and Public Policy, 4(1), 1-10. DOI: 10.1080/2330443X.2017.1389620

Anders

bengoehring commented 1 year ago

Thanks for getting back to me!

Yes, additional linkage variables would be great, but unfortunately I am often working with just names. If I simply need to increase cut.p when names are my only linkage variables, that is fine. I just wanted to be sure I was not missing something obvious.

Ben

aalexandersson commented 1 year ago

Is your purpose to have as few duplicates as possible? Then you could combine the names into a single, more discriminating variable, which would result in fewer duplicates than now. For example, you could create one name variable from first_name + middle_name_initial + last_name_initial (the first letter of the last name). In the example, the first four names would be "jamesea", "jameseb", "jameseb", and "jamesed".
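A minimal sketch of that combined variable in base R. The toy roster, the helper first_letter(), and the column name name_key are made up for illustration; only the three input column names come from the original post.

```r
# Hypothetical toy roster; column names follow the original post.
roster <- data.frame(
  first_name          = c("james", "james", "james", "james"),
  middle_name_initial = c("e", "e", "edward", "e"),
  last_name_join      = c("adams", "brown", "baker", "davis"),
  stringsAsFactors    = FALSE
)

# Hypothetical helper: first letter of each string.
# NA is mapped to "" so that paste0() does not emit the literal "NA".
first_letter <- function(x) {
  out <- substr(trimws(x), 1, 1)
  out[is.na(out)] <- ""
  out
}

# first name + middle initial + last-name initial
roster$name_key <- paste0(
  roster$first_name,
  first_letter(roster$middle_name_initial),
  first_letter(roster$last_name_join)
)

print(roster$name_key)
```

Whether a key like this should replace the separate name fields in varnames or be added alongside them is a judgment call that depends on the data.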

tedenamorado commented 1 year ago

Thanks @aalexandersson for always providing great advice!

The problem is that for most observations in your sample data, the middle initial is just one letter, so it is basically a categorical variable that can take around 26 possible values (22 in your sample data if you trim middle names to be represented by just one letter).
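One way to check how few distinct values the field takes once it is trimmed to a single letter. This is a sketch on a made-up vector (the variable names middle, initials, and n_distinct_initials are illustrative), not the actual sample data.

```r
# Hypothetical values in the style of middle_name_initial from the post.
middle <- c("m", "j", "w iii", "ann", "allen", NA, "e", "michael")

# Keep only the first letter of each entry; NA stays NA.
initials <- substr(trimws(middle), 1, 1)

# Count the distinct non-missing initials.
n_distinct_initials <- length(unique(initials[!is.na(initials)]))
print(n_distinct_initials)
```

On the real roster, sample_data$middle_name_initial would take the place of middle.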

@bengoehring did you try removing middle_name_initial from the list of variables that will be compared using a string similarity comparator? If you do so, then the comparison for the middle initial will be made in terms of exact matching for that variable.
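A sketch of what that change could look like, reusing the call from the original post. The names link_vars, fuzzy_vars, and fl_exact_mid are illustrative, and the block assumes sample_data from the original post is already in the workspace; the guard simply skips the call when the package or the data is not available.

```r
# Variables to compare; same three fields as in the original post.
link_vars <- c("first_name", "middle_name_initial", "last_name_join")

# Only first and last name use string similarity and partial matching.
# middle_name_initial is left out, so it should be compared exactly.
fuzzy_vars <- setdiff(link_vars, "middle_name_initial")

# Run only if fastLink is installed and sample_data has been created.
if (requireNamespace("fastLink", quietly = TRUE) && exists("sample_data")) {
  fl_exact_mid <- fastLink::fastLink(
    dfA = sample_data,
    dfB = sample_data,
    varnames = link_vars,
    stringdist.match = fuzzy_vars,
    partial.match = fuzzy_vars,
    threshold.match = 0.90
  )
}
```

The remaining arguments are unchanged from the original call, apart from n.cores being omitted, so any difference in the matches should mostly reflect how the middle initial is compared.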

Keep us posted!

Ted

bengoehring commented 1 year ago

Thanks everybody. I really appreciate all of the suggestions. I will try exact matching on the middle name/initial variable and see how that looks.