Custom/adjusted string comparison functions

jrtran commented 3 years ago

Greatly appreciate the work on this package. Our data deals with a wide diversity of names (Hispanic, Asian, etc.), and we've found that the string distance methods included with fastLink have occasional issues:

Masculine and feminine versions of names being matched, like "Francisco" and "Francisca"
Generation pairs being matched, like "Willie Jr" and "Willie" (although this may potentially be resolved by proper data cleaning)
Transposed names are difficult to match, and transpositions are not uncommon with Asian names, where the family name often comes first

The first case is especially troublesome, since Jaro-Winkler tends to emphasize the initial characters and the edit distance is very small. Is it possible to implement custom string comparison functions, or to adjust the current default options to account for these cases? It would also help to have a name reweighting option for last names, since we could downweight the posterior matches of very common last names and reduce false positives. Thank you!
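To make the first case concrete, here is a minimal pure-Python sketch of Jaro-Winkler similarity. It is not fastLink's implementation; the function names, the prefix scale `p = 0.1`, and the 4-character prefix cap are the conventional textbook choices and are assumptions here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1] (illustrative implementation)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the shared prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


score = jaro_winkler("francisco", "francisca")
print(round(score, 3))  # 0.956
```

Because the two names share their first four characters and differ only in the last one, the prefix boost lifts a Jaro score of about 0.926 to about 0.956, which is why a single high cutoff has trouble separating them.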

tedenamorado commented 3 years ago

Thanks for raising these important points! Being from Latin America, I know firsthand how difficult it is to match on names when naming conventions are not handled well by the usual string distance comparators.

I think the first point you raised can be approached if you first try to separate observations by gender. If gender is not observed, you can use the gender function in R to try to assign gender manually. Of course, it is not perfect, but it is a place to start.
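As a rough illustration of separating observations by gender, the sketch below generates candidate pairs only within the same gender block, so "Francisco" and "Francisca" are never compared. The record layout (dicts with `first` and `gender` keys) and the function name `candidate_pairs` are hypothetical, and this is plain Python rather than anything fastLink provides.

```python
from collections import defaultdict


def candidate_pairs(records_a, records_b, block_key="gender"):
    """Yield (a, b) pairs only for records that share the same block value."""
    blocks_b = defaultdict(list)
    for rec in records_b:
        blocks_b[rec[block_key]].append(rec)
    for rec_a in records_a:
        # Records whose block value is absent from B produce no pairs.
        for rec_b in blocks_b.get(rec_a[block_key], []):
            yield rec_a, rec_b


# Hypothetical toy data; gender may be observed or imputed beforehand.
data_a = [{"first": "Francisco", "gender": "M"}]
data_b = [
    {"first": "Francisca", "gender": "F"},
    {"first": "Francisco", "gender": "M"},
]

pairs = list(candidate_pairs(data_a, data_b))
print([(a["first"], b["first"]) for a, b in pairs])
```

The trade-off is that any error in an observed or imputed gender removes the true match from consideration, so this works best when the gender field is reasonably reliable.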

The second point is all about parsing the information contained in a name. probablepeople is a Python library that has been really useful to us when parsing names.
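For the simple "Willie Jr" versus "Willie" case, a small regular expression may be enough before reaching for a full parser. The sketch below is a stand-in, not probablepeople; the suffix list is an assumed, non-exhaustive example and `split_generation_suffix` is a hypothetical helper name.

```python
import re

# Assumed, non-exhaustive list of generational suffixes.
_SUFFIX = re.compile(r"[,\s]+(jr|sr|ii|iii|iv)\.?$", re.IGNORECASE)


def split_generation_suffix(name: str):
    """Return (name_without_suffix, suffix_or_None)."""
    cleaned = name.strip()
    m = _SUFFIX.search(cleaned)
    if m is None:
        return cleaned, None
    # Keep everything before the separator that precedes the suffix.
    return cleaned[:m.start()], m.group(1).lower()


print(split_generation_suffix("Willie Jr"))    # ('Willie', 'jr')
print(split_generation_suffix("Willie, Jr."))  # ('Willie', 'jr')
print(split_generation_suffix("Willie"))       # ('Willie', None)
```

Keeping the suffix as its own field, rather than discarding it, leaves the option of requiring suffix agreement so that father and son are not linked.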

Finally, transpositions of a name are incredibly difficult, but again I feel if you train probablepeople on a few instances where the name comes in a reverse format, then you will be able to fix many of these instances.
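A cheaper alternative to training a parser, sketched here under the assumption that token order carries no information you need, is to compare names on an order-insensitive key. This is a different technique from the probablepeople approach above, and `token_sort_key` is a hypothetical helper name.

```python
def token_sort_key(name: str) -> str:
    """Lowercase the name, drop commas, and sort its tokens alphabetically."""
    tokens = name.lower().replace(",", " ").split()
    return " ".join(sorted(tokens))


# Hypothetical example names: both orderings map to the same key.
print(token_sort_key("Tran Minh") == token_sort_key("Minh Tran"))  # True
```

The key can then be passed to the usual string comparison. The cost is that genuinely different people whose names are reversals of each other become indistinguishable on this field.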

We will be working hard this summer on addressing some of these issues within the fastLink framework.

If anything comes up, please let us know.

All my best,

Ted

kosukeimai commented 3 years ago

@tedenamorado We should check out this package https://cran.r-project.org/web/packages/humaniformat/humaniformat.pdf It might be helpful for us.

tedenamorado commented 3 years ago

Indeed, humaniformat is a fantastic tool to parse names when the strings containing them have some structure. I will look into this.

aalexandersson commented 3 years ago

The package humaniformat was used with fastLink in this paper: Measuring Public Opinion via Digital Footprints (2019, page 7).

Thanks again for all your work on fastLink, much appreciated!

aalexandersson commented 3 years ago

The newer R package peopleparser also looks interesting: https://github.com/Nonprofit-Open-Data-Collective/peopleparser

jrtran commented 3 years ago

Love to see the active support here. I've worked a bit with the gender package before in a different context, so that should be worth a shot. I also thought probablepeople looked familiar, and it turns out that we've used the authors' address parser before. I might try the native R methods first, but integrating Python into our workflow is definitely an option. In any case, since the names in our data are already split up into fields, it looks like we might have to recombine them and then run the parser(s). Thank you all for the suggestions, and I'll close this issue as it seems like this problem should be addressed with better parsing rather than better string comparison functions.

kosukeimai / fastLink

Custom/adjusted string comparison functions #51