Closed: jrtran closed this issue 3 years ago
Thanks for raising these important points! Being from Latin America, I know firsthand how difficult it is to match on names when naming conventions are not well served by the usual string distance comparators.
I think the first point you raised can be approached if you first try to separate observations by gender. If gender is not observed, you can use the gender() function from the gender package in R to try to assign a likely gender. Of course, it is not perfect, but it is a place to start.
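For example, something along these lines might work as a first pass. This is only a rough sketch: the sample names, the 0.9 cutoff, and the choice of the "ssa" method are assumptions you would want to revisit for your own data.

```r
# Rough sketch: infer a likely gender from first names with the gender package,
# then use it to separate observations before matching.
# Note: the "ssa" method may prompt you to install the companion genderdata package.
library(gender)

first_names <- c("Maria", "Jose", "Alex")

inferred <- gender(first_names, method = "ssa")

# Keep only confident assignments; ambiguous names can be matched without this split
confident <- inferred[pmax(inferred$proportion_male, inferred$proportion_female) >= 0.9,
                      c("name", "gender")]
confident
```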
The second point is all about parsing the information contained in a name. probablepeople is a Python library that has been really useful to us when parsing names.
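If you want to stay in R, it can be called through reticulate. A minimal sketch, assuming probablepeople is installed in the Python environment that reticulate points to (the example name is made up):

```r
# Rough sketch: calling the Python probablepeople library from R via reticulate.
library(reticulate)

pp <- import("probablepeople")

# tag() returns the parsed components plus the detected type ("Person" or "Corporation")
pp$tag("Juan Carlos de la Cruz")

# parse() returns token-by-token labels instead of a consolidated result
pp$parse("de la Cruz, Juan Carlos")
```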
Finally, transpositions of a name are incredibly difficult, but again I feel that if you train probablepeople on a few instances where the name comes in reversed order, then you will be able to fix many of these cases.
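In the meantime, one workaround that does not require retraining anything, and is not a fastLink feature, just a pre-processing heuristic, is to score both orderings of a name and keep the larger similarity:

```r
# Rough sketch: guard against first/last name transposition by scoring both
# orderings with Jaro-Winkler and taking the maximum similarity.
library(stringdist)

name_sim_any_order <- function(first_a, last_a, first_b, last_b, p = 0.1) {
  straight <- stringsim(paste(first_a, last_a), paste(first_b, last_b),
                        method = "jw", p = p)
  flipped  <- stringsim(paste(first_a, last_a), paste(last_b, first_b),
                        method = "jw", p = p)
  pmax(straight, flipped)
}

# Returns a similarity near 1 even though the fields are swapped
name_sim_any_order("Garcia", "Luis", "Luis", "Garcia")
```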
We will be working hard this summer to address some of these issues within the fastLink framework.
If anything else comes up, please let us know.
All my best,
Ted
@tedenamorado We should check out this package: https://cran.r-project.org/web/packages/humaniformat/humaniformat.pdf. It might be helpful for us.
Indeed, humaniformat is a fantastic tool to parse names when the strings containing them have some structure. I will look into this.
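For anyone landing on this thread later, a minimal humaniformat sketch (the example strings are made up):

```r
# Rough sketch: normalize and parse free-text names with humaniformat.
library(humaniformat)

raw <- c("de la Cruz, Juan Carlos", "Dr. Maria Elena Garcia Lopez")

# Reorder "Last, First" strings into "First Last" form before parsing
clean <- format_reverse(raw)

# parse_names() returns a data frame with salutation, first_name, middle_name,
# last_name, suffix, and full_name columns
parse_names(clean)
```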
The humaniformat package was used with fastLink in this paper: Measuring Public Opinion via Digital Footprints (2019, p. 7).
Thanks again for all your work on fastLink, much appreciated!
The newer R package peopleparser also looks interesting: https://github.com/Nonprofit-Open-Data-Collective/peopleparser.
Love to see the active support here. I've worked a bit with the gender package before in a different context, so that should be worth a shot. I also thought probablepeople looked familiar, and it turns out that we've used the authors' address parser before. I might try the native R methods first, but integrating Python into our workflow is definitely an option. In any case, since the names in our data are already split into fields, it looks like we might have to recombine them and then run the parser(s). Thank you all for the suggestions. I'll close this issue, since it seems this problem should be addressed with better parsing rather than better string comparison functions.
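If it helps anyone else, the recombine-then-parse step could look something like this. The data frame and column names here are hypothetical, not from our actual data:

```r
# Rough sketch: recombine pre-split name fields, re-parse them, and use the
# standardized components as the matching variables.
library(humaniformat)

df <- data.frame(first = c("Juan Carlos", "Maria"),
                 last  = c("de la Cruz", "Garcia Lopez"),
                 stringsAsFactors = FALSE)

full   <- trimws(paste(df$first, df$last))
parsed <- parse_names(full)

df$first_std <- parsed$first_name
df$last_std  <- parsed$last_name
```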
Greatly appreciate the work on this package. Our data deals with a wide diversity of names (Hispanic, Asian, etc.), and we've found that the string distance methods included with fastLink have occasional issues:
The first case is especially troublesome, since Jaro-Winkler tends to emphasize the initial characters and the edit distance is very small. Is it possible to implement custom string comparison functions, or to adjust the current default options to account for these cases? It would also help to have a reweighting option for last names, so that we could downweight the posterior match probabilities of very common last names and reduce false positives. Thank you!
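To make the first case concrete, here is a small illustration using the standalone stringdist package; the names are made up, and the fastLink arguments mentioned in the comments are just the ones I'm aware of, not a request for a specific API:

```r
# Rough sketch of the issue: the Winkler prefix bonus (p = 0.1) inflates similarity
# for names that merely share their opening characters; p = 0 is plain Jaro.
library(stringdist)

stringsim("Rosa Garcia", "Rosario Garcia", method = "jw", p = 0.1)  # Jaro-Winkler
stringsim("Rosa Garcia", "Rosario Garcia", method = "jw", p = 0)    # plain Jaro

# On the fastLink side, stringdist.method ("jw", "jaro", "lv") and the cut.a / cut.p
# thresholds seem to be the closest existing knobs; I believe some versions also
# expose a jw.weight argument for the prefix weight.
```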