ropensci / unconf17

Website for 2017 rOpenSci Unconf
http://unconf17.ropensci.org

Tidy record linkage package #98

Open 1danjordan opened 7 years ago

1danjordan commented 7 years ago

Disparate or weakly linked data makes up the majority of the world's data, but we focus mainly on single-source datasets or on combining datasets with definite primary and foreign keys. A number of tidyverse-compliant packages exist for data cleansing and transformation, but not for deduplication or record linkage. The problem of record linkage is complex and well studied, but there are no tools or frameworks that fit nicely into a modern R workflow.

The RecordLinkage package is a brilliant package that does solve this problem, but its API is inconsistent and its data structures are awkward. A tidy record linkage package could build on the lessons learned from RecordLinkage while adhering to the "tidy way of life" and integrating nicely with other tidy tools. I think a package like this could open up a lot of possibilities for researchers and practitioners to work with and combine data they never could before.

ck37 commented 7 years ago

How do you feel about the fastLink package?