Disparate or weakly linked data makes up the majority of the world's data, but we focus mainly on single-source datasets or on combining datasets with definite primary and foreign keys. A number of tidyverse-compliant packages exist for data cleansing and transformation, but not for deduplication or record linkage. The problem of record linkage is complex and well studied, but there are no tools or frameworks that fit nicely into a modern R workflow.
The RecordLinkage package is a brilliant package that does solve this problem, but its API is inconsistent and its data structures are awkward. A tidy record linkage package could build on the lessons learned from RecordLinkage while adhering to the "tidy way of life" and integrating nicely with other tidy tools. I think a package like this could open up a lot of possibilities for researchers and practitioners to work with and combine data they never could before.