moj-analytical-services / splink_demos

Interactive notebooks containing demonstration code of the splink library

Data labelling/annotation #17

Open alicja-januszkiewicz opened 3 years ago

alicja-januszkiewicz commented 3 years ago

I was wondering whether it would be feasible and appropriate to include some discussion and an overview of how one would go about generating a labelled dataset for model evaluation. In particular, how would one sample the dataset to ensure that enough true matches are included, while still keeping the sample representative? Could splink assist in this process via blocking?

As a user coming from deterministic record linkage, I'm somewhat familiar with the typical linkage workflow (cleaning, indexing, calculating comparison vectors, classifying), but the evaluation step is new to me and presents a challenge. Perhaps some references to discussions on this topic would be helpful to new users like me?
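
To make the sampling question concrete, here is a minimal sketch of one possible approach: use blocking to generate candidate pairs, stratify them by a crude similarity score, and sample a fixed number of pairs per stratum for labelling. This is plain Python and not splink's API; the function names (`candidate_pairs`, `crude_score`, `stratified_sample`), the toy records, and the exact-agreement score are all assumptions made for illustration. Each sampled pair carries a weight (stratum size divided by number sampled), so estimates computed from the labels can be reweighted back to the full set of candidate pairs.

```python
import random
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Blocking: only pair records that share the same value of block_key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[block_key]].append(rec)
    for members in blocks.values():
        yield from combinations(members, 2)


def crude_score(a, b, fields):
    """Fraction of fields that agree exactly (a stand-in for a real match score)."""
    return sum(a[f] == b[f] for f in fields) / len(fields)


def stratified_sample(pairs, score_fn, n_per_stratum, n_bins=4, seed=0):
    """Sample up to n_per_stratum pairs from each score bin, with weights."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for a, b in pairs:
        s = score_fn(a, b)
        bin_index = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        strata[bin_index].append((a, b, s))

    sample = []
    for bin_index, members in sorted(strata.items()):
        chosen = rng.sample(members, min(n_per_stratum, len(members)))
        weight = len(members) / len(chosen)  # candidate pairs represented by each sampled pair
        for a, b, s in chosen:
            sample.append(
                {"id_l": a["id"], "id_r": b["id"], "score": s, "weight": weight}
            )
    return sample


# Hypothetical toy data, for illustration only.
records = [
    {"id": 1, "surname": "smith", "first_name": "john", "dob": "1990-01-01"},
    {"id": 2, "surname": "smith", "first_name": "jon", "dob": "1990-01-01"},
    {"id": 3, "surname": "smith", "first_name": "anna", "dob": "1985-05-05"},
    {"id": 4, "surname": "jones", "first_name": "mary", "dob": "1970-02-02"},
    {"id": 5, "surname": "jones", "first_name": "mary", "dob": "1970-02-02"},
    {"id": 6, "surname": "jones", "first_name": "peter", "dob": "1999-09-09"},
]

pairs = list(candidate_pairs(records, "surname"))
sample = stratified_sample(
    pairs,
    lambda a, b: crude_score(a, b, ["first_name", "dob"]),
    n_per_stratum=2,
)
```

Sampling equally from each score bin over-represents the rare high-scoring pairs (where most true matches are likely to be), while the weights allow the labelled sample to stand in for all candidate pairs. One limitation of this sketch: pairs excluded by the blocking rule are never sampled, so any true matches outside the blocks would not show up in the labelled data.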