larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0
613 stars, 194 forks

Deduplication and record linkage #259

Open xinelim opened 5 years ago

xinelim commented 5 years ago

Hi, given two datasets, is it possible to deduplicate each dataset and then perform record linkage across the two? Please advise.

uderline commented 5 years ago

Hi!

You will need to do this step by step: deduplicate each dataset individually (writing the result to a new file, for example) and then link the two deduplicated datasets. There is no way to do both at the same time.

At one point I tried to do something similar from Python, by reading the matches/links from the console output of the command java no.priv.garshol.Duke .... config.xml. It was a waste of time. You should go directly with Java and use the MatchListener classes, and maybe write your own if you need to.
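To make the two-step order concrete, here is a minimal, library-free sketch in plain Java. It does not use Duke's API. The class DedupThenLink, the LinkListener callback, and the key() normalisation are all hypothetical names made up for illustration, and the exact-key matching is a simplistic stand-in for Duke's configurable comparators. It only shows the shape of the workflow: deduplicate each dataset on its own, then link the two cleaned datasets and receive each link through a callback, similar in spirit to a match listener.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class DedupThenLink {

    // Hypothetical callback for illustration only; this is NOT Duke's MatchListener.
    interface LinkListener {
        void onLink(String left, String right);
    }

    // Assumed match key: trimmed, lower-cased, runs of whitespace collapsed.
    static String key(String record) {
        return record.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Step 1: deduplicate one dataset, keeping the first record seen per key.
    static List<String> deduplicate(List<String> records) {
        Map<String, String> firstByKey = new LinkedHashMap<>();
        for (String r : records) {
            firstByKey.putIfAbsent(key(r), r);
        }
        return new ArrayList<>(firstByKey.values());
    }

    // Step 2: link two already-deduplicated datasets and report each link.
    static void link(List<String> left, List<String> right, LinkListener listener) {
        Map<String, String> rightByKey = new LinkedHashMap<>();
        for (String r : right) {
            rightByKey.putIfAbsent(key(r), r);
        }
        for (String l : left) {
            String match = rightByKey.get(key(l));
            if (match != null) {
                listener.onLink(l, match);
            }
        }
    }

    public static void main(String[] args) {
        // Each dataset is cleaned separately before any cross-dataset linking.
        List<String> a = deduplicate(Arrays.asList("Acme Corp", "acme  corp", "Globex"));
        List<String> b = deduplicate(Arrays.asList("ACME CORP", "Initech", "initech "));
        link(a, b, (l, r) -> System.out.println(l + " <-> " + r));
    }
}
```

With Duke itself, the same order would apply: one deduplication run per dataset, then a separate record-linkage run over the two outputs, with a listener collecting the results instead of parsing console output.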