clustering: simplified double_metaphone blocking

MSusik commented 9 years ago

Signed-off-by: Mateusz Susik mateusz.susik@cern.ch

MSusik commented 9 years ago

I decided to get rid of the magic constants and simplify the code.

Some recall figures (recall for LNFI: 0.9815):

Merge to the first surname - when a signature has multiple surnames, always assign it to the block of the first surname.

Merge to the last surname - when a signature has multiple surnames, always assign it to the block of the last surname.
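
The two baseline strategies above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the helper names (`surname_tokens`, `block_key`, `block_signatures`) are hypothetical, names are assumed to be formatted as "Surnames, Given names", and a lowercased surname stands in for the double metaphone key that the real blocking uses.

```python
def surname_tokens(full_name):
    """Return lowercase surname tokens from a 'Surnames, Given names' string."""
    surnames = full_name.split(",")[0]
    # Treat hyphenated surnames as multiple surnames (an assumption).
    return surnames.replace("-", " ").lower().split()


def block_key(full_name, strategy="first"):
    """Pick the block key: the first or the last surname token.

    A real implementation would apply a phonetic encoding (for example
    double metaphone) to the token; here the raw lowercase token is used.
    """
    tokens = surname_tokens(full_name)
    if not tokens:
        return ""
    return tokens[0] if strategy == "first" else tokens[-1]


def block_signatures(names, strategy="first"):
    """Group names into blocks keyed by the chosen surname token."""
    blocks = {}
    for name in names:
        blocks.setdefault(block_key(name, strategy), []).append(name)
    return blocks


names = ["Garcia Marquez, Gabriel", "Garcia, G.", "Marquez, Gabriel"]
first_blocks = block_signatures(names, strategy="first")
last_blocks = block_signatures(names, strategy="last")
```

With these sample names, "Garcia Marquez, Gabriel" shares a block with "Garcia, G." under the first-surname strategy and with "Marquez, Gabriel" under the last-surname strategy. Either fixed choice can separate signatures that belong to the same author.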

| Threshold | Merge to first surname | Merge to last surname | Previous strategy | This PR |
|---|---|---|---|---|
| 1 | 0.9902 | 0.9894 | 0.9911 | 0.9907 |
| 1000 | 0.992 | 0.9913 | 0.993 | 0.9927 |
| No threshold | 0.9961 | 0.995 | 0.997 | 0.9966 |

EDIT: when I split the data into training and test sets, there is only one difference in the results: with the threshold set to 1000, the score on the training set is 0.9943. That makes sense - when the dataset is smaller, fewer blocks are split.
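
To illustrate why the threshold affects the result, here is a minimal sketch of threshold-based block splitting. It rests on assumptions not stated in this thread: blocks larger than the threshold are split by the first given-name initial, `threshold=None` corresponds to the "no threshold" case, and `split_large_blocks` is a hypothetical helper rather than a function from the codebase.

```python
def split_large_blocks(blocks, threshold=None):
    """Split every block with more than `threshold` members.

    `blocks` maps a surname key to a list of (surname, given_names) tuples.
    Oversized blocks are split by the first given-name initial (assumed
    splitting rule). With threshold=None no block is ever split.
    """
    result = {}
    for key, members in blocks.items():
        if threshold is None or len(members) <= threshold:
            # Block is small enough: keep it intact.
            result[key] = list(members)
            continue
        for surname, given in members:
            initial = given[:1].lower()
            result.setdefault(key + "_" + initial, []).append((surname, given))
    return result


blocks = {
    "smith": [("Smith", "John"), ("Smith", "Jane"), ("Smith", "Adam")],
    "li": [("Li", "Wei")],
}
unsplit = split_large_blocks(blocks, threshold=3)
split = split_large_blocks(blocks, threshold=2)
```

Under these assumptions, the "smith" block stays whole at a threshold of 3 but is split into "smith_j" and "smith_a" at a threshold of 2. Each split can separate signatures of the same author, so recall can only stay equal or drop as more blocks are split. A smaller dataset has fewer blocks above a fixed threshold, hence fewer splits.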

glouppe commented 9 years ago

That's great :)

+1 for merge once you have fixed Travis

MSusik commented 9 years ago

Ready to merge.

glouppe commented 9 years ago

Merging, thanks a lot

inspirehep / beard

clustering: simplified double_metaphone blocking #66