inspirehep / beard

Bibliographic Entity Automatic Recognition and Disambiguation

clustering: simplified double_metaphone blocking #66

Closed MSusik closed 9 years ago

MSusik commented 9 years ago

Signed-off-by: Mateusz Susik <mateusz.susik@cern.ch>

MSusik commented 9 years ago

I decided to get rid of the magic constants and simplify the code.

Some recall figures (for reference, recall for LNFI is 0.9815):

Merge to the first surname - in case of multiple surnames, always assign a signature to the block of the first surname.

Merge to the last surname - in case of multiple surnames, always assign a signature to the block of the last surname.
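To make the two baseline strategies concrete, here is a minimal sketch of how signatures could be assigned to blocks. It is not the code from this PR: `phonetic_key`, `split_surnames` and `block_signatures` are hypothetical names, and `phonetic_key` is only a placeholder for a real phonetic encoder such as Double Metaphone. The sketch also assumes every surname string is non-empty.

```python
from collections import defaultdict


def phonetic_key(surname):
    """Hypothetical stand-in for a phonetic encoder such as Double Metaphone.

    It only lowercases and drops non-letters, so it is NOT a real
    phonetic algorithm; swap in a real encoder in practice.
    """
    return "".join(ch for ch in surname.lower() if ch.isalpha())


def split_surnames(full_surname):
    """Split a compound surname on spaces and hyphens (an assumption)."""
    return [part for part in full_surname.replace("-", " ").split() if part]


def block_signatures(surnames, use="first"):
    """Group signature indices by the key of the first or the last surname."""
    blocks = defaultdict(list)
    for index, full_surname in enumerate(surnames):
        parts = split_surnames(full_surname)
        chosen = parts[0] if use == "first" else parts[-1]
        blocks[phonetic_key(chosen)].append(index)
    return dict(blocks)


surnames = ["Garcia Marquez", "Garcia", "Marquez", "Garcia-Marquez"]
first_blocks = block_signatures(surnames, use="first")
last_blocks = block_signatures(surnames, use="last")
```

With "merge to first surname", the compound names land in the same block as plain "Garcia"; with "merge to last surname", they land with plain "Marquez" instead. Either way, one of the single-surname variants ends up separated from the compound form.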

| Threshold | Merge to first surname | Merge to last surname | Previous strategy | This PR |
|---|---|---|---|---|
| 1 | 0.9902 | 0.9894 | 0.9911 | 0.9907 |
| 1000 | 0.992 | 0.9913 | 0.993 | 0.9927 |
| no threshold | 0.9961 | 0.995 | 0.997 | 0.9966 |
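For readers unfamiliar with blocking recall, the sketch below shows one common way to compute it: the fraction of same-author signature pairs that end up in the same block. This assumes a pairwise definition; the thread does not state exactly how the figures above were computed, and `pairwise_blocking_recall` is a hypothetical helper, not a function from this repository.

```python
from itertools import combinations


def pairwise_blocking_recall(true_authors, block_ids):
    """Fraction of same-author signature pairs that share a block.

    true_authors[i] is the ground-truth author of signature i and
    block_ids[i] is the block that signature i was assigned to.
    """
    same_author = 0
    same_block = 0
    for i, j in combinations(range(len(true_authors)), 2):
        if true_authors[i] == true_authors[j]:
            same_author += 1
            if block_ids[i] == block_ids[j]:
                same_block += 1
    # With no same-author pairs there is nothing to miss.
    return same_block / same_author if same_author else 1.0


# Toy data: author "a" has three signatures, one of them in another block.
true_authors = ["a", "a", "a", "b"]
block_ids = ["x", "x", "y", "y"]
recall = pairwise_blocking_recall(true_authors, block_ids)
```

In this toy example only one of the three same-author pairs shares a block, so the recall is 1/3. Under this definition, any signature placed in the wrong block can never be recovered by clustering within blocks, which is why blocking recall bounds the final result.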

EDIT: when I split the data into training and test sets, only one result differs: with the threshold set to 1000, the score on the training set is 0.9943. This makes sense: when there is less data, fewer blocks are split.

glouppe commented 9 years ago

That's great :)

+1 for merge once you have fixed Travis

MSusik commented 9 years ago

Ready to merge.

glouppe commented 9 years ago

Merging, thanks a lot