Blocking with double-metaphone

MSusik commented 9 years ago

Ready for review!

MSusik commented 9 years ago

Some results:

Whole algorithm with the model trained on the blocks from the old algorithm. Old algorithm result:

Number of blocks = 9367
True number of clusters 10891
Number of computed clusters 11966
B^3 F-score (overall) = 0.968795345652
B^3 F-score (train) = 0.97601658209
B^3 F-score (test) = 0.968486484763

508.8 min

Results for the old blocking, without clustering:

b3: 0.9399
paired: 0.8860
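
For readers unfamiliar with the two metrics quoted here, this is a minimal sketch of how B^3 and paired scores can be computed from two labelings. It is not the evaluation code used in this PR: the function names `b3_scores` and `paired_scores` are hypothetical, and the sketch assumes the figures above are F-scores computed in the standard way.

```python
from collections import Counter
from itertools import combinations


def _f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def b3_scores(true_labels, pred_labels):
    """Return (precision, recall, F-score) under the B^3 definition."""
    n = len(true_labels)
    true_sizes = Counter(true_labels)
    pred_sizes = Counter(pred_labels)
    # Size of the overlap between each (true cluster, predicted cluster) pair.
    overlap = Counter(zip(true_labels, pred_labels))

    # Each item contributes overlap / predicted-cluster size to precision
    # and overlap / true-cluster size to recall; both are averaged over items.
    pairs = list(zip(true_labels, pred_labels))
    precision = sum(overlap[t, p] / pred_sizes[p] for t, p in pairs) / n
    recall = sum(overlap[t, p] / true_sizes[t] for t, p in pairs) / n
    return precision, recall, _f_score(precision, recall)


def paired_scores(true_labels, pred_labels):
    """Return (precision, recall, F-score) over pairs of items."""
    same_true = same_pred = same_both = 0
    for i, j in combinations(range(len(true_labels)), 2):
        in_true = true_labels[i] == true_labels[j]
        in_pred = pred_labels[i] == pred_labels[j]
        same_true += in_true
        same_pred += in_pred
        same_both += in_true and in_pred
    # Guard against labelings that contain no same-cluster pairs at all.
    precision = same_both / same_pred if same_pred else 0.0
    recall = same_both / same_true if same_true else 0.0
    return precision, recall, _f_score(precision, recall)
```

Applied to a blocking instead of a final clustering, the recall component says how many signatures of the same author end up in the same block, which is the quantity that bounds what the clustering step can recover.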

New algorithm results:

Number of blocks = 4960
True number of clusters 10891
Number of computed clusters 11035
B^3 F-score (overall) = 0.963343468223
B^3 F-score (train) = 0.975473769019
B^3 F-score (test) = 0.962666989591

543.9 min

Note that if we train the distance model on the new clusters, the score for the new algorithm might be much better.

Recall of the blocking (preclustering) step. Rows that do not mention a threshold were run without one.

| Algorithm | B^3 recall | Paired recall |
| --- | --- | --- |
| old | 0.9816 | 0.9755 |
| new | 0.9976 | 0.9982 |
| new (threshold 1000) | 0.9942 | 0.9908 |
| new (threshold 100) | 0.9920 | 0.9885 |
| new with surname split on uppercase¹ | 0.9973 | 0.9975 |
| soundex instead of dm² | 0.9980 | 0.9984 |
| nysiis instead of dm | 0.9968 | 0.9971 |
| dm from `fuzzy` package³ | 0.9978 | 0.9983 |
| without any phonetic algorithm | 0.9943 | 0.9932 |
| without full multiple surnames match⁴ | 0.9963 | 0.9966 |
| take into account number of first names matched | 0.9977 | 0.9982 |

¹ for example MacDonald -> (double_metaphone('mac'), double_metaphone('donald'))

² note that the clusters were significantly bigger

³ !!! note that two implementations of double metaphone differ

⁴ If there are two signatures with the same combination of multiple surnames appearing one after another, don't assign the second one to the same block as the first one, but instead assign the second signature as if there was no first one.
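
The uppercase split from footnote ¹ can be sketched roughly as follows. This is only an illustration, not the code in this PR: `split_surname_on_uppercase` and `block_key` are hypothetical names, the regular expression handles ASCII capitals only, and the `phonetic` argument stands in for whichever phonetic encoder is used (double metaphone, soundex, nysiis). It defaults to the identity so the sketch has no third-party dependency.

```python
import re


def split_surname_on_uppercase(surname):
    """Split a surname into lowercase parts at each ASCII uppercase letter."""
    # An all-caps surname carries no internal capitalisation signal,
    # so it is kept as a single part.
    if surname.isupper():
        return [surname.lower()]
    # Either an uppercase letter followed by non-uppercase characters,
    # or a leading run of non-uppercase characters.
    parts = re.findall(r"[A-Z][^A-Z]*|[^A-Z]+", surname)
    return [part.lower() for part in parts]


def block_key(surname, phonetic=lambda part: part):
    """Build a blocking key by encoding each surname part separately.

    `phonetic` is a placeholder for a real phonetic encoder.
    """
    return tuple(phonetic(part) for part in split_surname_on_uppercase(surname))
```

With a real encoder passed as `phonetic`, `block_key("MacDonald", phonetic=...)` would encode `'mac'` and `'donald'` separately, as in the footnote.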

Results for new distance models. New algorithm, small dataset:

Number of blocks = 446
True number of clusters 887
Number of computed clusters 879
B^3 F-score (overall) = 0.972288059952
B^3 F-score (train) = 0.979280751524
B^3 F-score (test) = 0.971821768671

Results for the whole algorithm! New blocking, 700000 claimed papers, new distance model.

Number of blocks = 4957
True number of clusters 10891
Number of computed clusters 10494
B^3 F-score (overall) = 0.97078952847
B^3 F-score (train) = 0.981682821495
B^3 F-score (test) = 0.970217332014

glouppe commented 9 years ago

General comment about naming and code organization: I find the term preclustering a bit too vague. We should emphasize that your algorithm is meant to be a blocking function, as defined in BlockClusterer.

What would you think of the following file organisation:

blocking.py : BlockClusterer, SingleClusterer
blocking_funcs.py :
- block_single (moved from blocking._single)
- block_last_name_first_initial (last name, first initial blocking, moved from the example)
- block_double_metaphone (your function)

and then import all block_* functions from blocking_funcs at the module level.
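
As an illustration of what a function in the proposed blocking_funcs.py could look like, here is a minimal sketch of last name, first initial blocking. It is not the function from the example in the repository: the input format (strings of the form "Last, First") and the return type (a list with one block label per input) are assumptions made for this sketch, and the real signature may differ.

```python
def block_last_name_first_initial(names):
    """Return one block label per name given as "Last, First".

    Assumption of this sketch: names are plain strings; a name without
    a comma is treated as a last name with no first name.
    """
    labels = []
    for name in names:
        last, _, first = name.partition(",")
        last = last.strip().lower()
        # Only the first character of the first name is kept.
        initial = first.strip()[:1].lower()
        labels.append(f"{last} {initial}".strip())
    return labels
```

Signatures that receive the same label fall into the same block, so "Doe, John" and "Doe, Jane" would be clustered together while "Smith, Anna" would not be compared against either.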

MSusik commented 9 years ago

I fixed a few things, improved the docstrings, and added a result for the algorithm without any phonetic algorithm (see the table).

glouppe commented 9 years ago

To make the code easier to understand for others, I think it would be quite helpful to use a limited and well-defined vocabulary when talking about names (surnames, first names, last names, given names, family names, etc.).

glouppe commented 9 years ago

Besides my comments regarding naming, this looks fine to me. Thanks for the great work!

Just waiting for the results to give my +1

glouppe commented 9 years ago

As per #41, I had to move tests into a separate directory in order to fix the Travis build. Unfortunately, this breaks your PR... Could you move your tests around in tests/? Thanks :)

glouppe commented 9 years ago

You can also squash all commits

inspirehep / beard

Blocking with double-metaphone #35