Closed MSusik closed 9 years ago
Some results:
Whole algorithm with the model trained on the blocks from the old algorithm. Old algorithm result:
Number of blocks = 9367
True number of clusters 10891
Number of computed clusters 11966
B^3 F-score (overall) = 0.968795345652
B^3 F-score (train) = 0.97601658209
B^3 F-score (test) = 0.968486484763
508.8 min
Results for the old blocking, without clustering:
b3: 0.9399
paired: 0.8860
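For readers unfamiliar with the two metrics reported here, the following is a minimal sketch of how B^3 and paired (pairwise) precision, recall and F-score are usually defined. It assumes flat label lists and the textbook definitions; the function names `b3_scores` and `paired_scores` are hypothetical and the evaluation code used for the numbers in this thread may differ in details.

```python
from collections import Counter
from itertools import combinations


def b3_scores(true_labels, pred_labels):
    """Return (precision, recall, F-score) under the standard B^3 definition."""
    n = len(true_labels)
    true_sizes = Counter(true_labels)
    pred_sizes = Counter(pred_labels)
    # Size of the intersection between each (true cluster, predicted cluster) pair.
    overlap = Counter(zip(true_labels, pred_labels))

    pairs = list(zip(true_labels, pred_labels))
    # Per-element precision: share of the element's predicted cluster
    # that belongs to the element's true cluster. Recall is the converse.
    precision = sum(overlap[(t, p)] / pred_sizes[p] for t, p in pairs) / n
    recall = sum(overlap[(t, p)] / true_sizes[t] for t, p in pairs) / n
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def paired_scores(true_labels, pred_labels):
    """Return (precision, recall, F-score) over all unordered pairs of elements."""
    same_true = same_pred = same_both = 0
    for i, j in combinations(range(len(true_labels)), 2):
        t = true_labels[i] == true_labels[j]
        p = pred_labels[i] == pred_labels[j]
        same_true += t
        same_pred += p
        same_both += t and p
    # Convention assumed here: with no positive pairs, the score is 1.0.
    precision = same_both / same_pred if same_pred else 1.0
    recall = same_both / same_true if same_true else 1.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

The pairwise variant is quadratic in the number of elements, so on large datasets it would normally be computed from cluster sizes rather than by enumerating pairs.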
New algorithm results:
Number of blocks = 4960
True number of clusters 10891
Number of computed clusters 11035
B^3 F-score (overall) = 0.963343468223
B^3 F-score (train) = 0.975473769019
B^3 F-score (test) = 0.962666989591
543.9 min
Note that if we train the distance model on the new clusters, the score for the new algorithm might be much better.
Recall for the blocking (preclustering) step. Rows that do not state a threshold were run without one.
Algorithm | B^3 recall | paired recall |
---|---|---|
old | 0.9816 | 0.9755 |
new | 0.9976 | 0.9982 |
new (threshold 1000) | 0.9942 | 0.9908 |
new (threshold 100) | 0.9920 | 0.9885 |
new with surname split on uppercase [1] | 0.9973 | 0.9975 |
soundex instead of dm [2] | 0.9980 | 0.9984 |
nysiis instead of dm | 0.9968 | 0.9971 |
dm from fuzzy package [3] | 0.9978 | 0.9983 |
without any phonetic algorithm | 0.9943 | 0.9932 |
without full multiple surnames match [4] | 0.9963 | 0.9966 |
take into account number of first names matched | 0.9977 | 0.9982 |

[1] For example, MacDonald -> (double_metaphone('mac'), double_metaphone('donald')).

[2] Note that the clusters were significantly bigger.

[3] Note that the two implementations of double metaphone differ!

[4] If two signatures with the same combination of multiple surnames appear one after another, don't assign the second one to the same block as the first; instead, assign the second signature as if the first one did not exist.
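The surname split from the MacDonald example can be sketched as follows. This is only an illustration: the helper names are hypothetical, the regular expression handles ASCII letters only, and the `phonetic` argument is a placeholder for whichever phonetic encoder is used (double metaphone in the runs above), defaulting to the identity so the snippet has no external dependencies.

```python
import re


def split_surname_on_uppercase(surname):
    """Split a surname at each lowercase-to-uppercase boundary.

    'MacDonald' -> ['mac', 'donald']. Whitespace also separates parts.
    """
    # Insert a space wherever a lowercase letter is directly followed
    # by an uppercase one, then split on whitespace.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", surname)
    return [part.lower() for part in spaced.split()]


def phonetic_tokens(surname, phonetic=lambda token: token):
    """Apply a phonetic encoder (placeholder: identity) to each surname part."""
    return tuple(phonetic(part) for part in split_surname_on_uppercase(surname))
```

With a real encoder passed as `phonetic`, `phonetic_tokens('MacDonald')` would yield one code per part instead of a single code for the whole surname.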
Results for new distance models. New algorithm, small dataset:
Number of blocks = 446
True number of clusters 887
Number of computed clusters 879
B^3 F-score (overall) = 0.972288059952
B^3 F-score (train) = 0.979280751524
B^3 F-score (test) = 0.971821768671
Results for the whole algorithm! New blocking, 700000 claimed papers, new distance model.
Number of blocks = 4957
True number of clusters 10891
Number of computed clusters 10494
B^3 F-score (overall) = 0.97078952847
B^3 F-score (train) = 0.981682821495
B^3 F-score (test) = 0.970217332014
General comment about naming and code organization: I find the term preclustering a bit too vague. We should emphasize that your algorithm is meant to be a blocking function, as defined in BlockClusterer.
What would you think of the following file organisation:
and then import all block_* functions from blocking_funcs at the module level.
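To make the "blocking function" terminology concrete, here is a minimal sketch of what a `block_*` function could look like, assuming it maps each input name to a block label so that only signatures sharing a label are later compared. The function name `block_by_surname_initial` and the "Surname, Given names" input format are assumptions for illustration, not the API in this PR.

```python
def block_by_surname_initial(names):
    """Return one block label per name: lowercased surname plus first initial.

    Each name is expected as 'Surname, Given names'; the given names
    may be missing.
    """
    blocks = []
    for name in names:
        surname, _, given = name.partition(",")
        initial = given.strip()[:1].lower()
        # Drop the trailing space when there is no given name.
        blocks.append(f"{surname.strip().lower()} {initial}".strip())
    return blocks
```

For example, `block_by_surname_initial(["Smith, John", "Smith, J.", "Doe, Jane"])` puts the first two names in the same block and the third in its own.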
I fixed a few things, improved the docstrings, and added the result for the algorithm without any phonetic algorithm (see the table).
To make the code easier for others to understand, I think it would be quite helpful to use a limited, well-defined vocabulary when talking about names (surnames, first names, last names, given names, family names, etc.).
Besides my comments regarding naming, this looks fine to me. Thanks for the great work!
Just waiting for the results to give my +1.
As per #41, I had to move tests into a separate directory in order to fix the Travis build. Unfortunately, this breaks your PR... Could you move your tests around in tests/? Thanks :)
You can also squash all commits.
Ready for review!