2). Chart showing the dependency between the recall and the size of the cluster.
For every groundtruth cluster, the recall was computed. The clusters were sorted and the means were computed in the same way as for the precision chart (point 1). The y axis shows the means; the x axis has no meaning, the clusters are just sorted. The blue line is `b3_recall`, the green one is `paired_recall`.
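For concreteness, here is a minimal sketch of how the binned means could be computed (my reading of the procedure, assuming the 100-cluster bins described for the precision chart; the function and variable names are made up, this is not beard's actual code):

```python
import numpy as np

def binned_means(cluster_sizes, cluster_scores, bin_size=100):
    """Sort the clusters by size, then average the per-cluster score over
    every run of `bin_size` consecutive clusters (ranges 0-99, 100-199, ...)."""
    order = np.argsort(cluster_sizes)
    scores = np.asarray(cluster_scores, dtype=float)[order]
    # Drop the ragged tail so every bin averages exactly `bin_size` clusters.
    n_full = (len(scores) // bin_size) * bin_size
    return scores[:n_full].reshape(-1, bin_size).mean(axis=1)
```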
3). Checking the precision problems.
Here are the 50 worst-performing clusters in terms of `b3_precision`:
ID | Size | b3 precision | paired precision | Conclusion |
---|---|---|---|---|
13247 | 39 | 0.025641025641025654 | 0.0 | wrong groundtruth |
6254 | 35 | 0.028571428571428577 | 0.0 | wrong groundtruth |
7832 | 60 | 0.06222222222222219 | 0.04632768361581918 | block Johnson, R. needs further investigation. Mix of works of Johnson, Rob, Johnson, Rolland, Johnson, Randy. (+ wrong groundtruth) |
10268 | 9 | 0.11111111111111113 | 0.0 | wrong groundtruth |
4346 | 26 | 0.16863905325443793 | 0.13538461538461544 | block Zhao, J. needs further investigation. Mix of works of Zhao, Jing Xia, Zhao, Jian-Ling and Zhao-Jie. |
6193 | 6 | 0.2222222222222222 | 0.06666666666666665 | block Wang, F. needs further investigation. Mix of works of Wang, Fang, Wang, Fang-Cong and Wang Feng. |
5979 | 11 | 0.23966942148760326 | 0.1636363636363637 | Four different people named Gang, Wang (source: googling). They work in completely different parts of the world (Quebec, California, Italy, Manchuria) |
5920 | 10 | 0.24 | 0.15555555555555556 | There are two people who wrote the same papers and are clustered together: Jiu.Qing.Wang.1 and Jiang.Wang.1 . |
6342 | 4 | 0.25 | 0.0 | Wang, Yadi, Wang Y.H., Wang Y.P., Wang, Yiqun |
3944 | 141 | 0.25768321513002357 | 0.2523809523809524 | W.Li.75 and Wei.Dong.Li.1 + wrong data |
6296 | 15 | 0.26222222222222225 | 0.20952380952380956 | Zheng.Zhi.Wang.1, Zheng.Qing.Wang.1, Zheng.Wang.1, Z.D.Wang.2 and Zheng.Ben.Wang.1. We might want to add more sophisticated features on names (or change the current ones). Also, the clustering might be the problem in this case (too low threshold). |
14897 | 32 | 0.263671875 | 0.23991935483870963 | 'Saibal, Mitra', 'Sanjit, Mitra', 'Sourav, Mitra', 'Subhadip, Mitra'. Note that two of them work for the same collaboration. |
7633 | 14 | 0.2653061224489796 | 0.20879120879120883 | 'Nakamrura, Yousuke', 'Nakamura, Yoshinobu', 'Nakamura, Y.', and some bad data |
6319 | 13 | 0.26627218934911245 | 0.20512820512820518 | Y.Z.Wang.2, Yong.Hong.Wang.1, Yun.Yong.Wang.1 |
4035 | 37 | 0.27684441197954673 | 0.2567567567567568 | N.Li.21 + Ning.Li.1 + bad data |
8362 | 134 | 0.29661394519937695 | 0.2913253282459881 | Lei.Li.1 + L.Li.1 + Li.Fang.Li.2 + Li.Li.2 + Li.Xin.Li.1 |
13309 | 10 | 0.3 | 0.2222222222222222 | another example of three guys with different second given names |
5986 | 30 | 0.30222222222222217 | 0.2781609195402299 | |
7244 | 25 | 0.30240000000000017 | 0.2733333333333333 | |
6420 | 14 | 0.32653061224489793 | 0.27472527472527475 | |
6393 | 17 | 0.32871972318339093 | 0.2867647058823529 | |
12629 | 17 | 0.328719723183391 | 0.2867647058823529 | |
6283 | 42 | 0.3299319727891157 | 0.313588850174216 | |
159 | 82 | 0.3328375966686492 | 0.3246010237880157 | |
98 | 6 | 0.3333333333333333 | 0.19999999999999996 | |
1707 | 3 | 0.3333333333333333 | 0.0 | |
6253 | 3 | 0.3333333333333333 | 0.0 | |
10473 | 6 | 0.3333333333333333 | 0.19999999999999996 | |
6391 | 6 | 0.3333333333333333 | 0.19999999999999996 | |
3228 | 14 | 0.336734693877551 | 0.2857142857142857 | |
7006 | 19 | 0.34072022160664817 | 0.30409356725146197 | |
3774 | 17 | 0.342560553633218 | 0.30147058823529416 | |
6163 | 8 | 0.34375 | 0.25 | |
4243 | 16 | 0.34375 | 0.30000000000000004 | |
6387 | 8 | 0.34375 | 0.25 | |
6002 | 16 | 0.34375 | 0.30000000000000004 | |
3244 | 38 | 0.3476454293628806 | 0.3300142247510669 | |
11992 | 179 | 0.34958334633750504 | 0.3459293201933338 | |
6022 | 15 | 0.3511111111111111 | 0.3047619047619048 | |
7836 | 35 | 0.3518367346938775 | 0.3327731092436975 | |
254 | 45 | 0.3550617283950619 | 0.3404040404040404 | |
6395 | 26 | 0.3579881656804733 | 0.3323076923076923 | |
12787 | 5 | 0.36 | 0.19999999999999996 | |
14555 | 30 | 0.3600000000000003 | 0.33793103448275863 | |
5253 | 51 | 0.3656286043829296 | 0.3529411764705882 | |
11076 | 16 | 0.3671875 | 0.32499999999999996 | |
1168 | 54 | 0.36899862825788754 | 0.3570929419986024 | |
6389 | 13 | 0.3727810650887574 | 0.3205128205128205 | |
5244 | 28 | 0.375 | 0.35185185185185186 | |
6416 | 4 | 0.375 | 0.16666666666666663 | |
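For reference, a sketch of how such a ranking can be produced (assuming `predicted` maps a cluster id to the set of its signature ids and `truth_of` maps a signature id to its groundtruth cluster id; both names are hypothetical, this is not beard's evaluation code):

```python
def b3_precision_of_cluster(cluster, truth_of):
    """B^3 precision of one predicted cluster: for each signature, the
    fraction of the cluster that shares its groundtruth cluster, averaged."""
    n = len(cluster)
    total = 0.0
    for e in cluster:
        same = sum(1 for f in cluster if truth_of[f] == truth_of[e])
        total += same / n
    return total / n

def worst_clusters(predicted, truth_of, k=50):
    """The k predicted clusters with the lowest B^3 precision."""
    ranked = sorted(predicted.items(),
                    key=lambda item: b3_precision_of_cluster(item[1], truth_of))
    return ranked[:k]
```

As a sanity check against the table: a size-39 cluster whose signatures all belong to different groundtruth clusters scores 1/39 ≈ 0.0256, which matches the first row.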
1). Adding a Subject Category feature.
2). Changing the name features to depend more heavily on comparing only first given names, only second given names, only first initials, and only second initials; a sketch of these comparisons follows.
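A minimal sketch of what such pairwise name features could look like (the helper and feature names are my own, not beard's API; `jaro_winkler_similarity` assumes the jellyfish package, version >= 0.8):

```python
from jellyfish import jaro_winkler_similarity  # assumed external dependency

def given_names(full_name):
    """Split 'Wang, Fang Cong' into its given-name tokens."""
    _, _, rest = full_name.partition(",")
    return rest.split()

def name_pair_features(a, b):
    """Pairwise name features between two author name strings."""
    ga, gb = given_names(a), given_names(b)
    feats = {}
    # Compare only the first given names, if both are present.
    if ga and gb:
        feats["first_given_name_similarity"] = jaro_winkler_similarity(ga[0], gb[0])
        feats["first_initial_equality"] = float(ga[0][0] == gb[0][0])
    # Compare only the second given names / initials, if both are present.
    if len(ga) > 1 and len(gb) > 1:
        feats["second_given_name_similarity"] = jaro_winkler_similarity(ga[1], gb[1])
        feats["second_initial_equality"] = float(ga[1][0] == gb[1][0])
    return feats

print(name_pair_features("Wang, Fang Cong", "Wang, Fang"))
```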
1). TBA
2). After adding those features, the score increased:
Number of blocks = 13114
True number of clusters 15575
Number of computed clusters 15517
B^3 F-score (overall) = 0.9822740637537467
B^3 F-score (train) = 0.9885050848645293
B^3 F-score (test) = 0.9818874209754536
(Note that there was no references feature, no race feature, and the coauthors feature was limited to adjacent coauthors.)
1). After adding the first given name Jaro-Winkler similarity, second given name Jaro-Winkler similarity and second initial equality features, my next idea was to add a subject category feature.
The input files were the same as in @natsheh's experiments.
The results didn't improve, though:
Number of blocks = 13114
True number of clusters 15575
Number of computed clusters 15583
(Precision, recall, f-score)
B^3 score (overall) = (0.9886385016454305, 0.9760140889071671, 0.9822857344672905)
B^3 score (train) = (0.9974044814376634, 0.9792099539286123, 0.9882234783292826)
B^3 score (test) = (0.9880124218626131, 0.9759391232459712, 0.9819386625400839)
Note that the precision probably can't be improved. It might be high time to switch back to my blocking method. Note also that, according to the table in https://github.com/inveniosoftware/beard/pull/35, the highest recall we can obtain with LNFI blocking is 0.9816.
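(For reference, a minimal sketch of LNFI blocking as I understand it: block on the exact surname plus the first initial. The function name is hypothetical.)

```python
def lnfi_key(author_name):
    """LNFI = Last Name, First Initial: the blocking key of a signature."""
    surname, _, given = author_name.partition(",")
    return (surname.strip().lower(), given.strip()[:1].upper())

print(lnfi_key("Wang, Fang"))      # ('wang', 'F')
print(lnfi_key("Johnson, Randy"))  # ('johnson', 'R')
```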
Here is the feature importance table:
Name | Importance |
---|---|
second_initial_equality | 2.29418443e-02 |
first_given_name_similarity | 5.27462494e-02 |
second_given_name_similarity | 2.48752032e-03 |
full_name_similarity | 1.57754884e-01 |
other_names_similarity | 1.34540621e-01 |
initials_similarity | 7.49790123e-08 |
affiliation_similarity | 1.31264315e-01 |
adjacent_coauthors_similarity | 1.92384182e-01 |
title_similarity | 6.66225566e-02 |
journal_similarity | 4.22976083e-02 |
abstract_similarity | 3.15329389e-02 |
keywords_similarity | 5.89478829e-02 |
subject_similarity | 3.81763655e-02 |
collaboration_similarity | 1.02636333e-02 |
year_difference | 5.80393237e-02 |
Due to the introduction of the new features, `initials_similarity` will be dropped. I hoped it might still be useful for cases where there are more than two given names, but apparently it is irrelevant.
The result of the clustering is available at `/home/scarli/results/output_june_24.json`.
I can see that `subject_similarity` did not contribute much to the importance. However, it increases the overall evaluation a bit. Let us see how much we can gain if we add ethnicity features. If `initials_similarity` does not consume any considerable time, I would vote for keeping it even though it has a very low importance.
Good news!
I ran the algorithm using my blocking strategy with threshold = 1 (equivalent to splitting over the first characters of the double metaphone result). The features `surnames_similarity` and `first_initial_equality` were added (they were obsolete in the case of LNFI). The score improved.
[Parallel(n_jobs=-1)]: Done 10074 out of 10074 | elapsed: 208.0min finished
Number of blocks = 10074
True number of clusters 15575
Number of computed clusters 14675
B^3 F-score (overall) = 0.9849559139067491
B^3 F-score (train) = 0.9906319911282302
B^3 F-score (test) = 0.9846093408117573
I will rerun the algorithm to check the precision and recall. Note that, unlike before, the pairs were sampled without treating equality of names as two different cases. Note also that the pairs were sampled without considering the split between the training and test sets (the score might decrease a bit after taking this into account).
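A loose sketch of the blocking strategy, as I read the description above: group by the double metaphone key of the surname, then split every block larger than `threshold` by the first initial (so threshold = 1 splits every non-trivial block). The structure and names are hypothetical, not beard's implementation; `doublemetaphone` assumes the Metaphone package.

```python
from collections import defaultdict
from metaphone import doublemetaphone  # assumed external package

def phonetic_blocks(signatures, threshold=1):
    # First pass: group by the phonetic key of the surname only.
    coarse = defaultdict(list)
    for sig in signatures:
        surname = sig["author_name"].partition(",")[0].strip()
        coarse[doublemetaphone(surname)[0]].append(sig)
    # Second pass: split blocks above the threshold by the first initial.
    blocks = defaultdict(list)
    for key, sigs in coarse.items():
        if len(sigs) <= threshold:
            blocks[key].extend(sigs)
            continue
        for sig in sigs:
            given = sig["author_name"].partition(",")[2].strip()
            blocks[(key, given[:1].upper())].append(sig)
    return blocks
```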
Precision, recall and f-score for the previous post:
B^3 (overall) = 0.9857904093245792 0.984122830135025 0.9849559139067491
B^3 (train) = 0.9952863025258638 0.986021007532837 0.9906319911282302
B^3 (test) = 0.9851164759683962 0.9841027275299278 0.9846093408117573
Due to some invalid data indicated in the big summary table, I decided to fix the input for the D.Wang, R.Johnson, V.Visnjic and W.Li clusters. In the table, the ids of these blocks are 13247, 6254, 7832 and 3944.
The new clusters file is available at `/home/scarli/clustersimproved.json`.
Results of the algorithm with correct pair sampling (precision, recall, f-score):
With the `surnames_similarity` feature:
Number of blocks = 10074
True number of clusters 15388
Number of computed clusters 14683
B^3 (overall) = 0.9852518017207982 0.9834729083475536 0.9843615513510563
B^3 (train) = 0.996422982510764 0.9865477668374687 0.9914607853339493
B^3 (test) = 0.9844047733854243 0.9833353660992127 0.9838697791470511
Without it:
Number of blocks = 10074
True number of clusters 15388
Number of computed clusters 14705
B^3 (overall) = 0.9850478779929729 0.9831105615996569 0.9840782663174861
B^3 (train) = 0.9965251672952843 0.9863770701902301 0.9914251507769879
B^3 (test) = 0.9841691567953614 0.9829758605008408 0.9835721467134105
Result for blocking with double metaphone without a threshold:
Number of blocks = 4797
True number of clusters 15388
Number of computed clusters 15699
B^3 (overall) = 0.9773754185932135 0.9817767784121295 0.9795711545355091
B^3 (train) = 0.9899992085911132 0.9866716219081106 0.9883326143702338
B^3 (test) = 0.9766268174922395 0.981520234352044 0.9790674115885576
Result for blocking with nysiis with threshold = 1:
Number of blocks = 10804
True number of clusters 15575
Number of computed clusters 14995
B^3 (overall) = 0.9852947677142105 0.981741997355113 0.9835151741103332
B^3 (train) = 0.9964027111117821 0.9851529076879183 0.990745875378954
B^3 (test) = 0.9844895453357658 0.981596829569563 0.9830410594168337
Slightly worse than double metaphone with threshold=1 (two posts above).
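The difference between the two phonetic encodings is easy to inspect (assuming the Metaphone and jellyfish packages; not the exact calls used in beard):

```python
from metaphone import doublemetaphone
import jellyfish

for surname in ["Wang", "Johnson", "Mitra", "Nakamura"]:
    print(surname, doublemetaphone(surname)[0], jellyfish.nysiis(surname))
```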
Closing. Results are summarized in http://arxiv.org/abs/1508.07744
The computations for the error analysis were done on 1.2 million signatures using LNFI blocking and the default clustering strategy. The overall `b3_f_score` for this strategy is 0.98111.
1). Chart showing the dependency between the precision and the size of the cluster.
For every predicted cluster, the precision was computed. The clusters were sorted and, for every 100 clusters (by this I mean the ranges 0-99, 100-199, etc.), I computed the mean of the precision. The y axis shows the means; the x axis has no meaning, but it is important to note that the clusters are sorted: the smallest are on the left, the biggest on the right. You can notice that there are no errors for the smallest clusters (all of them contain only one signature). The blue line is `b3_precision`, the green one `paired_precision`.
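For reference, the B^3 scores used throughout this thread follow the standard per-signature definitions (the textbook formulation, not copied from beard's code):

$$P_{B^3} = \frac{1}{|S|} \sum_{e \in S} \frac{|C(e) \cap T(e)|}{|C(e)|}, \qquad R_{B^3} = \frac{1}{|S|} \sum_{e \in S} \frac{|C(e) \cap T(e)|}{|T(e)|}, \qquad F_{B^3} = \frac{2 \, P_{B^3} R_{B^3}}{P_{B^3} + R_{B^3}}$$

where $S$ is the set of signatures, $C(e)$ is the predicted cluster of signature $e$ and $T(e)$ its groundtruth cluster. The per-cluster values in the charts above restrict the average to the signatures of a single cluster.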