inspirehep / beard

Bibliographic Entity Automatic Recognition and Disambiguation

results: error analysis #51

Closed MSusik closed 9 years ago

MSusik commented 9 years ago

The computations for error analysis were done on 1.2 million signatures using LNFI blocking and the default clustering strategy. The overall b3_f_score for this strategy is 0.98111.
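For reference, the B^3 metrics used throughout this thread can be illustrated with a minimal pure-Python sketch (this is not beard's own implementation, just the definition of the metric: per-signature precision and recall against the groundtruth cluster, averaged):

```python
from collections import defaultdict

def b_cubed(truth, pred):
    """B^3 precision/recall/F over two clusterings.

    truth, pred: dicts mapping a signature id to its cluster label.
    """
    t_clusters, p_clusters = defaultdict(set), defaultdict(set)
    for e, c in truth.items():
        t_clusters[c].add(e)
    for e, c in pred.items():
        p_clusters[c].add(e)
    precisions, recalls = [], []
    for e in truth:
        t = t_clusters[truth[e]]   # groundtruth cluster of e
        p = p_clusters[pred[e]]    # predicted cluster of e
        overlap = len(t & p)
        precisions.append(overlap / len(p))
        recalls.append(overlap / len(t))
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```

For example, merging two groundtruth clusters {a, b} and {c} into one predicted cluster gives perfect recall but degraded precision.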

1). Chart showing the dependency between precision and cluster size.

For every predicted cluster, the precision was computed. The clusters were then sorted by size, and the mean precision was computed for every group of 100 clusters (i.e. ranges 0-99, 100-199, etc.).

The y axis shows these means. The x axis has no inherent meaning, but note that the clusters are sorted: the smallest are on the left, the biggest on the right. You can see that there are no errors for the smallest clusters (all of them contain only one signature). The blue line is b3_precision, the green one is paired_precision.

figure_2
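The binning used for these charts (sort clusters by size, average the per-cluster score over consecutive groups of 100) can be sketched like this; `binned_means` is a hypothetical helper, not beard's code:

```python
def binned_means(values, sizes, bin_size=100):
    """Sort per-cluster scores by cluster size, then average them
    over consecutive bins of `bin_size` clusters each.

    values: per-cluster precision (or recall) scores.
    sizes:  the corresponding cluster sizes.
    """
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    ranked = [values[i] for i in order]
    means = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        means.append(sum(chunk) / len(chunk))
    return means
```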

MSusik commented 9 years ago

2). Chart showing the dependency between the recall and the size of the cluster.

For every groundtruth cluster, the recall was computed. The clusters were sorted and means were computed in the same way as in the previous point.

The y axis shows the means; the x axis has no inherent meaning (the clusters are sorted by size). The blue line is b3_recall, the green one is paired_recall.

figure_3

MSusik commented 9 years ago

3). Checking the precision problems.

Here are the 50 worst-performing clusters in terms of b3_precision:

| ID | Size | b3 precision | Paired precision | Conclusion |
| --- | --- | --- | --- | --- |
| 13247 | 39 | 0.025641025641025654 | 0.0 | wrong groundtruth |
| 6254 | 35 | 0.028571428571428577 | 0.0 | wrong groundtruth |
| 7832 | 60 | 0.06222222222222219 | 0.04632768361581918 | block Johnson, R. needs further investigation. Mix of works of Johnson, Rob; Johnson, Rolland; Johnson, Randy (+ wrong groundtruth) |
| 10268 | 9 | 0.11111111111111113 | 0.0 | wrong groundtruth |
| 4346 | 26 | 0.16863905325443793 | 0.13538461538461544 | block Zhao, J. needs further investigation. Mix of works of Zhao, Jing Xia; Zhao, Jian-Ling; and Zhao-Jie |
| 6193 | 6 | 0.2222222222222222 | 0.06666666666666665 | block Wang, F. needs further investigation. Mix of works of Wang, Fang; Wang, Fang-Cong; and Wang, Feng |
| 5979 | 11 | 0.23966942148760326 | 0.1636363636363637 | Four different people named Gang, Wang (source: googling). They work in completely different parts of the world (Quebec, California, Italy, Manchuria) |
| 5920 | 10 | 0.24 | 0.15555555555555556 | Two people who wrote the same papers and are clustered together: Jiu.Qing.Wang.1 and Jiang.Wang.1 |
| 6342 | 4 | 0.25 | 0.0 | Wang, Yadi; Wang, Y.H.; Wang, Y.P.; Wang, Yiqun |
| 3944 | 141 | 0.25768321513002357 | 0.2523809523809524 | W.Li.75 and Wei.Dong.Li.1 + wrong data |
| 6296 | 15 | 0.26222222222222225 | 0.20952380952380956 | Zheng.Zhi.Wang.1, Zheng.Qing.Wang.1, Zheng.Wang.1, Z.D.Wang.2 and Zheng.Ben.Wang.1. We might want to add more sophisticated features on names (or change the current ones). Also, the clustering might be the problem for this case (threshold too low) |
| 14897 | 32 | 0.263671875 | 0.23991935483870963 | Saibal Mitra, Sanjit Mitra, Sourav Mitra, Subhadip Mitra. Note that two of them work for the same collaboration |
| 7633 | 14 | 0.2653061224489796 | 0.20879120879120883 | Nakamrura, Yousuke; Nakamura, Yoshinobu; Nakamura, Y.; and some bad data |
| 6319 | 13 | 0.26627218934911245 | 0.20512820512820518 | Y.Z.Wang.2, Yong.Hong.Wang.1, Yun.Yong.Wang.1 |
| 4035 | 37 | 0.27684441197954673 | 0.2567567567567568 | N.Li.21 + Ning.Li.1 + bad data |
| 8362 | 134 | 0.29661394519937695 | 0.2913253282459881 | Lei.Li.1 + L.Li.1 + Li.Fang.Li.2 + Li.Li.2 + Li.Xin.Li.1 |
| 13309 | 10 | 0.3 | 0.2222222222222222 | another example of three people with different second given names |
| 5986 | 30 | 0.30222222222222217 | 0.2781609195402299 | |
| 7244 | 25 | 0.30240000000000017 | 0.2733333333333333 | |
| 6420 | 14 | 0.32653061224489793 | 0.27472527472527475 | |
| 6393 | 17 | 0.32871972318339093 | 0.2867647058823529 | |
| 12629 | 17 | 0.328719723183391 | 0.2867647058823529 | |
| 6283 | 42 | 0.3299319727891157 | 0.313588850174216 | |
| 159 | 82 | 0.3328375966686492 | 0.3246010237880157 | |
| 98 | 6 | 0.3333333333333333 | 0.19999999999999996 | |
| 1707 | 3 | 0.3333333333333333 | 0.0 | |
| 6253 | 3 | 0.3333333333333333 | 0.0 | |
| 10473 | 6 | 0.3333333333333333 | 0.19999999999999996 | |
| 6391 | 6 | 0.3333333333333333 | 0.19999999999999996 | |
| 3228 | 14 | 0.336734693877551 | 0.2857142857142857 | |
| 7006 | 19 | 0.34072022160664817 | 0.30409356725146197 | |
| 3774 | 17 | 0.342560553633218 | 0.30147058823529416 | |
| 6163 | 8 | 0.34375 | 0.25 | |
| 4243 | 16 | 0.34375 | 0.30000000000000004 | |
| 6387 | 8 | 0.34375 | 0.25 | |
| 6002 | 16 | 0.34375 | 0.30000000000000004 | |
| 3244 | 38 | 0.3476454293628806 | 0.3300142247510669 | |
| 11992 | 179 | 0.34958334633750504 | 0.3459293201933338 | |
| 6022 | 15 | 0.3511111111111111 | 0.3047619047619048 | |
| 7836 | 35 | 0.3518367346938775 | 0.3327731092436975 | |
| 254 | 45 | 0.3550617283950619 | 0.3404040404040404 | |
| 6395 | 26 | 0.3579881656804733 | 0.3323076923076923 | |
| 12787 | 5 | 0.36 | 0.19999999999999996 | |
| 14555 | 30 | 0.3600000000000003 | 0.33793103448275863 | |
| 5253 | 51 | 0.3656286043829296 | 0.3529411764705882 | |
| 11076 | 16 | 0.3671875 | 0.32499999999999996 | |
| 1168 | 54 | 0.36899862825788754 | 0.3570929419986024 | |
| 6389 | 13 | 0.3727810650887574 | 0.3205128205128205 | |
| 5244 | 28 | 0.375 | 0.35185185185185186 | |
| 6416 | 4 | 0.375 | 0.16666666666666663 | |
MSusik commented 9 years ago

Things to try:

1). Adding a subject category feature.

2). Changing the name features to depend more heavily on comparing only first given names, only second given names, only first initials, only second initials.
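The name-feature idea in 2) amounts to splitting a "Surname, Given Given" string into its given-name parts and comparing them separately. A minimal sketch (`given_name_parts` and `initials_equal` are hypothetical helpers, not beard's actual feature functions):

```python
def given_name_parts(full_name):
    """Split 'Surname, First Second' into (first, second) given
    names; either part may be a bare initial or empty."""
    if ',' not in full_name:
        return '', ''
    given = full_name.split(',', 1)[1].strip()
    # pad dots with spaces so 'Y.H.' splits into two initials
    parts = given.replace('.', '. ').split()
    first = parts[0].rstrip('.') if parts else ''
    second = parts[1].rstrip('.') if len(parts) > 1 else ''
    return first, second

def initials_equal(a, b):
    """Compare only the first letters of two given-name parts."""
    return bool(a) and bool(b) and a[0].upper() == b[0].upper()
```

For example, "Johnson, Rob" and "Johnson, Rolland" agree on the first initial but disagree on the full first given name, which is exactly the distinction these features are meant to capture.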

MSusik commented 9 years ago

Results:

1). TBA.

2). After adding those features, the score increased:

Number of blocks = 13114
True number of clusters = 15575
Number of computed clusters = 15517
B^3 F-score (overall) = 0.9822740637537467
B^3 F-score (train) = 0.9885050848645293
B^3 F-score (test) = 0.9818874209754536

(Note that there was no references feature, no race feature, and the coauthors feature was limited to adjacent coauthors.)

MSusik commented 9 years ago

Results part 2:

1). After adding the first given name Jaro-Winkler similarity, second given name Jaro-Winkler similarity and second initial equality features, my next idea was to add a subject category feature.
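For reference, Jaro-Winkler boosts the plain Jaro similarity for strings sharing a common prefix, which suits given names well. A self-contained pure-Python sketch of the standard definition (beard would use an existing implementation rather than this one):

```python
def jaro(s1, s2):
    """Plain Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count half-transpositions between the matched subsequences
    t, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by up to 4 chars of common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```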

The input files were the same as in @natsheh 's experiments.

Results didn't improve, though:

Number of blocks = 13114
True number of clusters = 15575
Number of computed clusters = 15583
(Precision, recall, f-score):
B^3 score (overall) = (0.9886385016454305, 0.9760140889071671, 0.9822857344672905)
B^3 score (train) = (0.9974044814376634, 0.9792099539286123, 0.9882234783292826)
B^3 score (test) = (0.9880124218626131, 0.9759391232459712, 0.9819386625400839)

Note that precision probably can't be improved. It might be high time to switch back to my blocking method. Note also that, according to the table in https://github.com/inveniosoftware/beard/pull/35, the highest recall we can obtain with LNFI blocking is 0.9816.
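The recall ceiling exists because same-person pairs split across blocks can never be merged. The quoted 0.9816 is a B^3 figure from PR #35; as a simpler illustration of the same idea, the pairwise recall ceiling of a blocking can be computed like this (hypothetical helper, not the computation used in that PR):

```python
from itertools import combinations

def pairwise_recall_ceiling(truth_clusters, block_of):
    """Fraction of same-cluster signature pairs that share a block;
    no clustering restricted to these blocks can recall more.

    truth_clusters: iterable of sets of signature ids.
    block_of: dict mapping a signature id to its block key.
    """
    same, kept = 0, 0
    for cluster in truth_clusters:
        for a, b in combinations(sorted(cluster), 2):
            same += 1
            if block_of[a] == block_of[b]:
                kept += 1
    return kept / same if same else 1.0
```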

Here is the feature importances table:

| Name | Importance |
| --- | --- |
| second_initial_equality | 2.29418443e-02 |
| first_given_name_similarity | 5.27462494e-02 |
| second_given_name_similarity | 2.48752032e-03 |
| full_name_similarity | 1.57754884e-01 |
| other_names_similarity | 1.34540621e-01 |
| initials_similarity | 7.49790123e-08 |
| affiliation_similarity | 1.31264315e-01 |
| adjacent_coauthors_simil. | 1.92384182e-01 |
| title_similarity | 6.66225566e-02 |
| journal_similarity | 4.22976083e-02 |
| abstract_similarity | 3.15329389e-02 |
| keywords_similarity | 5.89478829e-02 |
| subject_similarity | 3.81763655e-02 |
| collaboration_similarity | 1.02636333e-02 |
| year_difference | 5.80393237e-02 |

Due to the introduction of the new features, initials_similarity will be dropped. I hoped it might still be useful for cases where there are more than two given names, but apparently it is irrelevant.

The result of the clustering is available at /home/scarli/results/output_june_24.json

natsheh commented 9 years ago

I can see that subject_similarity did not contribute much to the importances. However, it increases the overall evaluation a bit. Let us see how much we can gain if we add ethnicity features. If initials_similarity does not consume any considerable time, I would vote for keeping it even though it has a very low importance.

MSusik commented 9 years ago

Good news!

I ran the algorithm using my blocking strategy with threshold = 1 (equivalent to splitting on the first character of the double metaphone result). The features surnames_similarity and first_initial_equality were added (they were redundant in the case of LNFI). The score improved.

[Parallel(n_jobs=-1)]: Done 10074 out of 10074 | elapsed: 208.0min finished
Number of blocks = 10074
True number of clusters = 15575
Number of computed clusters = 14675
B^3 F-score (overall) = 0.9849559139067491
B^3 F-score (train) = 0.9906319911282302
B^3 F-score (test) = 0.9846093408117573

I will rerun the algorithm to check precision and recall. Note that the pairs were sampled without considering equality of names as two different cases, as was done before. Note also that the pairs were sampled without respecting the split between the training and test sets (the score might decrease a bit after taking this into account).
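The blocking described above (group signatures by the first character of a phonetic encoding of the surname) can be sketched as follows. The real strategy uses double metaphone, which is not reimplemented here; the trivial `str.upper` stand-in only illustrates the grouping mechanics:

```python
from collections import defaultdict

def phonetic_block_key(full_name, phonetic=str.upper):
    """Block key: first character of a phonetic encoding of the
    surname. `str.upper` is a stand-in; the actual strategy applies
    double metaphone to the surname instead."""
    surname = full_name.split(',')[0].strip()
    return phonetic(surname)[:1]

def make_blocks(names):
    """Group signatures into blocks; clustering then runs per block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[phonetic_block_key(name)].append(name)
    return blocks
```

A proper phonetic encoding would also put spelling variants like "Johnson" and "Jonson" into the same block, which a raw first-letter split already happens to do but a surname-equality split would not.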

MSusik commented 9 years ago

Precision, recall, f_score for the previous post.

B^3 (overall): precision = 0.9857904093245792, recall = 0.984122830135025, f-score = 0.9849559139067491
B^3 (train): precision = 0.9952863025258638, recall = 0.986021007532837, f-score = 0.9906319911282302
B^3 (test): precision = 0.9851164759683962, recall = 0.9841027275299278, f-score = 0.9846093408117573

MSusik commented 9 years ago

Due to some invalid data indicated in the big summary table, I decided to fix the input for the D.Wang, R.Johnson, V.Visnjic and W.Li clusters.

In the table the ids for these blocks are 13247, 6254, 7832, 3944.

New clusters file is available at:

/home/scarli/clustersimproved.json

MSusik commented 9 years ago

Results of the algorithm with correct pair sampling:

With the surnames similarity feature:

Number of blocks = 10074
True number of clusters = 15388
Number of computed clusters = 14683
B^3 (overall): precision = 0.9852518017207982, recall = 0.9834729083475536, f-score = 0.9843615513510563
B^3 (train): precision = 0.996422982510764, recall = 0.9865477668374687, f-score = 0.9914607853339493
B^3 (test): precision = 0.9844047733854243, recall = 0.9833353660992127, f-score = 0.9838697791470511

Without it:

Number of blocks = 10074
True number of clusters = 15388
Number of computed clusters = 14705
B^3 (overall): precision = 0.9850478779929729, recall = 0.9831105615996569, f-score = 0.9840782663174861
B^3 (train): precision = 0.9965251672952843, recall = 0.9863770701902301, f-score = 0.9914251507769879
B^3 (test): precision = 0.9841691567953614, recall = 0.9829758605008408, f-score = 0.9835721467134105

MSusik commented 9 years ago

Result for blocking with double metaphone without threshold:

Number of blocks = 4797
True number of clusters = 15388
Number of computed clusters = 15699
B^3 (overall): precision = 0.9773754185932135, recall = 0.9817767784121295, f-score = 0.9795711545355091
B^3 (train): precision = 0.9899992085911132, recall = 0.9866716219081106, f-score = 0.9883326143702338
B^3 (test): precision = 0.9766268174922395, recall = 0.981520234352044, f-score = 0.9790674115885576

MSusik commented 9 years ago

Result for blocking with nysiis with threshold=1:

Number of blocks = 10804
True number of clusters = 15575
Number of computed clusters = 14995
B^3 (overall): precision = 0.9852947677142105, recall = 0.981741997355113, f-score = 0.9835151741103332
B^3 (train): precision = 0.9964027111117821, recall = 0.9851529076879183, f-score = 0.990745875378954
B^3 (test): precision = 0.9844895453357658, recall = 0.981596829569563, f-score = 0.9830410594168337

Slightly worse than double metaphone with threshold=1 (two posts above).

glouppe commented 9 years ago

Closing. Results are summarized in http://arxiv.org/abs/1508.07744