2). Chart showing the dependency between the recall and the size of the cluster.
For every groundtruth cluster, the recall was computed. The clusters were sorted and the means were computed in the same way as for the precision chart (point 1). The y axis shows the means; the x axis has no meaning, the clusters are just sorted. The blue line is `b3_recall`, the green one is `paired_recall`.
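For concreteness, here is a minimal sketch of how the binned means could be computed (my reading of the procedure, assuming the 100-cluster bins described for the precision chart; the function and variable names are made up, this is not beard's actual code):

```python
import numpy as np

def binned_means(cluster_sizes, cluster_scores, bin_size=100):
    """Sort the clusters by size, then average the per-cluster score over
    every run of `bin_size` consecutive clusters (ranges 0-99, 100-199, ...)."""
    order = np.argsort(cluster_sizes)
    scores = np.asarray(cluster_scores, dtype=float)[order]
    # Drop the ragged tail so every bin averages exactly `bin_size` clusters.
    n_full = (len(scores) // bin_size) * bin_size
    return scores[:n_full].reshape(-1, bin_size).mean(axis=1)
```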
3). Checking the precision problems.
Here are the 50 worst-performing clusters in terms of `b3_precision`:
ID | Size | b3 precision | paired precision | Conclusion |
---|---|---|---|---|
13247 | 39 | 0.025641025641025654 | 0.0 | wrong groundtruth |
6254 | 35 | 0.028571428571428577 | 0.0 | wrong groundtruth |
7832 | 60 | 0.06222222222222219 | 0.04632768361581918 | block Johnson, R. needs further investigation. Mix of works of Johnson, Rob, Johnson, Rolland, Johnson, Randy. (+ wrong groundtruth) |
10268 | 9 | 0.11111111111111113 | 0.0 | wrong groundtruth |
4346 | 26 | 0.16863905325443793 | 0.13538461538461544 | block Zhao, J. needs further investigation. Mix of works of Zhao, Jing Xia, Zhao, Jian-Ling and Zhao-Jie. |
6193 | 6 | 0.2222222222222222 | 0.06666666666666665 | block Wang, F. needs further investigation. Mix of works of Wang, Fang, Wang, Fang-Cong and Wang Feng. |
5979 | 11 | 0.23966942148760326 | 0.1636363636363637 | Four different people named Gang, Wang (source: googling). They work in completely different parts of the world (Quebec, California, Italy, Manchuria) |
5920 | 10 | 0.24 | 0.15555555555555556 | There are two people who wrote the same papers and are clustered together: Jiu.Qing.Wang.1 and Jiang.Wang.1 . |
6342 | 4 | 0.25 | 0.0 | Wang, Yadi, Wang Y.H., Wang Y.P., Wang, Yiqun |
3944 | 141 | 0.25768321513002357 | 0.2523809523809524 | W.Li.75 and Wei.Dong.Li.1 + wrong data |
6296 | 15 | 0.26222222222222225 | 0.20952380952380956 | Zheng.Zhi.Wang.1, Zheng.Qing.Wang.1, Zheng.Wang.1, Z.D.Wang.2 and Zheng.Ben.Wang.1. We might want to add more sophisticated features on names (or change the current ones). Also, the clustering might be the problem in this case (too low threshold). |
14897 | 32 | 0.263671875 | 0.23991935483870963 | 'Saibal, Mitra', 'Sanjit, Mitra', 'Sourav, Mitra', 'Subhadip, Mitra'. Note that two of them work for the same collaboration. |
7633 | 14 | 0.2653061224489796 | 0.20879120879120883 | 'Nakamrura, Yousuke', 'Nakamura, Yoshinobu', 'Nakamura, Y.', and some bad data |
6319 | 13 | 0.26627218934911245 | 0.20512820512820518 | Y.Z.Wang.2, Yong.Hong.Wang.1, Yun.Yong.Wang.1 |
4035 | 37 | 0.27684441197954673 | 0.2567567567567568 | N.Li.21 + Ning.Li.1 + bad data |
8362 | 134 | 0.29661394519937695 | 0.2913253282459881 | Lei.Li.1 + L.Li.1 + Li.Fang.Li.2 + Li.Li.2 + Li.Xin.Li.1 |
13309 | 10 | 0.3 | 0.2222222222222222 | another example of three guys with different second given names |
5986 | 30 | 0.30222222222222217 | 0.2781609195402299 | |
7244 | 25 | 0.30240000000000017 | 0.2733333333333333 | |
6420 | 14 | 0.32653061224489793 | 0.27472527472527475 | |
6393 | 17 | 0.32871972318339093 | 0.2867647058823529 | |
12629 | 17 | 0.328719723183391 | 0.2867647058823529 | |
6283 | 42 | 0.3299319727891157 | 0.313588850174216 | |
159 | 82 | 0.3328375966686492 | 0.3246010237880157 | |
98 | 6 | 0.3333333333333333 | 0.19999999999999996 | |
1707 | 3 | 0.3333333333333333 | 0.0 | |
6253 | 3 | 0.3333333333333333 | 0.0 | |
10473 | 6 | 0.3333333333333333 | 0.19999999999999996 | |
6391 | 6 | 0.3333333333333333 | 0.19999999999999996 | |
3228 | 14 | 0.336734693877551 | 0.2857142857142857 | |
7006 | 19 | 0.34072022160664817 | 0.30409356725146197 | |
3774 | 17 | 0.342560553633218 | 0.30147058823529416 | |
6163 | 8 | 0.34375 | 0.25 | |
4243 | 16 | 0.34375 | 0.30000000000000004 | |
6387 | 8 | 0.34375 | 0.25 | |
6002 | 16 | 0.34375 | 0.30000000000000004 | |
3244 | 38 | 0.3476454293628806 | 0.3300142247510669 | |
11992 | 179 | 0.34958334633750504 | 0.3459293201933338 | |
6022 | 15 | 0.3511111111111111 | 0.3047619047619048 | |
7836 | 35 | 0.3518367346938775 | 0.3327731092436975 | |
254 | 45 | 0.3550617283950619 | 0.3404040404040404 | |
6395 | 26 | 0.3579881656804733 | 0.3323076923076923 | |
12787 | 5 | 0.36 | 0.19999999999999996 | |
14555 | 30 | 0.3600000000000003 | 0.33793103448275863 | |
5253 | 51 | 0.3656286043829296 | 0.3529411764705882 | |
11076 | 16 | 0.3671875 | 0.32499999999999996 | |
1168 | 54 | 0.36899862825788754 | 0.3570929419986024 | |
6389 | 13 | 0.3727810650887574 | 0.3205128205128205 | |
5244 | 28 | 0.375 | 0.35185185185185186 | |
6416 | 4 | 0.375 | 0.16666666666666663 | |
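For reference, a sketch of how such a ranking can be produced (assuming `predicted` maps a cluster id to the set of its signature ids and `truth_of` maps a signature id to its groundtruth cluster id; both names are hypothetical, this is not beard's evaluation code):

```python
def b3_precision_of_cluster(cluster, truth_of):
    """B^3 precision of one predicted cluster: for each signature, the
    fraction of the cluster that shares its groundtruth cluster, averaged."""
    n = len(cluster)
    total = 0.0
    for e in cluster:
        same = sum(1 for f in cluster if truth_of[f] == truth_of[e])
        total += same / n
    return total / n

def worst_clusters(predicted, truth_of, k=50):
    """The k predicted clusters with the lowest B^3 precision."""
    ranked = sorted(predicted.items(),
                    key=lambda item: b3_precision_of_cluster(item[1], truth_of))
    return ranked[:k]
```

As a sanity check against the table: a size-39 cluster whose signatures all belong to different groundtruth clusters scores 1/39 ≈ 0.0256, which matches the first row.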
1). Adding a Subject Category feature.
2). Changing the name features to depend more heavily on comparing only first given names, only second given names, only first initials, and only second initials; a sketch of these comparisons follows.
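A minimal sketch of what such pairwise name features could look like (the helper and feature names are my own, not beard's API; `jaro_winkler_similarity` assumes the jellyfish package, version >= 0.8):

```python
from jellyfish import jaro_winkler_similarity  # assumed external dependency

def given_names(full_name):
    """Split 'Wang, Fang Cong' into its given-name tokens."""
    _, _, rest = full_name.partition(",")
    return rest.split()

def name_pair_features(a, b):
    """Pairwise name features between two author name strings."""
    ga, gb = given_names(a), given_names(b)
    feats = {}
    # Compare only the first given names, if both are present.
    if ga and gb:
        feats["first_given_name_similarity"] = jaro_winkler_similarity(ga[0], gb[0])
        feats["first_initial_equality"] = float(ga[0][0] == gb[0][0])
    # Compare only the second given names / initials, if both are present.
    if len(ga) > 1 and len(gb) > 1:
        feats["second_given_name_similarity"] = jaro_winkler_similarity(ga[1], gb[1])
        feats["second_initial_equality"] = float(ga[1][0] == gb[1][0])
    return feats

print(name_pair_features("Wang, Fang Cong", "Wang, Fang"))
```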
1). TBA
2). After adding those features, the score increased:
Number of blocks = 13114
True number of clusters 15575
Number of computed clusters 15517
B^3 F-score (overall) = 0.9822740637537467
B^3 F-score (train) = 0.9885050848645293
B^3 F-score (test) = 0.9818874209754536
(Note that there was no references feature, no race feature, and the coauthors feature was limited to adjacent coauthors.)
1). After adding the first given name Jaro-Winkler similarity, second given name Jaro-Winkler similarity and second initial equality features, my next idea was to add a subject category feature.
The input files were the same as in @natsheh's experiments.
The results didn't improve, though:
Number of blocks = 13114
True number of clusters 15575
Number of computed clusters 15583
(Precision, recall, f-score)
B^3 score (overall) = (0.9886385016454305, 0.9760140889071671, 0.9822857344672905)
B^3 score (train) = (0.9974044814376634, 0.9792099539286123, 0.9882234783292826)
B^3 score (test) = (0.9880124218626131, 0.9759391232459712, 0.9819386625400839)
Note that the precision probably can't be improved. It might be high time to switch back to my blocking method. Note also that, according to the table in https://github.com/inveniosoftware/beard/pull/35, the highest recall we can obtain with LNFI blocking is 0.9816.
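(For reference, a minimal sketch of LNFI blocking as I understand it: block on the exact surname plus the first initial. The function name is hypothetical.)

```python
def lnfi_key(author_name):
    """LNFI = Last Name, First Initial: the blocking key of a signature."""
    surname, _, given = author_name.partition(",")
    return (surname.strip().lower(), given.strip()[:1].upper())

print(lnfi_key("Wang, Fang"))      # ('wang', 'F')
print(lnfi_key("Johnson, Randy"))  # ('johnson', 'R')
```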
Here is the feature importance table:
Name | Importance |
---|---|
second_initial_equality | 2.29418443e-02 |
first_given_name_similarity | 5.27462494e-02 |
second_given_name_similarity | 2.48752032e-03 |
full_name_similarity | 1.57754884e-01 |
other_names_similarity | 1.34540621e-01 |
initials_similarity | 7.49790123e-08 |
affiliation_similarity | 1.31264315e-01 |
adjacent_coauthors_similarity | 1.92384182e-01 |
title_similarity | 6.66225566e-02 |
journal_similarity | 4.22976083e-02 |
abstract_similarity | 3.15329389e-02 |
keywords_similarity | 5.89478829e-02 |
subject_similarity | 3.81763655e-02 |
collaboration_similarity | 1.02636333e-02 |
year_difference | 5.80393237e-02 |
Due to the introduction of the new features, `initials_similarity` will be dropped. I hoped it might still be useful for cases where there are more than two given names, but apparently it is irrelevant.
The result of the clustering is available at `/home/scarli/results/output_june_24.json`.
I can see that `subject_similarity` did not contribute much to the importance. However, it increases the overall evaluation a bit. Let us see how much we can gain if we add ethnicity features. If `initials_similarity` does not consume any considerable time, I would vote for keeping it even though it has a very low importance.
Good news!
I ran the algorithm using my blocking strategy with threshold = 1 (equivalent to splitting over the first characters of the double metaphone result). The features `surnames_similarity` and `first_initial_equality` were added (they were obsolete in the case of LNFI). The score improved.
[Parallel(n_jobs=-1)]: Done 10074 out of 10074 | elapsed: 208.0min finished
Number of blocks = 10074
True number of clusters 15575
Number of computed clusters 14675
B^3 F-score (overall) = 0.9849559139067491
B^3 F-score (train) = 0.9906319911282302
B^3 F-score (test) = 0.9846093408117573
I will rerun the algorithm to check the precision and recall. Note that, unlike before, the pairs were sampled without treating equality of names as two different cases. Note also that the pairs were sampled without considering the split between the training and test sets (the score might decrease a bit after taking this into account).
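A loose sketch of the blocking strategy, as I read the description above: group by the double metaphone key of the surname, then split every block larger than `threshold` by the first initial (so threshold = 1 splits every non-trivial block). The structure and names are hypothetical, not beard's implementation; `doublemetaphone` assumes the Metaphone package.

```python
from collections import defaultdict
from metaphone import doublemetaphone  # assumed external package

def phonetic_blocks(signatures, threshold=1):
    # First pass: group by the phonetic key of the surname only.
    coarse = defaultdict(list)
    for sig in signatures:
        surname = sig["author_name"].partition(",")[0].strip()
        coarse[doublemetaphone(surname)[0]].append(sig)
    # Second pass: split blocks above the threshold by the first initial.
    blocks = defaultdict(list)
    for key, sigs in coarse.items():
        if len(sigs) <= threshold:
            blocks[key].extend(sigs)
            continue
        for sig in sigs:
            given = sig["author_name"].partition(",")[2].strip()
            blocks[(key, given[:1].upper())].append(sig)
    return blocks
```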
Precision, recall and f-score for the previous post:
B^3 (overall) = 0.9857904093245792 0.984122830135025 0.9849559139067491
B^3 (train) = 0.9952863025258638 0.986021007532837 0.9906319911282302
B^3 (test) = 0.9851164759683962 0.9841027275299278 0.9846093408117573
Due to some invalid data indicated in the big summary table, I decided to fix the input for the D.Wang, R.Johnson, V.Visnjic and W.Li clusters. In the table, the ids of these blocks are 13247, 6254, 7832 and 3944.
The new clusters file is available at `/home/scarli/clustersimproved.json`.
Results of the algorithm with correct pair sampling (precision, recall, f-score):
With the `surnames_similarity` feature:
Number of blocks = 10074
True number of clusters 15388
Number of computed clusters 14683
B^3 (overall) = 0.9852518017207982 0.9834729083475536 0.9843615513510563
B^3 (train) = 0.996422982510764 0.9865477668374687 0.9914607853339493
B^3 (test) = 0.9844047733854243 0.9833353660992127 0.9838697791470511
Without it:
Number of blocks = 10074
True number of clusters 15388
Number of computed clusters 14705
B^3 (overall) = 0.9850478779929729 0.9831105615996569 0.9840782663174861
B^3 (train) = 0.9965251672952843 0.9863770701902301 0.9914251507769879
B^3 (test) = 0.9841691567953614 0.9829758605008408 0.9835721467134105
Result for blocking with double metaphone without a threshold:
Number of blocks = 4797
True number of clusters 15388
Number of computed clusters 15699
B^3 (overall) = 0.9773754185932135 0.9817767784121295 0.9795711545355091
B^3 (train) = 0.9899992085911132 0.9866716219081106 0.9883326143702338
B^3 (test) = 0.9766268174922395 0.981520234352044 0.9790674115885576
Result for blocking with nysiis with threshold = 1:
Number of blocks = 10804
True number of clusters 15575
Number of computed clusters 14995
B^3 (overall) = 0.9852947677142105 0.981741997355113 0.9835151741103332
B^3 (train) = 0.9964027111117821 0.9851529076879183 0.990745875378954
B^3 (test) = 0.9844895453357658 0.981596829569563 0.9830410594168337
Slightly worse than double metaphone with threshold=1 (two posts above).
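The difference between the two phonetic encodings is easy to inspect (assuming the Metaphone and jellyfish packages; not the exact calls used in beard):

```python
from metaphone import doublemetaphone
import jellyfish

for surname in ["Wang", "Johnson", "Mitra", "Nakamura"]:
    print(surname, doublemetaphone(surname)[0], jellyfish.nysiis(surname))
```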
Closing. Results are summarized in http://arxiv.org/abs/1508.07744
The computations for the error analysis were done on 1.2 million signatures using LNFI blocking and the default clustering strategy. The overall `b3_f_score` for this strategy is 0.98111.
1). Chart showing the dependency between the precision and the size of the cluster.
For every predicted cluster, the precision was computed. The clusters were sorted and, for every 100 clusters (by this I mean the ranges 0-99, 100-199, etc.), I computed the mean of the precision. The y axis shows the means; the x axis has no meaning, but it is important to note that the clusters are sorted: the smallest are on the left, the biggest on the right. You can notice that there are no errors for the smallest clusters (all of them contain only one signature). The blue line is `b3_precision`, the green one `paired_precision`.
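For reference, the B^3 scores used throughout this thread follow the standard per-signature definitions (the textbook formulation, not copied from beard's code):

$$P_{B^3} = \frac{1}{|S|} \sum_{e \in S} \frac{|C(e) \cap T(e)|}{|C(e)|}, \qquad R_{B^3} = \frac{1}{|S|} \sum_{e \in S} \frac{|C(e) \cap T(e)|}{|T(e)|}, \qquad F_{B^3} = \frac{2 \, P_{B^3} R_{B^3}}{P_{B^3} + R_{B^3}}$$

where $S$ is the set of signatures, $C(e)$ is the predicted cluster of signature $e$ and $T(e)$ its groundtruth cluster. The per-cluster values in the charts above restrict the average to the signatures of a single cluster.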