wcmc-its / ReCiter

ReCiter: an enterprise open source author disambiguation system for academic institutions
Apache License 2.0

journalDepartmentCategory scoring should pick most favorable match #355

Closed (paulalbert1 closed this 5 years ago)

paulalbert1 commented 5 years ago

For blhempst, nearly every article is matched only against this faculty member's primary department (Neuroscience); anything else is scored as NO_MATCH.

Current state

| pmid | journalSubfieldScienceMetrixLabel | journalSubfieldScienceMetrixID | matchingDepartment-current | subfieldScore-Current |
|---|---|---|---|---|
| 26282324 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 29909994 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 26311773 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 16472198 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 18322085 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 30219601 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 19526280 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 15067313 | Immunology | 111 | NO_MATCH | -1 |
| 14702107 | Immunology | 111 | NO_MATCH | -1 |
| 19136973 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 26004511 | Developmental Biology | 88 | NO_MATCH | -1 |
| 15831817 | Cardiovascular System & Hematology | 100 | NO_MATCH | -1 |
| 15765148 | Immunology | 111 | NO_MATCH | -1 |
| 15930396 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 23091165 | Immunology | 111 | NO_MATCH | -1 |
| 19152774 | Nuclear Medicine & Medical Imaging | 114 | NO_MATCH | -1 |
| 20186707 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 29084453 | Psychiatry | 123 | NO_MATCH | -1 |
| 30236287 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 15753226 | Cardiovascular System & Hematology | 100 | NO_MATCH | -1 |
| 17023662 | General Science & Technology | 83 | NO_MATCH | -1 |
| 19358879 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 15987945 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 24498100 | General Science & Technology | 83 | NO_MATCH | -1 |
| 16330706 | Cardiovascular System & Hematology | 100 | NO_MATCH | -1 |
| 15128854 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 24573298 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 21730062 | Biochemistry & Molecular Biology | 86 | NO_MATCH | -1 |
| 21525279 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 16025106 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 24013014 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 25744957 | Public Health | 141 | NO_MATCH | -1 |
| 24920623 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 16630834 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 27680698 | General Science & Technology | 83 | NO_MATCH | -1 |
| 17020964 | Oncology & Carcinogenesis | 116 | NO_MATCH | -1 |
| 16707781 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 19828787 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 17934455 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 19407813 | Developmental Biology | 88 | NO_MATCH | -1 |
| 15702476 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 17188890 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 8918832 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 22621370 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 16855103 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 23055476 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 12408842 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 21084616 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 15486301 | General Science & Technology | 83 | NO_MATCH | -1 |
| 15668238 | Biochemistry & Molecular Biology | 86 | NO_MATCH | -1 |
| 10751441 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 12957860 | Biochemistry & Molecular Biology | 86 | NO_MATCH | -1 |
| 17005175 | Developmental Biology | 88 | NO_MATCH | -1 |
| 11729324 | General Science & Technology | 83 | NO_MATCH | -1 |
| 15028767 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 21834083 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 17482097 | Cardiovascular System & Hematology | 100 | NO_MATCH | -1 |
| 12424359 | General Science & Technology | 83 | NO_MATCH | -1 |
| 18815271 | Neurology & Neurosurgery | 113 | Neuroscience | 2.52 |
| 11021829 | Pathology | 120 | NO_MATCH | -1 |
| 15277529 | Biochemistry & Molecular Biology | 86 | NO_MATCH | -1 |
| 15169782 | Biochemistry & Molecular Biology | 86 | NO_MATCH | -1 |
| 10195934 | Cardiovascular System & Hematology | 100 | NO_MATCH | -1 |
| 2554433 | Endocrinology & Metabolism | 105 | NO_MATCH | -1 |
| 14704852 | Biochemistry & Molecular Biology | 86 | NO_MATCH | -1 |
| 10825157 | Biochemistry & Molecular Biology | 86 | NO_MATCH | -1 |

Proposed

Consistent with the language in the original issue, the system should look at all of the department affiliations and then identify the highest-scoring match. Here's how it would score the above categories:

| logOddsRatio | primaryDepartment | scienceMetrixJournalSubfield | scienceMetrixJournalSubfieldId |
|---|---|---|---|
| 2.33 | NeuroScience | Neurology & Neurosurgery | 113 |
| 2.28 | Hematology and Medical Oncology | Oncology & Carcinogenesis | 116 |
| 1.93 | Hematology and Medical Oncology | Immunology | 111 |
| 0.86 | NeuroScience | Psychiatry | 123 |
| 0.68 | Medicine | Cardiovascular System & Hematology | 100 |
| 0.48 | Medicine | Endocrinology & Metabolism | 105 |
| 0.36 | NeuroScience | General Science & Technology | 83 |
| 0.22 | NeuroScience | Developmental Biology | 88 |
| 0.01 | NeuroScience | Biochemistry & Molecular Biology | 86 |
| -0.04 | Hematology and Medical Oncology | Pathology | 120 |
| -0.07 | Medicine | Public Health | 141 |
| -0.09 | Neuroscience | Nuclear Medicine & Medical Imaging | 114 |
| -0.23 | Medicine | Dermatology & Venereal Diseases | 103 |
| -0.41 | NeuroScience | Bioinformatics | 12 |
| -0.55 | NeuroScience | Biomedical Engineering | 21 |

sarbajitdutta commented 5 years ago

@paulalbert1 is this for matching against the updated ScienceMetrixDepartmentCategory table? I ask because I checked the code, and it does take the max of logOddsRatio.

sarbajitdutta commented 5 years ago

To be exact, this line: https://github.com/wcmc-its/ReCiter/blob/master/src/main/java/reciter/algorithm/evidence/targetauthor/journalcategory/strategy/JournalCategoryStrategy.java#L69

paulalbert1 commented 5 years ago

@sarbajitdutta - If I run this for blhempst, I see the following 72 out of 141 times:

                    "journalSubfieldDepartment": "NO_MATCH",

That suggests to me that it is only considering the "Neuroscience" department. I'm asking that it consider all departments in Identity. You should get virtually no instances of NO_MATCH.
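
To make the request concrete, here is a minimal sketch of the behavior being asked for. It is not the actual `JournalCategoryStrategy` code. The class `BestDepartmentMatch`, the `CategoryScore` type, and the `bestMatch` method are hypothetical names used only for illustration, and the case-insensitive department comparison is an assumption prompted by the mixed "NeuroScience"/"Neuroscience" spellings in the table. The idea is to filter the scoring table by the article's journal subfield, keep rows for any department in Identity rather than only the primary one, and take the row with the highest logOddsRatio.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class BestDepartmentMatch {

    /** Hypothetical row of the department/subfield scoring table. */
    static final class CategoryScore {
        final double logOddsRatio;
        final String department;
        final int subfieldId;

        CategoryScore(double logOddsRatio, String department, int subfieldId) {
            this.logOddsRatio = logOddsRatio;
            this.department = department;
            this.subfieldId = subfieldId;
        }
    }

    /**
     * Returns the highest-scoring row for the given journal subfield across ALL of the
     * person's departments. An empty result corresponds to NO_MATCH.
     */
    static Optional<CategoryScore> bestMatch(List<CategoryScore> table,
                                             List<String> departments,
                                             int subfieldId) {
        return table.stream()
                .filter(row -> row.subfieldId == subfieldId)
                // Assumption: department names are compared case-insensitively.
                .filter(row -> departments.stream()
                        .anyMatch(d -> d.equalsIgnoreCase(row.department)))
                .max(Comparator.comparingDouble((CategoryScore row) -> row.logOddsRatio));
    }

    public static void main(String[] args) {
        List<CategoryScore> table = Arrays.asList(
                new CategoryScore(2.33, "NeuroScience", 113),
                new CategoryScore(1.93, "Hematology and Medical Oncology", 111),
                new CategoryScore(0.68, "Medicine", 100),
                // Invented row, for illustration only: a second department scoring subfield 113.
                new CategoryScore(0.10, "Medicine", 113));

        // Departments assumed for illustration; in practice these would come from Identity.
        List<String> departments = Arrays.asList(
                "Neuroscience", "Medicine", "Hematology and Medical Oncology");

        for (int subfieldId : new int[] {113, 111, 100, 999}) {
            String result = bestMatch(table, departments, subfieldId)
                    .map(row -> row.department + " " + row.logOddsRatio)
                    .orElse("NO_MATCH");
            System.out.println(subfieldId + " -> " + result);
        }
    }
}
```

With this approach, NO_MATCH would occur only when none of the person's departments has a row for the journal's subfield.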