wcmc-its / ReCiter

ReCiter: an enterprise open source author disambiguation system for academic institutions

Apache License 2.0

Use journal similarity for phase one and two matching #21

Closed michaelbales1 closed 9 years ago

michaelbales1 commented 9 years ago

Background: Paul downloaded five years of Medline records and their associated MeSH terms. Then, using Jie's workflow for the grant recommendation tool project, he calculated the "field scores" for each journal. Finally, he used some basic (non-rigorous) arithmetic to calculate the similarity between journals, so that any two journals that had at least one MeSH term in the past five years (n = 2300) have a similarity score relative to each other.

Understanding the scores:

The range is 0-1. The average score is ~0.65.
A high score suggests high similarity and a greater likelihood that a given author wrote for both journals. A low score suggests the opposite.
The matching is done based on Medline title abbreviation. For example, the PMID 25864809 has a title abbreviation of Acad Pediatr.

How can this be used?

Phase One: decide if a publication should be part of a cluster; for example:
- You have 3 articles, A, B, and C. A has the most complete information followed by B.
- Do a lookup in the table to see how similar A is to B. Their similarity is 0.91. So you put them in the same cluster.
- The similarity of C to A and C to B averages 0.6 (see distribution), which falls below the predetermined arbitrary threshold (0.8?), so C is in its own cluster.
Phase Two: use default department scores when people have few publications
- The same sort of matching described above can work for Phase Two matching.
- This is especially true if someone has few or no publications, once we can calculate default department scores (not ready yet).
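
The Phase One lookup described above could be sketched roughly as follows. This is only an illustration: the class `JournalSimilarityTable`, its methods, and the journal names are hypothetical and not taken from the ReCiter code base. It assumes the score for a pair is the same in both directions and uses 0.8 as a placeholder for the still-undecided threshold. It checks a single pair, whereas the example above averages C against both A and B.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the ReCiter code base.
class JournalSimilarityTable {
    // Placeholder for the "predetermined arbitrary threshold (0.8?)".
    static final double THRESHOLD = 0.8;

    private final Map<String, Double> scores = new HashMap<>();

    // Build an order-independent key from two Medline title abbreviations.
    // Assumption: similarity is symmetric and abbreviations never contain '|'.
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    void put(String journalA, String journalB, double score) {
        scores.put(key(journalA, journalB), score);
    }

    // Returns null when the pair has no entry in the table.
    Double lookup(String journalA, String journalB) {
        return scores.get(key(journalA, journalB));
    }

    // True when the pair is in the table and its score exceeds the threshold.
    boolean sameCluster(String journalA, String journalB) {
        Double score = lookup(journalA, journalB);
        return score != null && score > THRESHOLD;
    }

    public static void main(String[] args) {
        JournalSimilarityTable table = new JournalSimilarityTable();
        // Made-up journal names and scores mirroring the A/B/C example.
        table.put("Journal A", "Journal B", 0.91);
        table.put("Journal A", "Journal C", 0.60);
        System.out.println(table.sameCluster("Journal B", "Journal A")); // true
        System.out.println(table.sameCluster("Journal C", "Journal A")); // false
    }
}
```

Keying on the sorted pair means each pair only needs to be stored once, which may matter for a table of this size.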

I will provide a link to the similarity file (300 MB) outside of GitHub.

jl987-Jie commented 9 years ago

Added logic for Phase 1 journal similarity.

If the average journal similarity is > 0.8, the current cosine similarity is multiplied by 1.5.

The average journal similarity is computed as (sum of journal similarity scores that exist in table) / (# of comparisons made).
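
As a rough illustration of the rule above (not the actual ReCiter code), the sketch below computes the average and applies the 1.5 boost. The class and method names are hypothetical. It reads the formula literally: pairs missing from the table add nothing to the sum but still count as a comparison. That reading is an assumption, since the comment does not say how misses are handled.

```java
import java.util.List;

// Hypothetical sketch; names are illustrative, not from ReCiterClusterer.java.
class PhaseOneBoost {
    static final double THRESHOLD = 0.8; // threshold stated in the comment
    static final double BOOST = 1.5;     // multiplier stated in the comment

    // One entry per comparison made; null means the pair was not in the table.
    // Assumption: misses add 0 to the sum but still count in the denominator.
    static double averageJournalSimilarity(List<Double> scores) {
        if (scores.isEmpty()) {
            return 0.0; // no comparisons made, so no evidence either way
        }
        double sum = 0.0;
        for (Double s : scores) {
            if (s != null) {
                sum += s;
            }
        }
        return sum / scores.size();
    }

    // Multiply the cosine similarity by 1.5 only when the average exceeds 0.8.
    static double adjust(double cosineSim, List<Double> scores) {
        return averageJournalSimilarity(scores) > THRESHOLD
                ? cosineSim * BOOST
                : cosineSim;
    }
}
```

Under this reading, a cluster with many journal pairs missing from the table is less likely to clear the threshold, even if the pairs that do exist score highly.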

jl987-Jie commented 9 years ago

Currently, the similarity scores are cached once they are retrieved. Fetching this data still takes a long time, so we need to come up with a better retrieval method.

jl987-Jie commented 9 years ago

Paul: Match the current article being assigned to the original member of a cluster. Journal similarity on a sliding scale: sim = (1 + f(A, B)) * sim

jl987-Jie commented 9 years ago

Added sim *= (1 + journalSimScore); to ReCiterClusterer.java as a sliding scale measurement.
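
For comparison with the earlier fixed 1.5 boost, here is a minimal, self-contained sketch of the sliding-scale formula sim = (1 + f(A, B)) * sim. The class `SlidingScale` and the range check are illustrative assumptions. Only the one-line update itself comes from the comment above.

```java
// Hypothetical sketch of the sliding-scale adjustment.
class SlidingScale {
    // journalSimScore is f(A, B): the journal similarity between the article
    // being assigned and the original member of the cluster, in [0, 1].
    static double apply(double sim, double journalSimScore) {
        // Assumption: scores outside the documented 0-1 range are an error.
        if (journalSimScore < 0.0 || journalSimScore > 1.0) {
            throw new IllegalArgumentException("journal similarity must be in [0, 1]");
        }
        // Same update as `sim *= (1 + journalSimScore);`
        return sim * (1.0 + journalSimScore);
    }
}
```

With this formula the multiplier ranges from 1 (score 0) to 2 (score 1). A pair at the reported average score of ~0.65 is therefore multiplied by about 1.65, which is more than the earlier fixed boost of 1.5.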

paulalbert1 commented 9 years ago

I realize there are various competing scores involved, but we have a couple cases where this logic should have worked but didn't.

rmm2002 authored an article (18068236) in "Int J Cardiol" but it isn't successfully mapped to the cluster that is almost pure Cardiology, and as a result is a false negative. What explains this?
jww2001 should map to 25623219 based on journal similarity.
rdgranst should map to 10951265 and 24763504
cnathan by virtue of his other publications in immunology should map to the journals associated with these articles which are highly focused on immunology: "19286131 11135572 16192449 9698876 7553846 1725935 2040651 2715632 7909663 2510571"
For ljgudas, these papers - 1939678, 2289970 - shouldn't be clustered with the large group of biochemistry/cancer papers. These two articles appear in J Dev Behav Pediatr, which is a psychiatry / behavioral science journal.
For rjm2002, these papers should be clustered with the larger group of radiology papers:
- 24065258
- 23849389
- 23245821
- 21694949
- 21266554
- 20876891
- 20813299
- 19237054
- 17901143
- 9124127
mroman studies cardiovascular disease as evidenced by the true positives, but there is a group of false negatives also in that field that got left out of the cluster:
- 2470462
- 2437995
- 24268115
brs9035 does cardiothoracic surgery as evidenced by the true positives, but there is a whole other group of articles, which he didn't write, published in journals devoted to infectious disease and biochemistry:
- 8726063
- 1725235
- 10391869
- 19747126
- 19254170
- 9234818
- 7890377
- 8225606
- 1343780
- 1718309
- 20949003
- 9359860
- 7945236
- 2012611
- 15609239
- 9534976
- 8212028
- 1378233
- 1862522
- 1796476
- 1724806
- 9066029

jl987-Jie commented 9 years ago

> rmm2002 authored an article (18068236) in "Int J Cardiol" but it isn't successfully mapped to the cluster that is almost pure Cardiology, and as a result is a false negative. What explains this?

Assigned to the correct cluster by relaxing the constraint for affiliation matching from "weill cornell medical college" to "weill cornell". Another method would be to increase the journal similarity further.

paulalbert1 commented 9 years ago

Good thinking, @jl987-Jie. "weill cornell" should work for post-2000. "cornell medical" will work for earlier papers.
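
A minimal sketch of the relaxed affiliation check discussed above, assuming a simple case-insensitive substring match. The class `AffiliationMatcher` is hypothetical and is not how ReCiter necessarily implements this. It tries both phrases regardless of publication year, which is a simplification of the pre-2000 / post-2000 split.

```java
import java.util.Locale;

// Hypothetical sketch; not taken from the ReCiter code base.
class AffiliationMatcher {
    // The two phrases suggested in the thread.
    private static final String[] PHRASES = {"weill cornell", "cornell medical"};

    // True when the affiliation string contains either phrase, ignoring case.
    // Assumption: both phrases are checked for every paper, whatever its year.
    static boolean matches(String affiliation) {
        if (affiliation == null) {
            return false;
        }
        String normalized = affiliation.toLowerCase(Locale.ROOT);
        for (String phrase : PHRASES) {
            if (normalized.contains(phrase)) {
                return true;
            }
        }
        return false;
    }
}
```

A plain substring match like this will not match strings where other words sit between "cornell" and "medical", so real affiliation data may need a looser pattern.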

michaelbales1 commented 9 years ago

Jie has informed me that coding on this issue is completed.