wcmc-its / ReCiter

ReCiter: an enterprise open source author disambiguation system for academic institutions
Apache License 2.0

Use journal similarity for phase one and two matching #21

Closed michaelbales1 closed 9 years ago

michaelbales1 commented 9 years ago

Background: Paul downloaded five years of Medline records and their associated MeSH terms. Using Jie's workflow from the grant recommendation tool project, he calculated the "field scores" for each journal. He then used some basic (non-rigorous) arithmetic to calculate similarity between journals, so that any two journals that had at least one MeSH term in the past five years (n = 2300) have a similarity score relative to each other.

Understanding the scores:

How can this be used?

I will provide a link to the similarity file (300 MB) outside of GitHub.

jl987-Jie commented 9 years ago

Added logic for Phase 1 journal similarity.

If the average journal similarity is greater than 0.8, the current cosine similarity is multiplied by 1.5.

The average journal similarity is computed as (sum of journal similarity scores that exist in table) / (# of comparisons made).
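The Phase 1 rule described above can be sketched roughly as follows. This is only an illustration, not the code in ReCiter: the class name, the method names, the `pairKey` format, and the in-memory `Map` standing in for the similarity table are all assumptions. It also assumes that "# of comparisons made" counts every journal pair compared, including pairs with no score in the table.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the Phase 1 journal similarity boost.
final class PhaseOneJournalBoost {
    static final double THRESHOLD = 0.8; // from the comment above
    static final double BOOST = 1.5;     // from the comment above

    // Assumed key format: the two journal names in lexicographic order.
    static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    // (sum of scores that exist in the table) / (number of comparisons made).
    // Pairs missing from the table add nothing to the sum but still count
    // as a comparison (an assumption about the denominator).
    static double averageJournalSimilarity(String journal,
                                           List<String> clusterJournals,
                                           Map<String, Double> table) {
        if (clusterJournals.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (String other : clusterJournals) {
            Double score = table.get(pairKey(journal, other));
            if (score != null) {
                sum += score;
            }
        }
        return sum / clusterJournals.size();
    }

    // Multiply the cosine similarity by 1.5 only when the average exceeds 0.8.
    static double boostedSimilarity(double cosineSim, double averageJournalSim) {
        return averageJournalSim > THRESHOLD ? cosineSim * BOOST : cosineSim;
    }
}
```

Under this reading, a cluster with many journal pairs absent from the table will rarely clear the 0.8 threshold, because the missing pairs pull the average down.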

jl987-Jie commented 9 years ago

Currently, the similarity scores are cached once they are retrieved, but fetching this data still takes a long time. We need to come up with a better retrieval method.
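For illustration only, a cache of the kind described here might look like the sketch below. The class name, the `loader` callback (standing in for whatever slow lookup fetches a score), and the tab-separated key are hypothetical and are not taken from the ReCiter code. The sketch treats similarity as symmetric, so (A, B) and (B, A) share one entry, and it also remembers pairs that have no score so they are not fetched again.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BiFunction;

// Hypothetical memoizing cache for journal similarity lookups.
final class JournalSimilarityCache {
    private final Map<String, Optional<Double>> cache = new HashMap<>();
    // Assumed slow lookup, e.g. a database or file read; empty means "no score".
    private final BiFunction<String, String, Optional<Double>> loader;

    JournalSimilarityCache(BiFunction<String, String, Optional<Double>> loader) {
        this.loader = loader;
    }

    Optional<Double> get(String journalA, String journalB) {
        // Order the pair so that (A, B) and (B, A) map to the same key.
        boolean inOrder = journalA.compareTo(journalB) <= 0;
        String first = inOrder ? journalA : journalB;
        String second = inOrder ? journalB : journalA;
        // The loader runs only on the first request for a given pair.
        return cache.computeIfAbsent(first + "\t" + second,
                key -> loader.apply(first, second));
    }

    int size() {
        return cache.size();
    }
}
```

A cache like this only helps on repeated lookups. It does not shorten the first fetch of each pair, which is the cost the comment above is concerned with.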

jl987-Jie commented 9 years ago

Paul: Match the current article being assigned to the original member of a cluster. Journal similarity on a sliding scale: sim = (1 + f(A, B)) * sim

jl987-Jie commented 9 years ago

Added sim *= (1 + journalSimScore); to ReCiterClusterer.java as a sliding scale measurement.
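As a small worked illustration of the sliding scale, the update `sim *= (1 + journalSimScore)` can be isolated as below. The class and method names are hypothetical, and the example assumes that a journal pair with no score contributes 0, which leaves `sim` unchanged.

```java
// Hypothetical isolation of the sliding-scale update; not the ReCiterClusterer code.
final class SlidingScaleSimilarity {
    // Returns sim * (1 + journalSimScore): a score of 0 leaves sim as it is,
    // and larger scores scale it up proportionally.
    static double apply(double sim, double journalSimScore) {
        return sim * (1.0 + journalSimScore);
    }

    public static void main(String[] args) {
        System.out.println(apply(0.5, 0.0)); // no journal similarity: 0.5
        System.out.println(apply(0.5, 1.0)); // score of 1 doubles sim: 1.0
    }
}
```

Unlike the earlier fixed 1.5x boost above a 0.8 threshold, this form has no cutoff, so even weakly similar journals nudge the score upward.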

paulalbert1 commented 9 years ago

I realize there are various competing scores involved, but we have a couple cases where this logic should have worked but didn't.

jl987-Jie commented 9 years ago

rmm2002 authored an article (18068236) in "Int J Cardiol", but it isn't mapped to the cluster that is almost pure Cardiology and, as a result, is a false negative. What explains this?

The article is now assigned to the correct cluster after relaxing the affiliation-matching constraint from "weill cornell medical college" to "weill cornell". Another option would be to increase the journal similarity further.

paulalbert1 commented 9 years ago

Good thinking, @jl987-Jie. "weill cornell" should work for post-2000. "cornell medical" will work for earlier papers.
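A relaxed affiliation check along these lines could be sketched as follows. This is a hypothetical illustration: the class name, the method, and the case-insensitive substring approach are assumptions, and only the two patterns come from the comments above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of relaxed affiliation matching.
final class AffiliationMatcher {
    // "weill cornell" is meant for post-2000 papers, "cornell medical" for earlier ones.
    static final List<String> PATTERNS = Arrays.asList("weill cornell", "cornell medical");

    // True if the affiliation string contains any pattern, ignoring case.
    static boolean matches(String affiliation) {
        if (affiliation == null) {
            return false;
        }
        String normalized = affiliation.toLowerCase(Locale.ROOT);
        for (String pattern : PATTERNS) {
            if (normalized.contains(pattern)) {
                return true;
            }
        }
        return false;
    }
}
```

Shorter patterns match more affiliation variants but also raise the risk of false positives, so the pattern list is a trade-off rather than a fixed answer.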

michaelbales1 commented 9 years ago

Jie has informed me that coding on this issue is completed.