michaelbales1 closed this issue 9 years ago
Added logic for Phase 1 journal similarity.
If the average journal similarity is greater than 0.8, the current cosine similarity is multiplied by 1.5.
The average journal similarity is computed as (sum of journal similarity scores that exist in table) / (# of comparisons made).
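A minimal sketch of the Phase 1 rule as described here, not the actual ReCiter code. The class name, method names, and the `pairKey` scheme are hypothetical. It also assumes that a pair missing from the table still counts as a comparison (contributing 0 to the sum); if the real code skips such pairs, the denominator would differ.

```java
import java.util.List;
import java.util.Map;

class Phase1JournalSimilarity {
    static final double THRESHOLD = 0.8;   // threshold stated in this comment
    static final double MULTIPLIER = 1.5;  // multiplier stated in this comment

    // Hypothetical order-independent key for a journal pair.
    static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    // (sum of scores that exist in the table) / (number of comparisons made).
    // Assumption: every journal in the cluster counts as one comparison.
    static double averageJournalSimilarity(String journal,
                                           List<String> clusterJournals,
                                           Map<String, Double> table) {
        if (clusterJournals.isEmpty()) {
            return 0.0; // nothing to compare against
        }
        double sum = 0.0;
        for (String other : clusterJournals) {
            Double score = table.get(pairKey(journal, other));
            if (score != null) {
                sum += score;
            }
        }
        return sum / clusterJournals.size();
    }

    // Multiply the cosine similarity by 1.5 only when the average exceeds 0.8.
    static double adjust(double cosineSim, double avgJournalSim) {
        return avgJournalSim > THRESHOLD ? cosineSim * MULTIPLIER : cosineSim;
    }
}
```

Note that this rule is all-or-nothing: an average of 0.79 leaves the cosine similarity unchanged, while 0.81 multiplies it by 1.5.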
Currently, similarity scores are cached once they are retrieved, but fetching this data still takes a long time. We need to come up with a better retrieval method.
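For reference, the caching described here could look roughly like the sketch below. This is an illustration only: `CachedJournalSimilarity` and `slowLookup` are hypothetical names standing in for whatever actually fetches a score, and treating a missing pair as 0.0 is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

class CachedJournalSimilarity {
    private final Map<String, Double> cache = new HashMap<>();
    // Hypothetical stand-in for the slow retrieval step (database, file, etc.).
    private final BiFunction<String, String, Double> slowLookup;

    CachedJournalSimilarity(BiFunction<String, String, Double> slowLookup) {
        this.slowLookup = slowLookup;
    }

    // Returns the cached score for a journal pair, fetching it only on first use.
    double get(String journalA, String journalB) {
        // Order-independent key so (A, B) and (B, A) share one cache entry.
        String key = journalA.compareTo(journalB) <= 0
                ? journalA + "|" + journalB
                : journalB + "|" + journalA;
        return cache.computeIfAbsent(key, k -> {
            Double score = slowLookup.apply(journalA, journalB);
            // Assumption: a pair with no score is cached as 0.0.
            return score == null ? 0.0 : score;
        });
    }
}
```

A cache like this only helps on repeated pairs. The first retrieval of each pair is still slow, which matches the problem noted above.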
Paul: Match the current article being assigned to the original member of a cluster. Journal similarity on a sliding scale: sim = (1 + f(A, B)) * sim
Added sim *= (1 + journalSimScore); to ReCiterClusterer.java as a sliding-scale measurement.
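To illustrate the effect of the sliding scale, here is a small standalone sketch. The class and method names are hypothetical, and it assumes `journalSimScore` lies between 0 and 1, which the thread does not state explicitly.

```java
class SlidingScaleJournalBoost {
    // sim = (1 + f(A, B)) * sim, where f(A, B) is the journal similarity score.
    static double apply(double sim, double journalSimScore) {
        return sim * (1 + journalSimScore);
    }
}
```

Under that assumption, a journal score of 0 leaves `sim` unchanged, a score of 0.5 gives the same 1.5x multiplier as the earlier threshold rule, and a score of 1 doubles it.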
I realize there are various competing scores involved, but we have a couple of cases where this logic should have worked but didn't.
rmm2002 authored an article (18068236) in "Int J Cardiol", but it isn't mapped to the cluster that is almost purely Cardiology, so it ends up as a false negative. What explains this?
The article is now assigned to the correct cluster after relaxing the affiliation-matching constraint from "weill cornell medical college" to "weill cornell". Another option would be to increase the journal similarity weight further.
Good thinking, @jl987-Jie. "weill cornell" should work for post-2000. "cornell medical" will work for earlier papers.
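A rough sketch of the relaxed affiliation check discussed in the two comments above. The class name and the simple substring approach are assumptions for illustration, not the actual ReCiter matching code.

```java
import java.util.Locale;

class RelaxedAffiliationMatcher {
    // Phrases suggested in this thread: "weill cornell" for post-2000 papers,
    // "cornell medical" for earlier ones.
    static final String[] PHRASES = {"weill cornell", "cornell medical"};

    // Case-insensitive substring match against either phrase.
    static boolean matches(String affiliation) {
        if (affiliation == null) {
            return false;
        }
        String normalized = affiliation.toLowerCase(Locale.ROOT);
        for (String phrase : PHRASES) {
            if (normalized.contains(phrase)) {
                return true;
            }
        }
        return false;
    }
}
```

Shorter phrases match more affiliation strings, so this trades some precision for recall compared with requiring the full "weill cornell medical college".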
Jie has informed me that coding on this issue is completed.
Background: Paul downloaded five years of Medline records and their associated MeSH terms. Then, using Jie's workflow for the grant recommendation tool project, Paul calculated the "field scores" for each journal. Then, Paul used some basic (non-rigorous) arithmetic to calculate similarity between journals such that any two journals that had at least one MeSH term in the past five years (n = 2300) have a similarity score relative to each other.
Understanding the scores:
How can this be used?
I will provide a link to the similarity file (300 MB) outside of GitHub.