opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Dropping ambiguous publication entries from literature #3060

Open DSuveges opened 10 months ago

DSuveges commented 10 months ago

As a user, I want unambiguous pmid -> pmcid mappings in literature, because currently multiple pmcids can be linked to a single pmid, which leads to confusing behaviour (for more context, see #3053 and #2970).

Although the bug originates upstream of OT, we should resolve this ambiguity before publicly releasing the post-ETL literature data. By its nature, this issue only affects full-text articles.

Tasks

Acceptance tests

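from pyspark.sql import functions as f

# Count the pmids that are linked to more than one distinct pmcid.
# `matches` is assumed to be the post-ETL literature DataFrame.
(
    matches
    .groupby('pmid')
    .agg(
        # Distinct pmcids seen for each pmid
        f.collect_set(f.col('pmcid')).alias('pmcids')
    )
    .filter(
        f.size(f.col('pmcids')) > 1
    )
    .count()
)
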
mbdebian commented 2 months ago

@remo87 , would you mind checking with @DSuveges on this one? Thanks!

DSuveges commented 2 months ago

The problem still exists, as the above query flags the following pmids:

+--------+-------------------------+
|pmid    |pmcids                   |
+--------+-------------------------+
|32790207|[PMC10081512, PMC7285927]|
|33376052|[PMC7709584, PMC7983453] |
|31745814|[PMC7574644, PMC6940410] |
+--------+-------------------------+

That's all, out of millions of pmids. If there's no way to find out which is the "real" pmcid, we can just drop them.
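
For illustration only, the "just drop them" option could look roughly like the sketch below. It is plain Python rather than Spark, so the logic can be checked without a cluster; the function name `drop_ambiguous_pmids`, the (pmid, pmcid) tuple layout, and the last two sample rows are assumptions made for the example, not part of the ETL.

```python
from collections import defaultdict


def drop_ambiguous_pmids(matches):
    """Keep only rows whose pmid maps to at most one distinct pmcid.

    `matches` is an iterable of (pmid, pmcid) pairs. A pmcid of None
    stands for an entry without full text (hypothetical convention).
    """
    rows = list(matches)

    # Collect the distinct pmcids seen for each pmid.
    pmcids_by_pmid = defaultdict(set)
    for pmid, pmcid in rows:
        if pmcid is not None:
            pmcids_by_pmid[pmid].add(pmcid)

    # A pmid is ambiguous when it is linked to more than one pmcid.
    ambiguous = {
        pmid for pmid, pmcids in pmcids_by_pmid.items() if len(pmcids) > 1
    }

    # Drop every row that belongs to an ambiguous pmid.
    return [row for row in rows if row[0] not in ambiguous]


rows = [
    ("32790207", "PMC10081512"),  # flagged pair from the table above
    ("32790207", "PMC7285927"),
    ("11111111", "PMC0000001"),   # made-up unambiguous entry
    ("22222222", None),           # made-up abstract-only entry
]
kept = drop_ambiguous_pmids(rows)
print(kept)
```

Both rows of the flagged pmid are removed rather than one being picked arbitrarily, which matches the case where the "real" pmcid cannot be determined. Re-running the acceptance query on the filtered data should then flag nothing.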