opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Test if new literature release is missing publications #2852

Closed DSuveges closed 2 months ago

DSuveges commented 1 year ago

There's a well-founded assumption that the new, incrementally released EPMC data does not cover all publications. We need to identify examples that we are certain were not covered in the release, so the EPMC team can work from them.

As a start we'll work on the processed dataset generated by the literature etl: gs://open-targets-data-releases/22.11/output/etl/parquet/

Once we are certain, I'll go back and verify against the EPMC output: gs://otar025-epmc/

ireneisdoomed commented 1 year ago

I was reviewing #2685, so I have a few examples of missing PMIDs that might be useful. The total number is around 860k:

+--------+
|    pmid|
+--------+
|29297679|
|13664338|
|30700998|
|25191313|
|31908995|
|25503845|
|29540403|
|30979036|
|29518970|
|33653743|
|28643048|
|21691589|
|12494726|
| 4151190|
|28875950|
|30941058|
| 4334754|
|30410073|
|20537126|
|29768257|
+--------+

The full list is here: gs://ot-team/irene/missing_pmid_2852

Code to reproduce:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Prod co-occurrences
cooc = spark.read.parquet('gs://open-targets-data-releases/22.11/output/literature-etl/parquet/cooccurrences/')

# Potential new co-occurrences
full_text = spark.read.json('gs://otar025-epmc/Full-text/2022_03_23')
abstract = spark.read.json('gs://otar025-epmc/Abstracts/2022_05_27')

# PMIDs with at least one co-occurrence in the new pipeline output
new_pipeline_abstract_ids = abstract.select(f.explode("sentences.co-occurrence").alias("s"), "pmid").filter(f.col("s").isNotNull()).select("pmid").distinct().persist()
new_pipeline_ft_ids = full_text.select(f.explode("sentences.co-occurrence").alias("s"), "pmid").filter(f.col("s").isNotNull()).select("pmid").distinct().persist()
new_pipeline_ids = new_pipeline_ft_ids.unionByName(new_pipeline_abstract_ids, allowMissingColumns=True).persist()

# Missing pmids: present in prod co-occurrences but absent from the new pipeline output
missing = cooc.select("pmid").join(new_pipeline_ids, on=["pmid"], how="left_anti").distinct().persist()

DSuveges commented 1 year ago

Looking at the post-etl matches dataset (gs://open-targets-pre-data-releases/230123_literature/output/matches/) and the distribution of ingested publications per month, we see a more or less even distribution until August 2022, followed by a drop:

[image: distribution of ingested publications per month]

This plot suggests that something happened after the daily submission was put in place. The outliers can probably be explained by papers with no full date annotation; in such cases the date falls back to January 1st. One could suspect that some entities are not properly recognized, so I stratified the matches by entity type and plotted their distribution across months:

[image: distribution of matches per month, stratified by entity type]

Given the evenly distributed entity types, we can conclude the model itself works just fine.
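
For anyone who wants to repeat this kind of check on a small extract, here is a minimal sketch in plain Python. It is not the code behind the plots above; the records, the `monthly_counts` helper and the `january_first_share` helper are hypothetical and only illustrate counting publications per month and measuring how many dates sit exactly on January 1st (the fallback mentioned above).

```python
from collections import Counter
from datetime import date

# Hypothetical toy records: (pmid, publication date). Real data would come
# from the matches dataset; these values are made up for illustration.
publications = [
    ("p1", date(2022, 1, 1)),
    ("p2", date(2022, 1, 1)),
    ("p3", date(2022, 1, 17)),
    ("p4", date(2022, 7, 3)),
    ("p5", date(2022, 8, 21)),
    ("p6", date(2022, 9, 5)),
]


def monthly_counts(records):
    """Count distinct publications per (year, month)."""
    seen = {(pmid, d.year, d.month) for pmid, d in records}
    return Counter((year, month) for _, year, month in seen)


def january_first_share(records):
    """Fraction of records dated exactly January 1st (possible fallback dates)."""
    if not records:
        return 0.0
    fallback = sum(1 for _, d in records if (d.month, d.day) == (1, 1))
    return fallback / len(records)


counts = monthly_counts(publications)
share = january_first_share(publications)

for (year, month), n in sorted(counts.items()):
    print(f"{year}-{month:02d}: {n}")
print(f"share dated January 1st: {share:.2f}")
```

A high January 1st share in one month would point to the date fallback rather than a real publication spike, which is worth ruling out before reading too much into the outliers.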

To aid the troubleshooting, I looked up specific cases that were missing from our datasets. I chose EGFR as the example: it is a consistently studied target, so missing publications should be easy to find. In our data, EGFR was identified in 9k publications in 2022, whereas EuroPMC had 34k for the same period. Some of the publications that were missing:

Missing publications:

So it seems publications are missing from both the full-text and the abstract-only sets.

tsantosh7 commented 1 year ago

Thank you @DSuveges, for the analysis. I will check on our end.

ireneisdoomed commented 1 year ago

@tsantosh7 Could you please also take a look at the examples provided in #2692 ? A user reported problems with the data before the daily pipeline was set up which I think are also relevant in this context. Thanks!

tsantosh7 commented 1 year ago

Hi @ireneisdoomed I will go through the logs and comment on this thread in a couple of days.

ravwojdyla commented 1 year ago

👋 just curious if there's any updates to this issue, especially in the context of https://github.com/opentargets/issues/issues/2692?

DSuveges commented 1 year ago

> 👋 just curious if there's any updates to this issue, especially in the context of #2692?

A huge effort went into reviewing and updating the literature entity recognition and pipelines. The new data will be out with the 2023.06 release. We are not quite there yet, but it seems the existing literature-related issues are going to be fixed.

ravwojdyla commented 1 year ago

@DSuveges amazing news! Thank you so much for the update!

DSuveges commented 1 year ago

After a thorough check, it appears that silent failures of data processing jobs were the most prevalent cause of missing publications. The notable publications ('36430903', '36318382', '36272856') highlighted in this comment were all present in the new dataset:

Articles in the pre-etl dataset:

+--------+-----+----------+-----------+----------+--------+
|    pmid|pmcid|   pubdate|match_count|cooc_count|   input|
+--------+-----+----------+-----------+----------+--------+
|36318382| null|2022-11-01|        116|        64|abstract|
|36430903| null|2022-11-20|         96|        12|abstract|
|36272856| null|2022-10-19|         16|        10|abstract|
|36272856| null|2022-10-20|         64|        40|abstract|
+--------+-----+----------+-----------+----------+--------+

Articles in the post-etl dataset:

+--------+----------+-----+
|    pmid|   pubDate|count|
+--------+----------+-----+
|36430903|2022-11-01|  240|
|36318382|2022-11-01|   23|
|36272856|2022-10-20|   15|
+--------+----------+-----+

Title matches in the post-etl dataset:

+--------+----------+-------+--------------------+--------------------+---------------+----+
|    pmid|     pmcid|section|               label|              labelN|      keywordId|type|
+--------+----------+-------+--------------------+--------------------+---------------+----+
|36318382|      null|  title|triple negative b...|breastcancernegtripl|    EFO_0005537|  DS|
|36430903|PMC9695078|  title|                EGFR|                egfr|ENSG00000146648|  GP|
|36318382|      null|  title|                EGFR|                egfr|ENSG00000146648|  GP|
|36430903|PMC9695078|  title|               NF-κB|                nfkb|ENSG00000109320|  GP|
|36272856|      null|  title|non-small cell lu...|carcinomacelllung...|    EFO_0003060|  DS|
|36272856|      null|  title|                EGFR|                egfr|ENSG00000146648|  GP|
|36430903|PMC9695078|  title|                 AKT|                 akt|ENSG00000142208|  GP|
+--------+----------+-------+--------------------+--------------------+---------------+----+

Comparing all publications with existing normalized matches in the old and new datasets, we see this:

+-------+----------+
|overlap|     count|
+-------+----------+
|    new| 4_179_869|
|   both|10_531_518|
|    old| 2_885_964|
+-------+----------+

Although the new dataset has ~4M extra publications, we also lose ~3M publications that were found in the old dataset but not in the new one. Interestingly, the number of normalized matches shows only a modest 6% increase, from 420M to 445M.
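
As a rough illustration of how such an overlap table and the percentages can be derived, here is a minimal sketch in plain Python. It is not the code used to produce the table above; `overlap_counts`, `relative_change` and the toy PMID sets are hypothetical, and only the final figures are taken from this comment.

```python
def overlap_counts(old_pmids, new_pmids):
    """Classify publications as only in new, in both, or only in old."""
    old, new = set(old_pmids), set(new_pmids)
    return {
        "new": len(new - old),
        "both": len(new & old),
        "old": len(old - new),
    }


def relative_change(before, after):
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


# Hypothetical toy PMIDs, for illustration only.
toy_old = {"1", "2", "3", "4"}
toy_new = {"3", "4", "5"}
counts = overlap_counts(toy_old, toy_new)
print(counts)

# Figures quoted in the comment above.
net_publications = 4_179_869 - 2_885_964
match_increase = relative_change(420e6, 445e6)
print(f"net gain in publications: {net_publications:,}")
print(f"increase in normalized matches: {match_increase:.1%}")
```

With the quoted figures, the net gain is about 1.3M publications, and the change in normalized matches works out to just under 6%.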

DSuveges commented 2 months ago

This is not an issue anymore.