Closed DSuveges closed 2 months ago
I was reviewing #2685, so I have a few examples of missing PMIDs that might be useful. The total number is around 860k:
+--------+
| pmid|
+--------+
|29297679|
|13664338|
|30700998|
|25191313|
|31908995|
|25503845|
|29540403|
|30979036|
|29518970|
|33653743|
|28643048|
|21691589|
|12494726|
| 4151190|
|28875950|
|30941058|
| 4334754|
|30410073|
|20537126|
|29768257|
+--------+
The full list is here: gs://ot-team/irene/missing_pmid_2852
Code to reproduce:
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Prod co-occurrences
cooc = spark.read.parquet('gs://open-targets-data-releases/22.11/output/literature-etl/parquet/cooccurrences/')
# Potential new co-occurrences
full_text = spark.read.json('gs://otar025-epmc/Full-text/2022_03_23')
abstract = spark.read.json('gs://otar025-epmc/Abstracts/2022_05_27')
new_pipeline_abstract_ids = abstract.select(f.explode("sentences.co-occurrence").alias("s"), "pmid").filter(f.col("s").isNotNull()).select("pmid").distinct().persist()
new_pipeline_ft_ids = full_text.select(f.explode("sentences.co-occurrence").alias("s"), "pmid").filter(f.col("s").isNotNull()).select("pmid").distinct().persist()
new_pipeline_ids = new_pipeline_ft_ids.unionByName(new_pipeline_abstract_ids, allowMissingColumns=True).persist()
# Missing pmids
missing = cooc.select("pmid").join(new_pipeline_ids, on=["pmid"], how="left_anti").distinct().persist()
When we look into the post-etl matches dataset (gs://open-targets-pre-data-releases/230123_literature/output/matches/) and investigate the distribution of ingested publications per month, we see a more or less even distribution until August 2022, followed by a drop:
This plot suggests that something happened after the daily submission was put in place. The outliers can probably be explained by papers with no full date annotation; in such cases the date falls back to January 1st. One could think that some of the entities are not properly recognized, so I stratified the matches by entity type and plotted their distribution across months:
Given the evenly distributed entity types, we can conclude the model itself works just fine.
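The two checks described above (publications per month, then matches per entity type per month) can be sketched in plain Python on toy records. The record layout (`pmid`, `pubDate`, `type`) and the sample rows are assumptions for illustration only; the actual analysis was run with Spark on the matches dataset, and this is not that code.

```python
from collections import Counter

# Hypothetical toy match records; field names are assumed, not the real schema.
matches = [
    {"pmid": "1", "pubDate": "2022-07-14", "type": "GP"},
    {"pmid": "1", "pubDate": "2022-07-14", "type": "DS"},
    {"pmid": "2", "pubDate": "2022-08-02", "type": "GP"},
    {"pmid": "3", "pubDate": "2022-01-01", "type": "DS"},  # possibly a fallback date
    {"pmid": "4", "pubDate": "2022-09-30", "type": "GP"},
]


def month_of(date):
    """Return the YYYY-MM prefix of an ISO date string."""
    return date[:7]


def publications_per_month(records):
    """Count distinct publications (pmids) per publication month."""
    seen = {(r["pmid"], month_of(r["pubDate"])) for r in records}
    return Counter(month for _, month in seen)


def matches_per_month_and_type(records):
    """Count matches per (month, entity type) pair."""
    return Counter((month_of(r["pubDate"]), r["type"]) for r in records)


def january_first_pmids(records):
    """Pmids dated January 1st, where an incomplete date may have fallen back."""
    return {r["pmid"] for r in records if r["pubDate"][5:] == "01-01"}


pubs = publications_per_month(matches)
by_type = matches_per_month_and_type(matches)
jan1 = january_first_pmids(matches)
```

Counting distinct pmids for the first check and raw matches for the second keeps the two questions apart: a drop in `pubs` points at ingestion, while a skew in `by_type` would point at entity recognition.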
To aid the troubleshooting, I looked up specific cases that were missing from our datasets. I chose EGFR as the example: given that it is a consistently studied target, missing publications should be easy to find. In our data, there were 9k publications in 2022 where EGFR was identified, whereas EuroPMC had 34k for the same period.
Missing publications:
36430903: Morusin Protected Ruminal Epithelial Cells against Lipopolysaccharide-Induced Inflammation through Inhibiting EGFR-AKT/NF-κB Signaling and Improving Barrier Functions. 20 Nov 2022 full text: True
36318382: EGFR as a potent CAR T target in triple negative breast cancer brain metastases. 01 Nov 2022 full text: False
36272856: Management of EGFR mutated non-small cell lung carcinoma patients. 20 Oct 2022 full text: False
So it seems publications are missing from both the full-text and the abstract-only pile.
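A minimal sketch of the comparison behind these examples: take the PMIDs Europe PMC returns for a target, subtract the ones present in our data, then split what is left by full-text availability. The `epmc` mapping below holds only the three examples above plus one made-up PMID, marked as such; `ours` and the helper name are hypothetical, and how the real lists were retrieved is not shown here.

```python
# PMID -> full-text flag, as reported for the three examples above.
epmc = {
    "36430903": True,
    "36318382": False,
    "36272856": False,
    "00000000": True,  # hypothetical PMID standing in for a publication we do have
}
# Hypothetical set of PMIDs found in our matches for the same target.
ours = {"00000000"}


def missing_by_full_text(epmc_flags, our_pmids):
    """Group PMIDs absent from our data by whether Europe PMC has full text."""
    groups = {True: [], False: []}
    for pmid, has_full_text in sorted(epmc_flags.items()):
        if pmid not in our_pmids:
            groups[has_full_text].append(pmid)
    return groups


missing = missing_by_full_text(epmc, ours)
# Share of the Europe PMC list that is present in our data.
coverage = len(ours & set(epmc)) / len(epmc)
```

With the figures quoted above (9k of 34k publications), the same ratio would come out at roughly 26%.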
Thank you @DSuveges, for the analysis. I will check on our end.
@tsantosh7 Could you please also take a look at the examples provided in #2692? A user reported problems with the data before the daily pipeline was set up, which I think are also relevant in this context. Thanks!
Hi @ireneisdoomed I will go through the logs and comment on this thread in a couple of days.
👋 just curious if there's any updates to this issue, especially in the context of https://github.com/opentargets/issues/issues/2692?
A huge effort went into reviewing and updating the literature entity recognition and pipelines. The new data will be out with the 2023.06 release. Although we are not quite there yet, it seems the existing literature-related issues are going to be fixed.
@DSuveges amazing news! Thank you so much for the update!
After a thorough check, it appeared that silent failures of data processing jobs were the most prevalent cause of missing publications. The notable publications ('36430903', '36318382', '36272856') highlighted in this comment are all present in the new dataset:
Articles in the pre-etl dataset:
+--------+-----+----------+-----------+----------+--------+
| pmid|pmcid| pubdate|match_count|cooc_count| input|
+--------+-----+----------+-----------+----------+--------+
|36318382| null|2022-11-01| 116| 64|abstract|
|36430903| null|2022-11-20| 96| 12|abstract|
|36272856| null|2022-10-19| 16| 10|abstract|
|36272856| null|2022-10-20| 64| 40|abstract|
+--------+-----+----------+-----------+----------+--------+
Articles in the post-etl dataset:
+--------+----------+-----+
| pmid| pubDate|count|
+--------+----------+-----+
|36430903|2022-11-01| 240|
|36318382|2022-11-01| 23|
|36272856|2022-10-20| 15|
+--------+----------+-----+
Title matches in the post-etl dataset:
+--------+----------+-------+--------------------+--------------------+---------------+----+
| pmid| pmcid|section| label| labelN| keywordId|type|
+--------+----------+-------+--------------------+--------------------+---------------+----+
|36318382| null| title|triple negative b...|breastcancernegtripl| EFO_0005537| DS|
|36430903|PMC9695078| title| EGFR| egfr|ENSG00000146648| GP|
|36318382| null| title| EGFR| egfr|ENSG00000146648| GP|
|36430903|PMC9695078| title| NF-κB| nfkb|ENSG00000109320| GP|
|36272856| null| title|non-small cell lu...|carcinomacelllung...| EFO_0003060| DS|
|36272856| null| title| EGFR| egfr|ENSG00000146648| GP|
|36430903|PMC9695078| title| AKT| akt|ENSG00000142208| GP|
+--------+----------+-------+--------------------+--------------------+---------------+----+
When we compare all publications with existing normalized matches, we see this:
+-------+----------+
|overlap| count|
+-------+----------+
| new| 4_179_869|
| both|10_531_518|
| old| 2_885_964|
+-------+----------+
Although the new dataset has ~4M extra publications, we also lose ~3M publications that were found in the old dataset but not in the new one. Interestingly, the number of normalized matches shows only a modest 6% increase, from 420M to 445M.
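For reference, the overlap labels and the percentages quoted above can be reproduced with simple set logic and arithmetic. The counts are copied from the table; the `classify` helper and the toy PMID sets are illustrative assumptions, not the code used for the comparison.

```python
def classify(old_pmids, new_pmids):
    """Label every PMID as 'old' only, 'new' only, or present in 'both'."""
    labels = {}
    for pmid in old_pmids | new_pmids:
        if pmid in old_pmids and pmid in new_pmids:
            labels[pmid] = "both"
        elif pmid in new_pmids:
            labels[pmid] = "new"
        else:
            labels[pmid] = "old"
    return labels


# Toy example with made-up PMIDs.
labels = classify({"1", "2", "3"}, {"2", "3", "4"})

# Counts from the overlap table above.
new_only, both, old_only = 4_179_869, 10_531_518, 2_885_964
old_total = both + old_only            # publications in the old dataset
new_total = both + new_only            # publications in the new dataset
net_gain = new_total - old_total       # net change in publication count
lost_fraction = old_only / old_total   # share of old publications not found again

# Normalized matches, in millions, as quoted above.
match_increase = (445 - 420) / 420
```

On these numbers the new dataset is a net ~1.3M publications larger, while about 21.5% of the publications in the old dataset are not found in the new one.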
This is not an issue anymore.
There's a well-founded assumption that the new, incrementally released EPMC data does not cover all publications. We need to identify examples that we are certain were not covered in the release, so the EPMC team can investigate them.
As a start we'll work on the processed dataset generated by the literature etl:
gs://open-targets-data-releases/22.11/output/etl/parquet/
Also, once we are certain, I'll go back to verify the EPMC output:
gs://otar025-epmc/