Open DSuveges opened 12 months ago
There's an expectation that the number of disease entities times the number of gene entities identified in a sentence equals the number of gene/disease cooccurrences from the same sentence. When looking at a chunk of full-text input (gs://otar025-epmc/ml02/fulltext/2023_05_30/NMP_patch-29-05-2023-201.jsonl), I saw that out of 2,612 sentences with disease/target cooccurrences, this expectation was not met in 816 sentences. In all cases, the number of cooccurrences was lower than expected, never higher. When only one disease and one gene were identified in a sentence, the number of cooccurrences was correct.
It seems there might be a bug or an unexpected side effect in how the co-occurrences are computed from the matches, as suggested by the following examples:
34886853
Comparing the dia...
ABSTRACT
matches:
+--------+----------+-------------------------------+----+
|pmid |pmcid |label |type|
+--------+----------+-------------------------------+----+
|34886853|PMC8656033|tumor |DS |
|34886853|PMC8656033|tumor |DS |
|34886853|PMC8656033|squamous cell carcinoma antigen|GP |
|34886853|PMC8656033|SCCA |GP |
|34886853|PMC8656033|carbohydrate antigen 125 |GP |
|34886853|PMC8656033|CA125 |GP |
|34886853|PMC8656033|lymph node metastasis |DS |
|34886853|PMC8656033|cervical cancer |DS |
+--------+----------+-------------------------------+----+
cooccurrences:
+--------+----------+-------------------------------+---------------------+-----+
|pmid |pmcid |label1 |label2 |type |
+--------+----------+-------------------------------+---------------------+-----+
|34886853|PMC8656033|squamous cell carcinoma antigen|tumor |GP-DS|
|34886853|PMC8656033|squamous cell carcinoma antigen|lymph node metastasis|GP-DS|
|34886853|PMC8656033|squamous cell carcinoma antigen|cervical cancer |GP-DS|
|34886853|PMC8656033|SCCA |lymph node metastasis|GP-DS|
|34886853|PMC8656033|SCCA |cervical cancer |GP-DS|
|34886853|PMC8656033|carbohydrate antigen 125 |lymph node metastasis|GP-DS|
|34886853|PMC8656033|carbohydrate antigen 125 |cervical cancer |GP-DS|
|34886853|PMC8656033|CA125 |lymph node metastasis|GP-DS|
|34886853|PMC8656033|CA125 |cervical cancer |GP-DS|
+--------+----------+-------------------------------+---------------------+-----+
35119481
Angiogenesis play..
INTRO
matches:
+--------+----------+----------------------------------+----+
|pmid |pmcid |label |type|
+--------+----------+----------------------------------+----+
|35119481|PMC8940827|tumor |DS |
|35119481|PMC8940827|vascular endothelial growth factor|GP |
|35119481|PMC8940827|VEGF |GP |
|35119481|PMC8940827|fibroblast growth factor |GP |
|35119481|PMC8940827|FGF |GP |
|35119481|PMC8940827|HCC |DS |
+--------+----------+----------------------------------+----+
cooccurrences:
+--------+----------+----------------------------------+------+-----+
|pmid |pmcid |label1 |label2|type |
+--------+----------+----------------------------------+------+-----+
|35119481|PMC8940827|vascular endothelial growth factor|tumor |GP-DS|
|35119481|PMC8940827|vascular endothelial growth factor|HCC |GP-DS|
|35119481|PMC8940827|VEGF |HCC |GP-DS|
|35119481|PMC8940827|fibroblast growth factor |HCC |GP-DS|
|35119481|PMC8940827|FGF |HCC |GP-DS|
+--------+----------+----------------------------------+------+-----+
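To make the mismatch concrete, here is a minimal sketch of the check for PMID 35119481, assuming the expected set is the full cross product of GP and DS labels in the sentence. The labels are copied from the two tables above; the variable names are illustrative only and are not taken from any pipeline code.

```python
from itertools import product

# Matches for PMID 35119481 (INTRO sentence), copied from the matches table.
matches = [
    ("tumor", "DS"),
    ("vascular endothelial growth factor", "GP"),
    ("VEGF", "GP"),
    ("fibroblast growth factor", "GP"),
    ("FGF", "GP"),
    ("HCC", "DS"),
]

# Cooccurrence rows reported for the same sentence (label1 = GP, label2 = DS).
observed = {
    ("vascular endothelial growth factor", "tumor"),
    ("vascular endothelial growth factor", "HCC"),
    ("VEGF", "HCC"),
    ("fibroblast growth factor", "HCC"),
    ("FGF", "HCC"),
}

genes = [label for label, kind in matches if kind == "GP"]
diseases = [label for label, kind in matches if kind == "DS"]

# Assumption: every gene label should pair with every disease label.
expected = set(product(genes, diseases))
missing = sorted(expected - observed)

print(f"expected={len(expected)} observed={len(observed)}")
for gene, disease in missing:
    print(f"missing: {gene} -> {disease}")
```

For this sentence the cross product has 8 pairs while only 5 were reported, and each of the three missing pairs involves the label tumor.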
@tsantosh7, we had a chat within the team and are considering updating our pipelines so we can generate the cooccurrences in house. Is there any particular logic to do it, or are we expecting to get all possible entity pairs from a sentence? So no action is expected on your side.
An exploratory implementation of the logic is available (gist here). The process performs quite well even in this unoptimized form and yields 59.2M disease/target cooccurrences compared to the 39.5M we have in production. Although this is almost a 50% increase, I would expect a much more modest increase in the amount of evidence and an even smaller increase in the number of associations. Unfortunately, I cannot provide numbers for this increase, because the evidence generation operates on scores (currently assigned by the EPMC team's pipeline), which I could not recapitulate.
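For readers without access to the gist, the "all possible entity pairs per sentence" logic could look roughly like the sketch below. This is not the gist's code: the function name `cooccurrences`, the `sentence_id` key and the sample rows are all hypothetical, and it assumes each match can be attributed to exactly one sentence.

```python
from collections import defaultdict
from itertools import product


def cooccurrences(matches):
    """Yield (sentence_id, gene_label, disease_label) for every GP x DS pair."""
    by_sentence = defaultdict(lambda: {"GP": [], "DS": []})
    for sentence_id, label, kind in matches:
        # Other entity types are ignored in this sketch.
        if kind in ("GP", "DS"):
            by_sentence[sentence_id][kind].append(label)
    for sentence_id, groups in by_sentence.items():
        for gene, disease in product(groups["GP"], groups["DS"]):
            yield sentence_id, gene, disease


# Hypothetical rows: (sentence_id, label, type). The ids are made up.
rows = [
    ("s1", "VEGF", "GP"),
    ("s1", "FGF", "GP"),
    ("s1", "HCC", "DS"),
    ("s1", "tumor", "DS"),
    ("s2", "CA125", "GP"),
    ("s2", "cervical cancer", "DS"),
]

pairs = list(cooccurrences(rows))
print(len(pairs))
```

Note that a label matched twice in a sentence would produce duplicate pairs here; whether such duplicates should be collapsed is left open in this sketch.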
@DSuveges, it seems there is an issue with the implemented logic. We need to investigate.
Describe the bug
There's a publication in EuroPMC, with this sentence in the abstract:
In this sentence, a number of disease and target labels are identified, based on which a number of disease-to-target evidence should be generated, e.g. PORCN to epilepsy; however, there's no association between these two entities on the OT Platform. When looking into the details, I could identify the chunk this publication was ingested from:
gs://otar025-epmc/ml02/fulltext/2023_05_30/NMP_patch-29-05-2023-201.jsonl
Apparently most sensible entities were identified in that sentence:
Based on the types, I would expect to get 4 cooccurrences between PORCN and the four disease/syndrome type entities. However, there's just one:
Important to note
Questions:
Exploratory work