opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Cooccurrences are not identified from sentences, although entities are identified #3174

Open DSuveges opened 7 months ago

DSuveges commented 7 months ago

Describe the bug

There's a publication in EuroPMC, with this sentence in the abstract:

We report two cases: one girl suffering from typical skin and skeletal abnormalities, developmental delay, microcephaly, thin corpus callosum, periventricular gliosis and drug-resistant epilepsy caused by a PORCN nonsense-mutation (c.283C > T, p.Arg95Ter).

In this sentence a number of disease and target labels are identified, based on which a number of disease-to-target evidence should be generated, e.g. PORCN to epilepsy. However, there's no association between these two entities on the OT Platform.

When looking into the details, I could identify the chunk this publication was ingested from: gs://otar025-epmc/ml02/fulltext/2023_05_30/NMP_patch-29-05-2023-201.jsonl

Apparently most sensible entities were identified in that sentence:

+--------+--------------------+--------+-------------------+----+
|    pmid|                text| section|              label|type|
+--------+--------------------+--------+-------------------+----+
|35101074|We report two cas...|ABSTRACT|    periventricular|  DS|
|35101074|We report two cas...|ABSTRACT|developmental delay|  DS|
|35101074|We report two cas...|ABSTRACT|       microcephaly|  DS|
|35101074|We report two cas...|ABSTRACT|              PORCN|  GP|
|35101074|We report two cas...|ABSTRACT|           epilepsy|  DS|
+--------+--------------------+--------+-------------------+----+

Based on the types, I would expect to get 4 cooccurrences between PORCN and the four disease/syndrome type entities. However, there's just one:

+--------+--------------------+--------+------+-------------------+
|    pmid|                text| section|label1|             label2|
+--------+--------------------+--------+------+-------------------+
|35101074|We report two cas...|ABSTRACT| PORCN|developmental delay|
+--------+--------------------+--------+------+-------------------+
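To make the gap concrete, here is a minimal plain-Python sketch that enumerates the pairs I would expect for this sentence and compares them with what was found. It assumes every gene/disease pair within a sentence should be reported, which is my expectation and not a confirmed rule of the EPMC pipeline; the variable names are just for illustration.

```python
from itertools import product

# Entities identified in the abstract sentence of PMID 35101074 (from the table above).
matches = [
    ("periventricular", "DS"),
    ("developmental delay", "DS"),
    ("microcephaly", "DS"),
    ("PORCN", "GP"),
    ("epilepsy", "DS"),
]

# The only cooccurrence actually reported for that sentence.
observed = {("PORCN", "developmental delay")}

genes = [label for label, kind in matches if kind == "GP"]
diseases = [label for label, kind in matches if kind == "DS"]

# Assumption: all gene/disease combinations in the sentence are expected.
expected = set(product(genes, diseases))
missing = sorted(expected - observed)

print(len(expected), "expected,", len(observed), "observed")
for gene, disease in missing:
    print("missing:", gene, "-", disease)
```

Under that assumption the pairs with epilepsy, microcephaly and periventricular are the ones that never made it into the cooccurrence table.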

Important to note:

Questions:

Exploratory work

DSuveges commented 7 months ago

There's an expectation that the number of disease entities times the number of gene entities identified in a sentence equals the number of gene/disease cooccurrences from the same sentence. When looking at a chunk of full text input (gs://otar025-epmc/ml02/fulltext/2023_05_30/NMP_patch-29-05-2023-201.jsonl), I saw that out of 2,612 sentences with disease/target cooccurrences, this expectation was not met in 816 sentences. In all cases, the number of cooccurrences was lower than expected, never higher. When one disease and one gene were identified in a sentence, the number of cooccurrences was correct.
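The check described above can be sketched roughly as follows. This is plain Python and not the Spark code I actually ran; the `audit` function and the tuple layouts are hypothetical, and it assumes the expectation is computed over distinct labels per sentence.

```python
from collections import Counter, defaultdict

def audit(matches, cooccurrences):
    """Return {sentence_id: (expected, observed)} GP-DS pair counts.

    matches:       iterable of (sentence_id, label, type), type "GP" or "DS"
    cooccurrences: iterable of (sentence_id, gene_label, disease_label)
    """
    labels = defaultdict(lambda: {"GP": set(), "DS": set()})
    for sid, label, kind in matches:
        if kind in ("GP", "DS"):
            labels[sid][kind].add(label)  # distinct labels only

    # Count distinct reported pairs per sentence.
    observed = Counter(sid for sid, _, _ in set(cooccurrences))

    return {
        sid: (len(by_type["GP"]) * len(by_type["DS"]), observed[sid])
        for sid, by_type in labels.items()
    }

# PMID 34886853, using the pmid as a stand-in sentence id.
sid = "34886853"
matches = [
    (sid, "tumor", "DS"), (sid, "tumor", "DS"),
    (sid, "squamous cell carcinoma antigen", "GP"), (sid, "SCCA", "GP"),
    (sid, "carbohydrate antigen 125", "GP"), (sid, "CA125", "GP"),
    (sid, "lymph node metastasis", "DS"), (sid, "cervical cancer", "DS"),
]
cooccurrences = [
    (sid, "squamous cell carcinoma antigen", "tumor"),
    (sid, "squamous cell carcinoma antigen", "lymph node metastasis"),
    (sid, "squamous cell carcinoma antigen", "cervical cancer"),
    (sid, "SCCA", "lymph node metastasis"),
    (sid, "SCCA", "cervical cancer"),
    (sid, "carbohydrate antigen 125", "lymph node metastasis"),
    (sid, "carbohydrate antigen 125", "cervical cancer"),
    (sid, "CA125", "lymph node metastasis"),
    (sid, "CA125", "cervical cancer"),
]

result = audit(matches, cooccurrences)
print(result)
```

For this sentence the sketch gives 12 expected versus 9 observed pairs. Note that "tumor" is matched twice, so counting mentions instead of distinct labels would raise the expected number further.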

DSuveges commented 7 months ago

It seems there might be a bug or unexpected side effect in how the cooccurrences are computed from the matches, as suggested by looking into some examples:

Example 1:

matches:

+--------+----------+-------------------------------+----+
|pmid    |pmcid     |label                          |type|
+--------+----------+-------------------------------+----+
|34886853|PMC8656033|tumor                          |DS  |
|34886853|PMC8656033|tumor                          |DS  |
|34886853|PMC8656033|squamous cell carcinoma antigen|GP  |
|34886853|PMC8656033|SCCA                           |GP  |
|34886853|PMC8656033|carbohydrate antigen 125       |GP  |
|34886853|PMC8656033|CA125                          |GP  |
|34886853|PMC8656033|lymph node metastasis          |DS  |
|34886853|PMC8656033|cervical cancer                |DS  |
+--------+----------+-------------------------------+----+

cooccurrences:

+--------+----------+-------------------------------+---------------------+-----+
|pmid    |pmcid     |label1                         |label2               |type |
+--------+----------+-------------------------------+---------------------+-----+
|34886853|PMC8656033|squamous cell carcinoma antigen|tumor                |GP-DS|
|34886853|PMC8656033|squamous cell carcinoma antigen|lymph node metastasis|GP-DS|
|34886853|PMC8656033|squamous cell carcinoma antigen|cervical cancer      |GP-DS|
|34886853|PMC8656033|SCCA                           |lymph node metastasis|GP-DS|
|34886853|PMC8656033|SCCA                           |cervical cancer      |GP-DS|
|34886853|PMC8656033|carbohydrate antigen 125       |lymph node metastasis|GP-DS|
|34886853|PMC8656033|carbohydrate antigen 125       |cervical cancer      |GP-DS|
|34886853|PMC8656033|CA125                          |lymph node metastasis|GP-DS|
|34886853|PMC8656033|CA125                          |cervical cancer      |GP-DS|
+--------+----------+-------------------------------+---------------------+-----+

Example 2:

matches:

+--------+----------+----------------------------------+----+
|pmid    |pmcid     |label                             |type|
+--------+----------+----------------------------------+----+
|35119481|PMC8940827|tumor                             |DS  |
|35119481|PMC8940827|vascular endothelial growth factor|GP  |
|35119481|PMC8940827|VEGF                              |GP  |
|35119481|PMC8940827|fibroblast growth factor          |GP  |
|35119481|PMC8940827|FGF                               |GP  |
|35119481|PMC8940827|HCC                               |DS  |
+--------+----------+----------------------------------+----+

cooccurrences:

+--------+----------+----------------------------------+------+-----+
|pmid    |pmcid     |label1                            |label2|type |
+--------+----------+----------------------------------+------+-----+
|35119481|PMC8940827|vascular endothelial growth factor|tumor |GP-DS|
|35119481|PMC8940827|vascular endothelial growth factor|HCC   |GP-DS|
|35119481|PMC8940827|VEGF                              |HCC   |GP-DS|
|35119481|PMC8940827|fibroblast growth factor          |HCC   |GP-DS|
|35119481|PMC8940827|FGF                               |HCC   |GP-DS|
+--------+----------+----------------------------------+------+-----+

DSuveges commented 6 months ago

@tsantosh7, we had a chat within the team and are considering updating our pipelines so we can generate the cooccurrences in house. Is there any particular logic to it, or are we expecting to get all possible entity pairs from a sentence? No action on your side is expected.

DSuveges commented 6 months ago

An exploratory implementation of the logic (gist here). The process performs quite well even in this unoptimized form and yields 59.2M disease/target cooccurrences compared to the 39.5M we have in production. Although this is almost a 50% increase, I would expect a much more modest increase in the number of evidence and an even smaller increase in the number of associations. Unfortunately, I cannot provide numbers for this increase, because the evidence generation operates on scores (currently assigned by the EPMC team's pipeline), which I could not recapitulate.
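The gist is not reproduced here. As a rough illustration of the "all possible pairs" idea only, a plain-Python sketch could look like the following; it is not the gist's code, `generate_cooccurrences` is a hypothetical name, and the pmid is used as a stand-in for a real sentence identifier.

```python
from itertools import groupby, product
from operator import itemgetter

def generate_cooccurrences(matches):
    """Yield (sentence_id, gene_label, disease_label) for every GP/DS pair.

    matches: iterable of (sentence_id, label, type), type "GP" or "DS".
    Assumption: all distinct gene/disease combinations in a sentence count.
    """
    by_sentence = itemgetter(0)
    for sid, group in groupby(sorted(matches, key=by_sentence), key=by_sentence):
        rows = list(group)
        genes = sorted({label for _, label, kind in rows if kind == "GP"})
        diseases = sorted({label for _, label, kind in rows if kind == "DS"})
        for gene, disease in product(genes, diseases):
            yield (sid, gene, disease)

# Matches for PMID 35119481, taken from the second example in this issue.
sid = "35119481"
matches = [
    (sid, "tumor", "DS"),
    (sid, "vascular endothelial growth factor", "GP"),
    (sid, "VEGF", "GP"),
    (sid, "fibroblast growth factor", "GP"),
    (sid, "FGF", "GP"),
    (sid, "HCC", "DS"),
]

pairs = list(generate_cooccurrences(matches))
print(len(pairs), "pairs generated")
```

For that sentence this yields 8 pairs, where the current data has 5. Scores are not handled at all in this sketch.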

tsantosh7 commented 5 months ago

@DSuveges seems like there is an issue with the logic implemented. Need to investigate