Missing literature matches from ETL output

DSuveges commented 2 months ago

Describe the bug

A drug/target connection is missing from the literature dataset. The EuroPMC paper with PMID 32955176 is available as an abstract, and the abstract mentions both KRAS and sotorasib:

Sotorasib is a small molecule that selectively and irreversibly targets KRASG12C.

However, this connection is not listed in the platform.

While investigating the case, I identified the input JSON dataset: gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-33.jsonl

These are the entities identified by the EuroPMC pipeline:

+--------+---------------------------+----+---------------+-------------+
|pmid    |label                      |type|startInSentence|endInSentence|
+--------+---------------------------+----+---------------+-------------+
|32955176|KRAS                       |GP  |0              |4            |
|32955176|Solid Tumors               |DS  |58             |70           |
|32955176|KRAS                       |GP  |49             |53           |
|32955176|cancer                     |DS  |71             |77           |
|32955176|KRAS                       |GP  |7              |11           |
|32955176|non-small-cell lung cancers|DS  |49             |76           |
|32955176|NSCLCs                     |DS  |78             |84           |
|32955176|colorectal cancers         |DS  |104            |122          |
|32955176|cancers                    |DS  |133            |140          |
|32955176|KRAS                       |GP  |72             |76           |
|32955176|solid tumors               |DS  |176            |188          |
|32955176|KRAS                       |GP  |206            |210          |
|32955176|sotorasib                  |CD  |18             |27           |
|32955176|NSCLC                      |DS  |212            |217          |
|32955176|colorectal cancer          |DS  |227            |244          |
|32955176|tumors                     |DS  |264            |270          |
|32955176|metastatic disease         |DS  |96             |114          |
|32955176|NSCLC                      |DS  |21             |26           |
|32955176|colorectal cancer          |DS  |21             |38           |
|32955176|appendiceal cancers        |DS  |75             |94           |
|32955176|melanoma                   |DS  |99             |107          |
|32955176|solid tumors               |DS  |222            |234          |
|32955176|KRAS                       |GP  |252            |256          |
+--------+---------------------------+----+---------------+-------------+

However, if I look at the ETL output, neither the mapped[1] nor the failed matches[2] dataset contains KRAS:

+--------+-----+----+--------------------+--------------------+-------------+--------+
|    pmid|pmcid|type|               label|              labelN|    keywordId|isMapped|
+--------+-----+----+--------------------+--------------------+-------------+--------+
|32955176| null|  CD|           sotorasib|           sotorasib|CHEMBL4535757|    true|
|32955176| null|  DS|              cancer|              cancer|MONDO_0004992|    true|
|32955176| null|  DS|             cancers|              cancer|MONDO_0004992|    true|
|32955176| null|  DS|  metastatic disease|      diseasmetastat|  EFO_0009709|    true|
|32955176| null|  DS|              NSCLCs|               nsclc|  EFO_0003060|    true|
|32955176| null|  DS|               NSCLC|               nsclc|  EFO_0003060|    true|
|32955176| null|  DS|               NSCLC|               nsclc|  EFO_0003060|    true|
|32955176| null|  DS|non-small-cell lu...|cancercelllungnon...|  EFO_0003060|    true|
|32955176| null|  DS|            melanoma|            melanoma|  EFO_0000756|    true|
|32955176| null|  DS|              tumors|               tumor|  EFO_0000616|    true|
|32955176| null|  DS|        solid tumors|                null|         null|   false|
|32955176| null|  DS|        solid tumors|                null|         null|   false|
|32955176| null|  DS|        Solid Tumors|                null|         null|   false|
|32955176| null|  DS| appendiceal cancers|                null|         null|   false|
+--------+-----+----+--------------------+--------------------+-------------+--------+

[1] gs://open-targets-data-releases/24.03/output/etl/parquet/literature/matches
[2] gs://open-targets-data-releases/24.03/output/etl/parquet/literature/failedMatches

Expectation

All matches identified by the EuroPMC pipeline are expected to be in either the failed matches or the mapped matches dataset.
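This expectation can be checked mechanically. Below is a minimal sketch in plain Python, assuming each entity is reduced to a (pmid, type, label) tuple. The `missing_entities` helper is hypothetical and not part of the ETL code, and the sample rows are only a subset of the tables above.

```python
# Hypothetical helper, not part of the ETL: entities are (pmid, type, label) tuples.
def missing_entities(epmc_entities, mapped, failed):
    """Return input entities that appear in neither output dataset."""
    accounted_for = set(mapped) | set(failed)
    return sorted(set(epmc_entities) - accounted_for)


# A few rows copied from the tables in this report (not the full datasets).
epmc = [
    ("32955176", "GP", "KRAS"),
    ("32955176", "CD", "sotorasib"),
    ("32955176", "DS", "melanoma"),
    ("32955176", "DS", "solid tumors"),
]
mapped = [
    ("32955176", "CD", "sotorasib"),
    ("32955176", "DS", "melanoma"),
]
failed = [("32955176", "DS", "solid tumors")]

print(missing_entities(epmc, mapped, failed))
# → [('32955176', 'GP', 'KRAS')]
```

An empty result would mean every input entity is accounted for. Here the KRAS entity is reported as lost.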

remo87 commented 2 months ago

I found what's causing the records to be dropped: it happens in the disambiguate function in the Grounding.scala file. The records are dropped by this filter:

filter(
  col(minDistinctKeywordsPerLabelPerPubOverKeywordPerPub) <= col(
    minDistinctKeywordsPerLabelOverKeywordOverallPubs
  )
)

The KRAS records don't meet this criterion:

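+------+--------+--------------------------------------------------+-------------------------------------------------+
|labelN|isMapped|minDistinctKeywordsPerLabelPerPubOverKeywordPerPub|minDistinctKeywordsPerLabelOverKeywordOverallPubs|
+------+--------+--------------------------------------------------+-------------------------------------------------+
|kras  |true    |2                                                 |1                                                |
|kras  |true    |2                                                 |1                                                |
|kras  |true    |2                                                 |1                                                |
|kras  |true    |2                                                 |1                                                |
|kras  |true    |2                                                 |1                                                |
+------+--------+--------------------------------------------------+-------------------------------------------------+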
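To see why these rows disappear, here is a minimal sketch of the comparison in plain Python. It mirrors only the `<=` predicate quoted above, not the actual disambiguate function. The `Row` type and `split_by_filter` helper are hypothetical names for illustration, and the values are taken from the table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    label_n: str
    per_pub: int       # minDistinctKeywordsPerLabelPerPubOverKeywordPerPub
    overall_pubs: int  # minDistinctKeywordsPerLabelOverKeywordOverallPubs


def split_by_filter(rows):
    """Split rows into those passing the `per_pub <= overall_pubs` test and the rest."""
    kept = [r for r in rows if r.per_pub <= r.overall_pubs]
    dropped = [r for r in rows if not r.per_pub <= r.overall_pubs]
    return kept, dropped


# The five KRAS rows reported above: 2 <= 1 is false for each of them.
kras_rows = [Row("kras", 2, 1)] * 5
kept, dropped = split_by_filter(kras_rows)
print(len(kept), len(dropped))
# → 0 5
```

All five rows fail the test. Since they are also marked isMapped = true, they would not be expected in the failed matches dataset, which is consistent with KRAS appearing in neither output.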

DSuveges commented 2 months ago

Oh, this is not good. We need to understand exactly what is going on and change that function. I know some details about it: it was prototyped for Miguel by @tskir back in the day.

remo87 commented 2 months ago

@DSuveges directed me to the original issue and to the original implementation proposal in PySpark form by @tskir. I'll review both to get more information on the feature.

Missing literature matches from ETL output #3285