Open DSuveges opened 2 months ago
I found what's causing the records to be dropped: it happens in the disambiguate function in the Grounding.scala
file, specifically in this filter:
filter(
  col(minDistinctKeywordsPerLabelPerPubOverKeywordPerPub) <= col(
    minDistinctKeywordsPerLabelOverKeywordOverallPubs
  )
)
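The KRAS records don't meet this criterion: for each of them the left-hand value (2) is greater than the right-hand value (1), so the condition is false and the rows are filtered out.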
labelN | isMapped | minDistinctKeywordsPerLabelPerPubOverKeywordPerPub | minDistinctKeywordsPerLabelOverKeywordOverallPubs |
---|---|---|---|
kras | true | 2 | 1 |
kras | true | 2 | 1 |
kras | true | 2 | 1 |
kras | true | 2 | 1 |
kras | true | 2 | 1 |
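
To make the effect on the rows above concrete, here is a minimal sketch in plain Scala (no Spark) that applies the same comparison to the five KRAS rows. The `Row` case class, the shortened field names `perPub` and `overall`, and the `FilterSketch` object are hypothetical and exist only for this illustration; the real code in Grounding.scala works on DataFrame columns and may differ in detail.

```scala
// Hypothetical row type mirroring the four columns of the table above.
final case class Row(
    labelN: String,
    isMapped: Boolean,
    perPub: Int, // stands in for minDistinctKeywordsPerLabelPerPubOverKeywordPerPub
    overall: Int // stands in for minDistinctKeywordsPerLabelOverKeywordOverallPubs
)

object FilterSketch {
  // Same comparison as the quoted filter: a row is kept only if perPub <= overall.
  def passesFilter(r: Row): Boolean = r.perPub <= r.overall

  // The five KRAS rows from the table (all identical).
  val krasRows: Seq[Row] =
    Seq.fill(5)(Row("kras", isMapped = true, perPub = 2, overall = 1))

  val kept: Seq[Row] = krasRows.filter(passesFilter)
  val dropped: Seq[Row] = krasRows.filterNot(passesFilter)

  def main(args: Array[String]): Unit =
    println(s"kept=${kept.size} dropped=${dropped.size}") // prints "kept=0 dropped=5"
}
```

Because 2 <= 1 is false for every row, none of the KRAS records survive the filter, even though isMapped is true for all of them.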
Oh, this is not good. We need to understand exactly what is going on and change that function. I know some details about it: it was prototyped for Miguel by @tskir back in the day.
@DSuveges directed me to the original issue and to the original implementation proposal, written in PySpark by @tskir. I'll be reviewing those to get more information on the feature.
Describe the bug
A drug/target connection is missing from the literature dataset. The EuroPMC paper 32955176 is available as an abstract that mentions both KRAS and sotorasib.
However, this connection is not listed in the platform.
When investigating the case, I could identify the input JSON dataset:
gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-33.jsonl
These are the entities identified by the EuroPMC pipeline:
However, if I look at the ETL output, neither the mapped[1] nor the failed matches[2] dataset contains KRAS:
[1] gs://open-targets-data-releases/24.03/output/etl/parquet/literature/matches
[2] gs://open-targets-data-releases/24.03/output/etl/parquet/literature/failedMatches
Expectation
All matches identified by the EuroPMC pipeline are expected to be in either the failed matches or the mapped matches dataset.
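
The expectation above can be phrased as a simple set check: every input entity should appear in the union of the two output datasets. The sketch below uses plain Scala sets rather than Spark. The `MatchKey` type (publication id plus label), the `missing` helper and the toy values are all assumptions made for illustration; they are not taken from the ETL code or from the real datasets.

```scala
object CoverageSketch {
  // Hypothetical key identifying one entity match: (publication id, normalised label).
  type MatchKey = (String, String)

  // Input entities that appear in neither the mapped nor the failed output.
  // Under the stated expectation this set should always be empty.
  def missing(
      input: Set[MatchKey],
      mapped: Set[MatchKey],
      failed: Set[MatchKey]
  ): Set[MatchKey] =
    input -- (mapped ++ failed)

  def main(args: Array[String]): Unit = {
    // Made-up placeholder data, for illustration only.
    val input: Set[MatchKey] = Set(("pub-1", "label-a"), ("pub-1", "label-b"))
    val mapped: Set[MatchKey] = Set(("pub-1", "label-b"))
    val failed: Set[MatchKey] = Set.empty

    // "label-a" is in neither output, so it is reported as missing.
    println(missing(input, mapped, failed))
  }
}
```

On the real data the same idea would likely be expressed as an anti-join of the input entities against the union of the matches and failedMatches datasets; a non-empty result would point to records silently dropped along the way, as happens with KRAS here.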