12% of ChEMBL evidence are duplicates

ireneisdoomed commented 4 weeks ago

Describe the bug

The metrics for the upcoming release show 19,000 more evidence marked as duplicate than in 24.03. Looking at them more closely, we found that we are marking evidence as duplicate when in fact it is not.

This is not a new problem: 12% of the evidence in the March release was also dropped due to duplication.

Observed behaviour

An example:

# Evidence set
 datasourceId              | chembl
 targetId                  | ENSG00000027075
 clinicalPhase             | 1.0
 clinicalStatus            | Completed
 datatypeId                | known_drug
 diseaseFromSource         | Stage IV Mucoepidermoid Carcinoma of the Oral Cavity
 diseaseFromSourceMappedId | MONDO_0044964
 drugId                    | CHEMBL574737
 studyStartDate            | 2001-12-01
 targetFromSource          | CHEMBL2093867
 targetFromSourceId        | P24723
 urls                      | [{ClinicalTrials, https://clinicaltrials.gov/study/NCT00031681}]
 diseaseId                 | MONDO_0044964
 id                        | 391630b73b67fa1ab5fe7de3f319fa0af780ef27
 score                     | 0.1
 variantEffect             | LoF
 directionOnTrait          | protect

# Invalid evidence set
 datasourceId              | chembl
 targetId                  | ENSG00000027075
 clinicalPhase             | 1.0
 clinicalStatus            | Completed
 datatypeId                | known_drug
 diseaseFromSource         | Recurrent Mucoepidermoid Carcinoma of the Oral Cavity
 diseaseFromSourceMappedId | MONDO_0044964
 drugId                    | CHEMBL574737
 studyStartDate            | 2001-12-01
 targetFromSource          | CHEMBL2093867
 targetFromSourceId        | P24723
 urls                      | [{ClinicalTrials, https://clinicaltrials.gov/study/NCT00031681}]
 resolvedTarget            | true
 resolvedDisease           | true
 diseaseId                 | MONDO_0044964
 excludedBiotype           | false
 id                        | 391630b73b67fa1ab5fe7de3f319fa0af780ef27
 score                     | 0.1
 nullifiedScore            | false
 markedDuplicate           | true
 variantEffect             | LoF
 directionOnTrait          | protect

The drug, target, disease and NCT ID fields are identical in both rows. The only difference is diseaseFromSource, where two different stages of Mucoepidermoid Carcinoma map to the same disease ID.
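
To make the collision concrete, here is a minimal sketch in plain Python. The issue does not list the fields the pipeline actually uses to detect duplicates, so `KEY_FIELDS` and `mark_duplicates` are hypothetical names chosen for illustration. The two rows are abbreviated copies of the example above.

```python
# Assumption: this key is illustrative only; the real pipeline's unique
# fields are not stated in this issue.
KEY_FIELDS = ("datasourceId", "targetId", "drugId", "diseaseId",
              "clinicalPhase", "clinicalStatus", "url")

rows = [
    {"datasourceId": "chembl", "targetId": "ENSG00000027075",
     "drugId": "CHEMBL574737", "diseaseId": "MONDO_0044964",
     "clinicalPhase": 1.0, "clinicalStatus": "Completed",
     "url": "https://clinicaltrials.gov/study/NCT00031681",
     "diseaseFromSource": "Stage IV Mucoepidermoid Carcinoma of the Oral Cavity"},
    {"datasourceId": "chembl", "targetId": "ENSG00000027075",
     "drugId": "CHEMBL574737", "diseaseId": "MONDO_0044964",
     "clinicalPhase": 1.0, "clinicalStatus": "Completed",
     "url": "https://clinicaltrials.gov/study/NCT00031681",
     "diseaseFromSource": "Recurrent Mucoepidermoid Carcinoma of the Oral Cavity"},
]


def mark_duplicates(rows, key_fields):
    """Flag every row whose key was already seen in an earlier row."""
    seen = set()
    marked = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        marked.append({**row, "markedDuplicate": key in seen})
        seen.add(key)
    return marked


# Without diseaseFromSource in the key, the second row collides with the first.
flags_without = [r["markedDuplicate"] for r in mark_duplicates(rows, KEY_FIELDS)]

# With diseaseFromSource in the key, both rows survive.
flags_with = [r["markedDuplicate"]
              for r in mark_duplicates(rows, KEY_FIELDS + ("diseaseFromSource",))]
```

With the key as assumed here, the second row is flagged only when diseaseFromSource is left out of the key, which matches the markedDuplicate value observed above.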

Possible solutions

1. Consider diseaseFromSource as a unique field and show all evidence.
2. Aggregate the granular phenotype descriptions into cohortPhenotypes. We followed this approach to deal with a similar situation in ClinVar evidence. We would need to coordinate with ChEMBL.
3. Do nothing. Perhaps it is not so important to reproduce every condition mentioned in the clinical trial.

I'm in favour of option 2. I think it is good to display the granularity if we have it, but I'd avoid having multiple rows where 99% of the information is the same.
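
A rough sketch of what the aggregation could look like, again in plain Python rather than the actual pipeline code. `KEY_FIELDS` and `aggregate_cohort_phenotypes` are hypothetical, and dropping diseaseFromSource from the merged row is an assumption made here for simplicity, not something decided in this issue.

```python
# Assumption: illustrative key only; the real set of unique fields is
# not stated in this issue.
KEY_FIELDS = ("datasourceId", "targetId", "drugId", "diseaseId", "url")

rows = [
    {"datasourceId": "chembl", "targetId": "ENSG00000027075",
     "drugId": "CHEMBL574737", "diseaseId": "MONDO_0044964",
     "url": "https://clinicaltrials.gov/study/NCT00031681",
     "diseaseFromSource": "Stage IV Mucoepidermoid Carcinoma of the Oral Cavity"},
    {"datasourceId": "chembl", "targetId": "ENSG00000027075",
     "drugId": "CHEMBL574737", "diseaseId": "MONDO_0044964",
     "url": "https://clinicaltrials.gov/study/NCT00031681",
     "diseaseFromSource": "Recurrent Mucoepidermoid Carcinoma of the Oral Cavity"},
]


def aggregate_cohort_phenotypes(rows, key_fields):
    """Merge rows sharing the same key, collecting their source conditions."""
    grouped = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in grouped:
            # Assumption: the merged row drops diseaseFromSource and keeps
            # the per-condition detail only in cohortPhenotypes.
            merged = {k: v for k, v in row.items() if k != "diseaseFromSource"}
            merged["cohortPhenotypes"] = []
            grouped[key] = merged
        phenotypes = grouped[key]["cohortPhenotypes"]
        if row["diseaseFromSource"] not in phenotypes:
            phenotypes.append(row["diseaseFromSource"])
    return list(grouped.values())


aggregated = aggregate_cohort_phenotypes(rows, KEY_FIELDS)
```

Under these assumptions the two example rows collapse into a single evidence row whose cohortPhenotypes list holds both condition names, so nothing has to be marked as duplicate.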

DSuveges commented 3 weeks ago

I'm also for option 2. Just for clarification: although the clinical trials page lists a large number of conditions, only the conditions mapped to the same EFO are expected to be collected into the cohortPhenotypes list, right?

Regarding the implementation, we are already touching the ChEMBL evidence to add the stop reason categories. Is there a plan to move that part of the evidence generation to ChEMBL? If not, and if there's good reason to assume the planned aggregation would take them a long time, we can implement it ourselves.

ireneisdoomed commented 3 weeks ago

Exactly
We are not planning to migrate that part. However, this is a small change. According to their release cycle, we should expect a submission in August, so we could have it fixed for our Sept. release.

cc @FionaEBI We want to solve this issue by aggregating the evidence on all the unique fields and collecting all the related conditions in a list. What do you think? Please raise any concerns. Happy to discuss offline!

FionaEBI commented 3 weeks ago

@ireneisdoomed Your suggestion of aggregating the granularity of the CT conditions that are all mapped to the same EFO_id seems to make sense to me. But I'm happy to have a meeting to discuss any details. Do you need us to change some detail of what we deliver to OT? Or will this happen entirely at OT after we have delivered the data? Thanks, Fiona

ireneisdoomed commented 3 weeks ago

@FionaEBI We'd like to receive this from you so we keep the logic on our side to a minimum.

12% of ChEMBL evidence are duplicates #3328