Open ireneisdoomed opened 4 weeks ago
I'm also for option 2. Just for clarification: although the clinical trials page lists a large number of conditions, only the conditions that are mapped to the same EFO are expected to be collected into the cohortPhenotypes lists, right?
Regarding the implementation: we are already touching the ChEMBL evidence to add the stop reason categories. Is there a plan to move that part of the evidence generation to ChEMBL? If not, and if there's good reason to assume the planned aggregation would take them a long time, we can implement it ourselves.
cc @FionaEBI We want to solve this issue by aggregating the evidence on all the unique fields and collecting all the related conditions in a list. What do you think? Please raise any concerns. Happy to discuss offline!
@ireneisdoomed Your suggestion of aggregating the granularity of the CT conditions that are all mapped to the same EFO_id seems to make sense to me. But I'm happy to have a meeting to discuss any details. Do you need us to change some detail of what we deliver to OT? Or will this happen entirely at OT after we have delivered the data? Thanks, Fiona
@FionaEBI We'd like to receive this from you so we keep the logic on our side to a minimum.
Describe the bug
The metrics for the upcoming release show that 19,000 more evidence are marked as duplicate compared to 24.03. This made us look at them more closely, and we found that we are marking evidence as duplicate when in fact it is not.
This is not a new problem: 12% of the evidence in the March release was also dropped due to duplication.
Observed behaviour
An example:
The fields for drug, target, disease and NCT IDs are all the same for both rows; the only difference is diseaseFromSource, where the two different stages of Mucoepidermoid Carcinoma map to the same ID.
Possible solutions
1. Treat diseaseFromSource as a unique field and show all evidence.
2. Collect the source conditions into cohortPhenotypes. We have followed this approach to deal with a similar situation with ClinVar evidence. We would need to coordinate with ChEMBL.
I'm in favour of option 2. I think it is good to display the granularity if we have it, but I'd avoid having multiple rows where 99% of the information is the same.
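For illustration, the aggregation discussed for option 2 could look roughly like the sketch below. It is not the ChEMBL or Open Targets implementation: the field names in UNIQUE_FIELDS, the helper aggregate_evidence, and the placeholder row values are all assumptions made up for this example. Only diseaseFromSource and cohortPhenotypes are names taken from this thread.

```python
from collections import defaultdict

# Hypothetical names for the fields that identify an evidence string;
# the real evidence schema may use different names or more fields.
UNIQUE_FIELDS = ("drugId", "targetId", "diseaseId", "nctId")


def aggregate_evidence(rows):
    """Collapse rows that agree on UNIQUE_FIELDS into a single row.

    The distinct diseaseFromSource labels of the collapsed rows are
    collected, in first-seen order, into a cohortPhenotypes list.
    """
    grouped = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in UNIQUE_FIELDS)
        label = row["diseaseFromSource"]
        if label not in grouped[key]:
            grouped[key].append(label)
    return [
        {**dict(zip(UNIQUE_FIELDS, key)), "cohortPhenotypes": labels}
        for key, labels in grouped.items()
    ]


# Placeholder rows: identical on every unique field, differing only in
# the source condition label (values are invented, not real evidence).
rows = [
    {"drugId": "DRUG_1", "targetId": "TARGET_1", "diseaseId": "EFO_X",
     "nctId": "NCT_1", "diseaseFromSource": "placeholder condition, stage A"},
    {"drugId": "DRUG_1", "targetId": "TARGET_1", "diseaseId": "EFO_X",
     "nctId": "NCT_1", "diseaseFromSource": "placeholder condition, stage B"},
]

aggregated = aggregate_evidence(rows)
```

With this shape, the two rows become one evidence string whose cohortPhenotypes keeps both source labels, so nothing is dropped as a duplicate and the granularity is still displayed.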