Closed ravwojdyla closed 1 year ago
Hi @ravwojdyla!
The FDA references that appear to be lost now link out to DailyMed instead. So the reference is not lost, it is simply provided by a different resource.
You can see it clearly in this example. This is a 21.11 evidence record for warfarin and atrial fibrillation. The reference is to FDA.
datasourceId | chembl
targetId | ENSG00000167397
clinicalPhase | 4
datatypeId | known_drug
diseaseFromSourceMappedId | EFO_0000275
drugId | CHEMBL1200879
targetFromSource | CHEMBL1930
targetFromSourceId | Q9BQB6
urls | [{FDA, https://api.fda.gov/drug/label.json?search=set_id:6a0068ba-19bd-4711-b894-868fd007a613}]
diseaseId | EFO_0000275
score | 1.0
sourceId | chembl
This is the corresponding 22.06 record, where the reference is to DailyMed.
datasourceId | chembl
targetId | ENSG00000167397
clinicalPhase | 4
datatypeId | known_drug
diseaseFromSourceMappedId | EFO_0000275
drugId | CHEMBL1200879
studyStopReasonCategories | null
targetFromSource | CHEMBL1930
targetFromSourceId | Q9BQB6
urls | [{DailyMed, https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=6a0068ba-19bd-4711-b894-868fd007a613}]
diseaseId | EFO_0000275
id | a32f68ed197f5d9c8a49e65b122105926a4676ff
score | 1.0
sourceId | chembl
OpenFDA and DailyMed expose the same medicinal product labels that have been submitted to FDA. The content of DailyMed is provided by FDA, so the only difference is the URL. Note that the unique label identifier is the set_id: 6a0068ba-19bd-4711-b894-868fd007a613.
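Because both URL styles carry the same label identifier, the two references can be reduced to a bare set ID before comparing them. A minimal sketch in plain Python, assuming only the two URL shapes shown in the records above (the helper name `extract_set_id_from_url` is hypothetical and not part of any pipeline):

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_set_id_from_url(url: str) -> Optional[str]:
    """Return the label set ID from an OpenFDA or DailyMed URL, else None.

    Hypothetical helper: it assumes only the two URL shapes seen in the
    21.11 and 22.06 evidence records.
    """
    query = parse_qs(urlparse(url).query)
    # DailyMed: ...drugInfo.cfm?setid=<uuid>
    if "setid" in query:
        return query["setid"][0]
    # OpenFDA: ...label.json?search=set_id:<uuid>
    for term in query.get("search", []):
        if term.startswith("set_id:"):
            return term.split(":", 1)[1]
    return None


fda_url = "https://api.fda.gov/drug/label.json?search=set_id:6a0068ba-19bd-4711-b894-868fd007a613"
dailymed_url = "https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=6a0068ba-19bd-4711-b894-868fd007a613"

# Compare the identifiers extracted from the two reference styles.
print(extract_set_id_from_url(fda_url) == extract_set_id_from_url(dailymed_url))
```

Parsing the query string rather than splitting on a single character keeps the helper from picking up the `https:` colon or unrelated `=` signs, at the cost of being tied to these two URL layouts.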
So the set ID is a good way of comparing the references. I compared these set IDs across the two releases to make sure we weren't losing references, and found only one lost record, which might be affected by the newer automatic DailyMed pipeline. This has been shared with my colleagues at ChEMBL. Here is the PySpark snippet, in case it is useful:
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

evidence_2111_path = '21.11/evidence/sourceId=chembl/'
evidence_2206_path = '22.06/evidence/sourceId=chembl/'

def extract_set_id(path: str) -> DataFrame:
    """Read evidence from a Parquet path and extract the label set ID from each reference URL."""
    return (
        spark.read.parquet(path)
        # One row per reference URL
        .withColumn('url', F.explode('urls'))
        .withColumn(
            'set_id',
            # OpenFDA URLs end in "set_id:<uuid>"
            F.when(
                (F.col('url.niceName') == 'FDA') & (F.col('url.url').contains('set_id')),
                F.element_at(F.split('url.url', ':'), -1),
            )
            # DailyMed URLs end in "setid=<uuid>"
            .when(
                (F.col('url.niceName') == 'DailyMed') & (F.col('url.url').contains('setid')),
                F.element_at(F.split('url.url', '='), -1),
            ),
        )
    )

def compare_setids(path1: str, path2: str) -> DataFrame:
    """Return the (drugId, set_id) pairs from path1 whose set_id is absent from path2."""
    df1 = extract_set_id(path1).filter(F.col('set_id').isNotNull()).select('drugId', 'set_id').distinct()
    df2 = extract_set_id(path2).filter(F.col('set_id').isNotNull()).select('drugId', 'set_id').distinct()
    return df1.join(df2, on='set_id', how='left_anti')
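To make the comparison concrete: the left anti-join in `compare_setids` keeps the pairs from the older release whose set ID has no match in the newer one. A toy sketch of the same logic in plain Python, assuming small in-memory lists instead of Parquet data (the function name `left_anti_on_set_id` and the placeholder pair are illustrative, not real evidence):

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (drugId, set_id)


def left_anti_on_set_id(old: Iterable[Pair], new: Iterable[Pair]) -> List[Pair]:
    """Keep the distinct pairs from `old` whose set_id does not appear in `new`."""
    new_set_ids = {set_id for _, set_id in new}
    return sorted({pair for pair in old if pair[1] not in new_set_ids})


# Toy inputs: the first pair is the warfarin record from this thread;
# the second is a made-up placeholder used only to show a "lost" reference.
old_pairs = [
    ("CHEMBL1200879", "6a0068ba-19bd-4711-b894-868fd007a613"),
    ("CHEMBL_PLACEHOLDER", "placeholder-set-id"),
]
new_pairs = [("CHEMBL1200879", "6a0068ba-19bd-4711-b894-868fd007a613")]

lost = left_anti_on_set_id(old_pairs, new_pairs)
print(lost)
```

An empty result would mean that every set ID referenced in the older release is still referenced in the newer one, regardless of whether the link points to FDA or DailyMed.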
Note that the unique label identifier is the set_id: 6a0068ba-19bd-4711-b894-868fd007a613.
Ah, great to know about the shared setid! This is very helpful and great to hear that this is not an issue. Thank you again @ireneisdoomed!
I'm comparing releases 21.11 and 22.06 as part of an update process, and have noticed a substantial decrease in the number of FDA-approved drugs:
When looking at the unique (drug ID, source) pairs:
My first thought was that maybe this was related to this change: https://blog.opentargets.org/open-targets-platform-22-04-release/#druglabels? Except https://github.com/opentargets/issues/issues/2567 would suggest otherwise, given the stats in https://github.com/opentargets/issues/issues/2567#issuecomment-1112155823. So this seems like a change between 22.04 and 22.06? Is this expected?