Substantially smaller number of FDA approved drugs in 22.06

ravwojdyla commented 1 year ago

I'm comparing releases 21.11 and 22.06 as part of our update process, and I've noticed a substantial decrease in the number of FDA approved drugs:

source_name	21.11	22.06	pct_change
ctgov	463,775	486,943	0.04995525848
atc	4,430	4,482	0.01173814898
fda	21,166	1,476	(0.9302655202)
dailymed	6,348	103,692	15.33459357
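For reference, the pct_change column reads as the fractional change relative to 21.11 (not a percentage), with parentheses marking a decrease. A minimal sketch of that arithmetic, using the counts from the table above; the function name `fractional_change` is illustrative and not taken from any pipeline:

```python
def fractional_change(old: int, new: int) -> float:
    """Change from `old` to `new` as a fraction of `old` (negative means a decrease)."""
    return (new - old) / old


# Counts per source copied from the table above: (21.11, 22.06).
counts = {
    "ctgov": (463_775, 486_943),
    "atc": (4_430, 4_482),
    "fda": (21_166, 1_476),
    "dailymed": (6_348, 103_692),
}

for source, (old, new) in counts.items():
    # Signed output, so the fda decrease shows as a negative number
    # rather than in parentheses.
    print(f"{source}\t{fractional_change(old, new):+.4f}")
```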

Looking at unique (drug ID, source) pairs:

source_name	21.11	22.06	pct_change
atc	1227	1243	0.01
ctgov	3481	3710	0.07
dailymed	835	1367	0.64
fda	740	360	(0.51)

My first thought was that maybe this was related to this change: https://blog.opentargets.org/open-targets-platform-22-04-release/#druglabels? Except https://github.com/opentargets/issues/issues/2567 would suggest otherwise, given the stats in https://github.com/opentargets/issues/issues/2567#issuecomment-1112155823. So this seems like a change between 22.04 and 22.06? Is this expected?

ireneisdoomed commented 1 year ago

Hi @ravwojdyla!

The FDA references that appear to be lost now link out to DailyMed instead. So the reference is not lost; it is simply provided by a different resource.

You can see this clearly in the following example. This is a 21.11 evidence record for warfarin and atrial fibrillation. The reference points to FDA.

 datasourceId              | chembl
 targetId                  | ENSG00000167397
 clinicalPhase             | 4
 datatypeId                | known_drug
 diseaseFromSourceMappedId | EFO_0000275
 drugId                    | CHEMBL1200879
 targetFromSource          | CHEMBL1930
 targetFromSourceId        | Q9BQB6
 urls                      | [{FDA, https://api.fda.gov/drug/label.json?search=set_id:6a0068ba-19bd-4711-b894-868fd007a613}]
 diseaseId                 | EFO_0000275
 score                     | 1.0
 sourceId                  | chembl

This is the corresponding 22.06 record, where the reference points to DailyMed.

 datasourceId              | chembl
 targetId                  | ENSG00000167397
 clinicalPhase             | 4
 datatypeId                | known_drug
 diseaseFromSourceMappedId | EFO_0000275
 drugId                    | CHEMBL1200879
 studyStopReasonCategories | null
 targetFromSource          | CHEMBL1930
 targetFromSourceId        | Q9BQB6
 urls                      | [{DailyMed, https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=6a0068ba-19bd-4711-b894-868fd007a613}]
 diseaseId                 | EFO_0000275
 id                        | a32f68ed197f5d9c8a49e65b122105926a4676ff
 score                     | 1.0
 sourceId                  | chembl

openFDA and DailyMed both serve the same medicinal product labels that have been submitted to the FDA. The content of DailyMed is provided by the FDA, so the only difference is the URL. Note that the unique label identifier is the set_id: 6a0068ba-19bd-4711-b894-868fd007a613.
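As a minimal sketch of that point, the two URL forms from the records above can be reduced to the same set ID in plain Python. The helper name `label_set_id` is hypothetical, and the parsing assumes URLs shaped exactly like the two examples in this thread:

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def label_set_id(url: str) -> Optional[str]:
    """Return the label set ID embedded in an openFDA or DailyMed URL, else None.

    Hypothetical helper: it only knows the two URL shapes shown in this thread.
    """
    query = parse_qs(urlparse(url).query)
    # DailyMed form: ...drugInfo.cfm?setid=<uuid>
    if "setid" in query:
        return query["setid"][0]
    # openFDA form: ...label.json?search=set_id:<uuid>
    for term in query.get("search", []):
        if term.startswith("set_id:"):
            return term.split(":", 1)[1]
    return None


fda_url = "https://api.fda.gov/drug/label.json?search=set_id:6a0068ba-19bd-4711-b894-868fd007a613"
dailymed_url = "https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=6a0068ba-19bd-4711-b894-868fd007a613"

# Both references resolve to the same label, so this prints True.
print(label_set_id(fda_url) == label_set_id(dailymed_url))
```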

So set IDs are a good way of comparing the references. I compared the set IDs across both releases to make sure we weren't losing references, and found only one lost record, which might be caused by the newer automated DailyMed pipeline. I have shared this with my colleagues at ChEMBL. Here is the PySpark snippet, in case it is useful:

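from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

evidence_2111_path = '21.11/evidence/sourceId=chembl/'
evidence_2206_path = '22.06/evidence/sourceId=chembl/'

def extract_set_id(path: str) -> DataFrame:
    """Read the evidence under `path` and extract the label set ID from each FDA or DailyMed URL."""
    return (
        spark.read.parquet(path)
        # One row per URL entry
        .withColumn('url', F.explode('urls'))
        .withColumn(
            'set_id',
            # FDA URLs end in `set_id:<id>`
            F.when(
                (F.col('url.niceName') == 'FDA') & (F.col('url.url').contains('set_id')),
                F.element_at(F.split('url.url', ':'), -1)
            )
            # DailyMed URLs end in `setid=<id>`
            .when(
                (F.col('url.niceName') == 'DailyMed') & (F.col('url.url').contains('setid')),
                F.element_at(F.split('url.url', '='), -1)
            )
        )
    )

def compare_setids(old_path: str, new_path: str) -> DataFrame:
    """Return the (drugId, set_id) pairs under `old_path` whose set_id is absent under `new_path`."""
    df1 = extract_set_id(old_path).filter(F.col('set_id').isNotNull()).select('drugId', 'set_id').distinct()
    df2 = extract_set_id(new_path).filter(F.col('set_id').isNotNull()).select('drugId', 'set_id').distinct()

    # left_anti keeps only the rows of df1 that have no matching set_id in df2
    return df1.join(df2, on='set_id', how='left_anti')

# Set IDs referenced in 21.11 that no longer appear in 22.06
compare_setids(evidence_2111_path, evidence_2206_path).show(truncate=False)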
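The left anti join in the snippet keeps only the rows from the first release whose set_id has no match in the second. A minimal sketch of the same logic with plain Python sets, which runs without Spark; the function name `left_anti_on_set_id` and the second pair below are made up purely for illustration:

```python
# (drugId, set_id) pairs per release. The first pair comes from the records in
# this thread; the second is hypothetical and exists only to show a "lost" label.
old_pairs = {
    ("CHEMBL1200879", "6a0068ba-19bd-4711-b894-868fd007a613"),
    ("CHEMBL_EXAMPLE", "00000000-0000-0000-0000-000000000000"),
}
new_pairs = {
    ("CHEMBL1200879", "6a0068ba-19bd-4711-b894-868fd007a613"),
}


def left_anti_on_set_id(left, right):
    """Keep the pairs from `left` whose set_id does not occur anywhere in `right`."""
    right_ids = {set_id for _, set_id in right}
    return {pair for pair in left if pair[1] not in right_ids}


lost = left_anti_on_set_id(old_pairs, new_pairs)
print(lost)
```

Only the hypothetical pair survives the comparison, because the warfarin label is still referenced in the newer set, just under a different URL.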

ravwojdyla commented 1 year ago

> Note that the unique label identifier is the set_id: 6a0068ba-19bd-4711-b894-868fd007a613.

Ah, great to know about the shared set ID! This is very helpful, and it's great to hear that this is not an issue. Thank you again @ireneisdoomed!

Substantially smaller number of FDA approved drugs in 22.06 #2694