opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Substantially smaller number of FDA approved drugs in 22.06 #2694

Closed: ravwojdyla closed this issue 1 year ago

ravwojdyla commented 1 year ago

I'm comparing releases 21.11 and 22.06 as part of our update process, and have noticed a substantial decrease in the number of FDA approved drugs:

source_name      21.11      22.06   pct_change
ctgov          463,775    486,943   0.04995525848
atc              4,430      4,482   0.01173814898
fda             21,166      1,476   (0.9302655202)
dailymed         6,348    103,692   15.33459357
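
For reference, the pct_change column appears to be the fractional change (22.06 - 21.11) / 21.11, shown in parentheses when negative, rather than a percentage. A minimal sketch that reproduces it from the counts above; the helper name `relative_change` is made up for illustration:

```python
def relative_change(old: int, new: int) -> float:
    """Fractional change from old to new, e.g. -0.93 for a 93% drop."""
    return (new - old) / old


# Row counts per source, copied from the table above as (21.11, 22.06).
counts = {
    "ctgov": (463_775, 486_943),
    "atc": (4_430, 4_482),
    "fda": (21_166, 1_476),
    "dailymed": (6_348, 103_692),
}

changes = {name: relative_change(old, new) for name, (old, new) in counts.items()}

for name, change in changes.items():
    # A leading "-" here corresponds to the parenthesised values in the table.
    print(f"{name:<10}{change:+.4f}")
```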

Counting unique (drug ID, source) pairs instead:

source_name   21.11   22.06   pct_change
atc            1227    1243   0.01
ctgov          3481    3710   0.07
dailymed        835    1367   0.64
fda             740     360   (0.51)

My first thought was that maybe this was related to this change: https://blog.opentargets.org/open-targets-platform-22-04-release/#druglabels? Except https://github.com/opentargets/issues/issues/2567 would suggest otherwise, given the stats in https://github.com/opentargets/issues/issues/2567#issuecomment-1112155823. So this seems like a change between 22.04 and 22.06? Is this expected?

ireneisdoomed commented 1 year ago

Hi @ravwojdyla!

The FDA references that appear to be missing now link out to DailyMed. So the reference is not lost; it is simply provided by a different resource.

You can see it clearly in this example. This is a 21.11 evidence record for Warfarin and atrial fibrillation. The reference is to FDA.

 datasourceId              | chembl
 targetId                  | ENSG00000167397
 clinicalPhase             | 4
 datatypeId                | known_drug
 diseaseFromSourceMappedId | EFO_0000275
 drugId                    | CHEMBL1200879
 targetFromSource          | CHEMBL1930
 targetFromSourceId        | Q9BQB6
 urls                      | [{FDA, https://api.fda.gov/drug/label.json?search=set_id:6a0068ba-19bd-4711-b894-868fd007a613}]
 diseaseId                 | EFO_0000275
 score                     | 1.0
 sourceId                  | chembl

This is the corresponding 22.06 record, where the reference is to DailyMed.

 datasourceId              | chembl
 targetId                  | ENSG00000167397
 clinicalPhase             | 4
 datatypeId                | known_drug
 diseaseFromSourceMappedId | EFO_0000275
 drugId                    | CHEMBL1200879
 studyStopReasonCategories | null
 targetFromSource          | CHEMBL1930
 targetFromSourceId        | Q9BQB6
 urls                      | [{DailyMed, https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=6a0068ba-19bd-4711-b894-868fd007a613}]
 diseaseId                 | EFO_0000275
 id                        | a32f68ed197f5d9c8a49e65b122105926a4676ff
 score                     | 1.0
 sourceId                  | chembl

OpenFDA and DailyMed both serve the same medicinal product labels that have been submitted to the FDA. The content of DailyMed is provided by the FDA, so the only difference is the URL. Note that the unique label identifier is the set_id: 6a0068ba-19bd-4711-b894-868fd007a613.
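
To make the shared identifier concrete, here is a minimal plain-Python sketch that pulls the set ID out of both URL shapes shown in the two records above. The function name `set_id_from_url` is hypothetical, and it assumes the URLs always follow the `search=set_id:<id>` (openFDA) or `setid=<id>` (DailyMed) patterns seen in this example:

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def set_id_from_url(url: str) -> Optional[str]:
    """Return the label set ID embedded in an openFDA or DailyMed URL, if any."""
    query = parse_qs(urlparse(url).query)

    # DailyMed style: ...drugInfo.cfm?setid=<id>
    if "setid" in query:
        return query["setid"][0]

    # openFDA style: ...label.json?search=set_id:<id>
    for term in query.get("search", []):
        field, _, value = term.partition(":")
        if field == "set_id" and value:
            return value

    return None


# The two URLs from the Warfarin records above.
fda_url = "https://api.fda.gov/drug/label.json?search=set_id:6a0068ba-19bd-4711-b894-868fd007a613"
dailymed_url = "https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=6a0068ba-19bd-4711-b894-868fd007a613"

same_label = set_id_from_url(fda_url) == set_id_from_url(dailymed_url)
print(same_label)
```

Both URLs resolve to the same set ID, which is why the two records can be treated as the same reference.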

So the set_id is a good key for comparing the references across releases. I compared the set IDs between the two releases to make sure we weren't losing references, and found only one lost record, which might be affected by the newer automatic DailyMed pipeline. I have shared this with my colleagues at ChEMBL. Here is the PySpark snippet, in case it is useful:

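from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

evidence_2111_path = '21.11/evidence/sourceId=chembl/'
evidence_2206_path = '22.06/evidence/sourceId=chembl/'

def extract_set_id(path: str) -> DataFrame:
    """Read the evidence under `path` and add a `set_id` column parsed from each reference URL."""
    return (
        spark.read.parquet(path)
        # One row per reference URL.
        .withColumn('url', F.explode('urls'))
        .withColumn(
            'set_id',
            # FDA URLs end in `set_id:<id>`; DailyMed URLs end in `setid=<id>`.
            F.when(
                (F.col('url.niceName') == 'FDA') & (F.col('url.url').contains('set_id')),
                F.element_at(F.split('url.url', ':'), -1)
            )
            .when(
                (F.col('url.niceName') == 'DailyMed') & (F.col('url.url').contains('setid')),
                F.element_at(F.split('url.url', '='), -1)
            )
        )
    )

def compare_setids(path1: str, path2: str) -> DataFrame:
    """Return the (drugId, set_id) pairs under `path1` whose set_id is absent under `path2`."""
    df1 = extract_set_id(path1).filter(F.col('set_id').isNotNull()).select('drugId', 'set_id').distinct()
    df2 = extract_set_id(path2).filter(F.col('set_id').isNotNull()).select('drugId', 'set_id').distinct()

    return df1.join(df2, on='set_id', how='left_anti')

# References present in 21.11 but missing from 22.06.
lost_references = compare_setids(evidence_2111_path, evidence_2206_path)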
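
For readers less familiar with Spark, the `left_anti` join in `compare_setids` keeps only the rows from the first input whose `set_id` has no match in the second. A toy plain-Python equivalent is sketched below; the drug IDs and set IDs are made-up placeholders, not real data from either release:

```python
from typing import List, Tuple

# Hypothetical (drugId, set_id) pairs standing in for the two releases.
old_rows: List[Tuple[str, str]] = [
    ("DRUG_A", "set-1"),
    ("DRUG_B", "set-2"),
    ("DRUG_C", "set-3"),
]
new_rows: List[Tuple[str, str]] = [
    ("DRUG_A", "set-1"),
    ("DRUG_C", "set-3"),
]

# Collect the set IDs that still exist in the newer release.
new_set_ids = {set_id for _, set_id in new_rows}

# Anti-join: keep old rows whose set_id is absent from the newer release.
lost = [row for row in old_rows if row[1] not in new_set_ids]
print(lost)
```

An empty result would mean every label referenced in the older release is still referenced in the newer one, regardless of whether the link points at FDA or DailyMed.
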
ravwojdyla commented 1 year ago

> Note that the unique label identifier is the set_id: 6a0068ba-19bd-4711-b894-868fd007a613.

Ah, great to know about the shared setid! This is very helpful and great to hear that this is not an issue. Thank you again @ireneisdoomed!