Genetics API dropping credset results

JarrodBaker commented 2 years ago

Of the ~60m rows in the v2d_credset data, an estimated ~21m records are not able to be displayed in the front-end due to an API/data bug.

The v2d_credset contains a field phenotype_id which sometimes contains an ENSG ID value. The API assumes that this column always contains an ENSG ID and uses it to map to variants. When this field is not an ENSG ID the mapping cannot occur.

Estimated number of records effected by data type:

┌─data_type─┬──count()─┐
│ eqtl      │  4522843 │
│ pqtl      │    53132 │
│ sqtl      │ 20989335 │
└───────────┴──────────┘

The sqtl records will be resolved in opentargets/platform#2688 .

As an example, consider:

SELECT DISTINCT
    study_id,
    phenotype_id,
    lead_variant_id,
    data_type
FROM v2d_credset
WHERE startsWith(phenotype_id, 'ILMN') AND (lead_variant_id = '4_270769_G_A')
LIMIT 10

Query id: 58dcaedb-2d52-4579-a7ee-07f552e4e32f

┌─study_id─┬─phenotype_id─┬─lead_variant_id─┬─data_type─┐
│ CEDAR    │ ILMN_1719158 │ 4_270769_G_A    │ eqtl      │
└──────────┴──────────────┴─────────────────┴───────────┘

Navigating to the web UI (https://genetics.opentargets.org/variant/4_270769_G_A) we can see that there is no listed eQTL data for this study.

Ticket #2688 will add a column gene_id to the database.

This only solves the issues of sQTLs, however all molecular traits should have an ENSG ID to allow these to be displayed on the FE.

Currently the credset data is used as an input to the coloc data pipeline. coloc uses the phenotype_id information to annotate the record with a gene_id. This annotation should be moved up-stream to credset (and potentially further) so that we can reliably link credset data to genes.

No change should be required on the FE or API. The requested changes in opentargets/platform#2688 should automatically apply to this change.

This issue effects production as well as the development environment.

ireneisdoomed commented 2 years ago

For the time being the credible set (pre ETL) has been patched in the same way as reported in #2670 for coloc.

New output: gs://genetics-portal-dev-staging/finemapping/220224_merged/credset Old output: gs://genetics-portal-dev-staging/finemapping/220224_merged/credset_obsolete

The snippet:

credset_path = 'gs://genetics-portal-dev-staging/finemapping/220224_merged/credset_obsolete'
phenotype_gene_path = 'gs://ot-team/dochoa/phenotype_id_gene_luts/'

credset = spark.read.json(credset_path)
phenotype_gene_lut = spark.read.csv(phenotype_gene_path, sep='\t', header=True).select(F.col('phenotype_id').alias('right_phenotype_id'), F.col('gene_id').alias('right_gene_id'))

credset_fix = (
    credset.join(
        phenotype_gene_lut,
        credset['phenotype_id'].eqNullSafe(phenotype_gene_lut['right_phenotype_id']),
        how='left',
    )
    .drop('right_phenotype_id')
    .withColumn(
        'gene_id',
        F.when(F.col('right_gene_id').isNull(), F.regexp_extract('phenotype_id', r'(ENSG\d+)', 1)).otherwise(
            F.col('right_gene_id')
        ),
    )
    .drop('right_gene_id', 'phenotype_id')
    .distinct()
)

credset_fix.repartition(200).write.option('compression', 'gzip').json('gs://genetics-portal-dev-staging/finemapping/220224_merged/credset')

Some checks:

credset.count()
>> 60090436

credset_fix.count() # Again, slight increase due to the 1:M phenotype/gene relationship
>> 60095556

credset_fix.filter(F.col('type') != 'gwas').filter(F.col('gene_id').isNull()).count()
>> 0

credset_fix.filter(F.col('type') == 'gwas').filter(F.col('gene_id').isNull()).count()
>> 5077645

credset_fix.schema == credset.schema
>> False

In this case the fix results in a different schema because we are dropping phenotype_id in favour of gene_id.

credset_fix.printSchema()
>> root
 |-- bio_feature: string (nullable = true)
 |-- is95_credset: boolean (nullable = true)
 |-- is99_credset: boolean (nullable = true)
 |-- lead_alt: string (nullable = true)
 |-- lead_chrom: string (nullable = true)
 |-- lead_pos: long (nullable = true)
 |-- lead_ref: string (nullable = true)
 |-- lead_variant_id: string (nullable = true)
 |-- logABF: double (nullable = true)
 |-- multisignal_method: string (nullable = true)
 |-- postprob: double (nullable = true)
 |-- postprob_cumsum: double (nullable = true)
 |-- study_id: string (nullable = true)
 |-- tag_alt: string (nullable = true)
 |-- tag_beta: double (nullable = true)
 |-- tag_beta_cond: double (nullable = true)
 |-- tag_chrom: string (nullable = true)
 |-- tag_pos: long (nullable = true)
 |-- tag_pval: double (nullable = true)
 |-- tag_pval_cond: double (nullable = true)
 |-- tag_ref: string (nullable = true)
 |-- tag_se: double (nullable = true)
 |-- tag_se_cond: double (nullable = true)
 |-- tag_variant_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- gene_id: string (nullable = true)

JarrodBaker commented 2 years ago

we are dropping phenotype_id in favour of gene_id.

@ireneisdoomed we want to keep the phenotype_id field, and add the gene_id field. We use the phenotype_id field in the FE.

ireneisdoomed commented 2 years ago

@JarrodBaker File is fixed now. gene_id and phenotype_id are complete. Thanks ^^

@xyg123 Since the credible set is one of L2G's inputs, this means that we will have to rerun L2G, right?

d0choa commented 1 year ago

We will work on this in the new pipeline

opentargets / issues

Genetics API dropping credset results #2689