opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Update gnomAD genetic constraint dataset in platform to use version 4 #3317

Open d0choa opened 4 months ago

d0choa commented 4 months ago

The Open Targets Platform has been using gnomAD 2.1.1 as the input for the genetic constraint data (see the [PIS config]).

It would be good to update it to use at least gnomAD 4.1, e.g. gs://gcp-public-data--gnomad/release/4.1/constraint/gnomad.v4.1.constraint_metrics.tsv. Without looking too closely at the schemas, I wouldn't expect this to take much work.

From a data perspective, I noticed that the gnomAD website doesn't show genetic constraint for v4.1, although it did for v2.1. I suspect this is an intentional decision by the gnomAD team to avoid false positives in their assessments.

Assigning to @prashantuniyal02 in the first instance, pending further prioritisation of the task.

d0choa commented 4 months ago

There is more context here about what changed from 2.1 to 4.0: https://gnomad.broadinstitute.org/news/2024-03-gnomad-v4-0-gene-constraint/

DSuveges commented 4 months ago

I'm not exactly sure which columns are used from these files, but the headers are quite different. The changes are mostly _ to . conversions, though (e.g. syn_z to syn.z_raw). It's possible to map them; however, I'm not sure all the details are captured.

GNOMAD2_FILE='gs://gcp-public-data--gnomad/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz'
GNOMAD4_FILE='gs://gcp-public-data--gnomad/release/4.1/constraint/gnomad.v4.1.constraint_metrics.tsv'

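# Compare the sorted header columns of the two files.
# gunzip -c is used instead of gzcat so the command also runs on Linux.
diff -i -U1000 \
    <( gsutil cat "${GNOMAD4_FILE}" | head -n 1 | tr "\t" "\n" | sort ) \
    <( gsutil cat "${GNOMAD2_FILE}" | gunzip -c | head -n 1 | tr "\t" "\n" | sort )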

Headers (- lines appear only in the v4.1 file, + lines only in the v2.1.1 file):

-canonical
+brain_expression
 cds_length
 chromosome
-constraint_flags
+classic_caf
+classic_caf_afr
+classic_caf_amr
+classic_caf_asj
+classic_caf_eas
+classic_caf_fin
+classic_caf_nfe
+classic_caf_oth
+classic_caf_sas
+constraint_flag
+defined
+end_position
+exac_exp_lof
+exac_obs_lof
+exac_oe_lof
+exac_pLI
+exp_hom_lof
+exp_lof
+exp_mis
+exp_mis_pphen
+exp_syn
 gene
 gene_id
-level
-lof.exp
-lof.mu
-lof.obs
-lof.oe
-lof.oe_ci.lower
-lof.oe_ci.upper
-lof.oe_ci.upper_bin_decile
-lof.oe_ci.upper_rank
-lof.pLI
-lof.pNull
-lof.pRec
-lof.possible
-lof.z_raw
-lof.z_score
-lof_hc_lc.exp
-lof_hc_lc.mu
-lof_hc_lc.obs
-lof_hc_lc.oe
-lof_hc_lc.pLI
-lof_hc_lc.pNull
-lof_hc_lc.pRec
-lof_hc_lc.possible
-mane_select
-mis.exp
-mis.mu
-mis.obs
-mis.oe
-mis.oe_ci.lower
-mis.oe_ci.upper
-mis.possible
-mis.z_raw
-mis.z_score
-mis_pphen.exp
-mis_pphen.obs
-mis_pphen.oe
-mis_pphen.possible
+gene_length
+gene_type
+lof_z
+max_af
+mis_z
+mu_lof
+mu_mis
+mu_syn
+n_sites
+no_lofs
 num_coding_exons
-syn.exp
-syn.mu
-syn.obs
-syn.oe
-syn.oe_ci.lower
-syn.oe_ci.upper
-syn.possible
-syn.z_raw
-syn.z_score
+obs_het_lof
+obs_hom_lof
+obs_lof
+obs_mis
+obs_mis_pphen
+obs_syn
+oe_lof
+oe_lof_lower
+oe_lof_upper
+oe_lof_upper_bin
+oe_lof_upper_bin_6
+oe_lof_upper_rank
+oe_mis
+oe_mis_lower
+oe_mis_pphen
+oe_mis_upper
+oe_syn
+oe_syn_lower
+oe_syn_upper
+p
+pLI
+pNull
+pRec
+p_afr
+p_amr
+p_asj
+p_eas
+p_fin
+p_nfe
+p_oth
+p_sas
+possible_lof
+possible_mis
+possible_mis_pphen
+possible_syn
+start_position
+syn_z
 transcript
+transcript_level
 transcript_type

prashantuniyal02 commented 3 weeks ago

@jdhayhurst can you please take a look at this issue?

jdhayhurst commented 3 weeks ago

The following changes will be required:

  1. Update the PIS config to source the new file.
  2. Update the ETL to use the new column names.

Before I can start, I will need help from the @opentargets/data-team to know how to replace the headers in the ETL with the version 4 headers (the select statement is here).

This assumes that it's as straightforward as mapping headers from gnomAD 2 to 4, but I don't know if that's the case. If it's not, we may need to make more substantial changes to the ETL.
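To make the header-mapping idea concrete, here is a minimal sketch of a lookup from v2.1.1 column names to their likely v4.1 counterparts. The function name `v2_to_v4` is hypothetical, and the pairs are assumptions inferred from name similarity in the header diff earlier in this thread; they have not been checked against the gnomAD release notes, and which v2.1.1 columns the ETL actually reads still needs confirming.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: v2.1.1 column name -> assumed v4.1 counterpart.
# ASSUMPTION: the pairs are guessed from name similarity in the header diff
# and are not verified against the gnomAD documentation.
v2_to_v4() {
  case "$1" in
    syn_z)             echo "syn.z_score" ;;
    exp_syn)           echo "syn.exp" ;;
    obs_syn)           echo "syn.obs" ;;
    oe_syn)            echo "syn.oe" ;;
    oe_syn_lower)      echo "syn.oe_ci.lower" ;;
    oe_syn_upper)      echo "syn.oe_ci.upper" ;;
    pLI)               echo "lof.pLI" ;;
    oe_lof_upper_rank) echo "lof.oe_ci.upper_rank" ;;
    oe_lof_upper_bin)  echo "lof.oe_ci.upper_bin_decile" ;;
    # Anything not listed needs a human decision rather than a silent guess.
    *) echo "no mapping for: $1" >&2; return 1 ;;
  esac
}

v2_to_v4 oe_syn_upper   # prints: syn.oe_ci.upper
```

A non-zero exit status for unlisted names makes it easy to spot columns that cannot be mapped mechanically, which is exactly the case where the ETL would need more than a rename.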

DSuveges commented 2 weeks ago

Hi @jdhayhurst, these are the current mappings:

struct(
  lit("syn").as("constraintType"),
  col("syn.z_score").cast(FloatType).as("score"),
  col("syn.exp").cast(FloatType).as("exp"),
  col("syn.obs").cast(IntegerType).as("obs"),
  col("syn.oe").cast(FloatType).as("oe"),
  col("syn.oe_ci.lower").cast(FloatType).as("oeLower"),
  col("syn.oe_ci.upper").cast(FloatType).as("oeUpper"),
  lit(null).as("upperRank"),
  lit(null).as("upperBin"),
),
struct(
  lit("mis").as("constraintType"),
  col("mis.z_score").cast(FloatType).as("score"),
  col("mis.exp").cast(FloatType).as("exp"),
  col("mis.obs").cast(IntegerType).as("obs"),
  col("mis.oe").cast(FloatType).as("oe"),
  col("mis.oe_ci.lower").cast(FloatType).as("oeLower"),
  col("mis.oe_ci.upper").cast(FloatType).as("oeUpper"),
  lit(null).as("upperRank"),
  lit(null).as("upperBin"),
),
struct(
  lit("lof").as("constraintType"),
  col("lof.pLI").cast(FloatType).as("score"),
  col("lof.exp").cast(FloatType).as("exp"),
  col("lof.obs").cast(IntegerType).as("obs"),
  col("lof.oe").cast(FloatType).as("oe"),
  col("lof.oe_ci.lower").cast(FloatType).as("oeLower"),
  col("lof.oe_ci.upper").cast(FloatType).as("oeUpper"),
  col("lof.oe_ci.upper_rank").cast(IntegerType).as("upperRank"),
  col("lof.oe_ci.upper_bin_decile").cast(IntegerType).as("upperBin"),
)
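
Before wiring a mapping like this into the ETL, it may be worth checking that every referenced column really exists in the v4.1 header. The sketch below is one way to do that; `missing_columns` is a hypothetical helper rather than part of the ETL, and the sample file is a small stand-in for the real header.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the ETL): print each expected column name
# that is absent from the header line of a tab-separated file.
# Usage: missing_columns FILE COLUMN...
missing_columns() {
  local file=$1
  shift
  local col
  for col in "$@"; do
    # -F: fixed string, so "." is not treated as a regex wildcard
    # -x: the whole column name must match, not just a substring
    if ! head -n 1 "$file" | tr '\t' '\n' | grep -qxF -- "$col"; then
      printf '%s\n' "$col"
    fi
  done
}

# Tiny stand-in for the real file; only the header line matters here.
sample=$(mktemp)
printf 'gene\tsyn.z_score\tsyn.oe\tlof.pLI\n' > "$sample"

missing_columns "$sample" syn.z_score syn.oe lof.pLI lof.oe_ci.upper_rank
# prints: lof.oe_ci.upper_rank
```

For the real data, the header could first be saved locally, for example with `gsutil cat "${GNOMAD4_FILE}" | head -n 1 > header.tsv`, and then passed as FILE. An empty result means every listed name was found.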

As discussed with @d0choa, we are dropping the "upperBin6" column from the schema, as the new dataset has no equivalent column. This adds some complexity on the frontend, because the five-star system currently in use requires sextiles.

Keep in mind this is not a burning issue right now, and @prashantuniyal02 might de-prioritise it given our main focus on genetics integration.