Open d0choa opened 4 months ago
There is more context here about what's changed from 2.1 -> 4.0 https://gnomad.broadinstitute.org/news/2024-03-gnomad-v4-0-gene-constraint/
Not exactly sure which columns are used from these files, but they are quite different. Mostly "_" to "." conversions though (e.g. syn_z to syn.z_raw). It's possible to map; however, I'm not sure whether all the details are captured.
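# gnomAD constraint files: v2.1.1 (bgzipped) and v4.1 (plain TSV)
GNOMAD2_FILE='gs://gcp-public-data--gnomad/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz'
GNOMAD4_FILE='gs://gcp-public-data--gnomad/release/4.1/constraint/gnomad.v4.1.constraint_metrics.tsv'
# Compare the sorted header names. v4.1 is the first input, so "-" lines exist only in v4.1 and "+" lines only in v2.1.1.
diff -i -U1000 \
<( gsutil cat "${GNOMAD4_FILE}" | head -1 | tr "\t" "\n" | sort ) \
<( gsutil cat "${GNOMAD2_FILE}" | gzcat | head -1 | tr "\t" "\n" | sort )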
Headers:
-canonical
+brain_expression
cds_length
chromosome
-constraint_flags
+classic_caf
+classic_caf_afr
+classic_caf_amr
+classic_caf_asj
+classic_caf_eas
+classic_caf_fin
+classic_caf_nfe
+classic_caf_oth
+classic_caf_sas
+constraint_flag
+defined
+end_position
+exac_exp_lof
+exac_obs_lof
+exac_oe_lof
+exac_pLI
+exp_hom_lof
+exp_lof
+exp_mis
+exp_mis_pphen
+exp_syn
gene
gene_id
-level
-lof.exp
-lof.mu
-lof.obs
-lof.oe
-lof.oe_ci.lower
-lof.oe_ci.upper
-lof.oe_ci.upper_bin_decile
-lof.oe_ci.upper_rank
-lof.pLI
-lof.pNull
-lof.pRec
-lof.possible
-lof.z_raw
-lof.z_score
-lof_hc_lc.exp
-lof_hc_lc.mu
-lof_hc_lc.obs
-lof_hc_lc.oe
-lof_hc_lc.pLI
-lof_hc_lc.pNull
-lof_hc_lc.pRec
-lof_hc_lc.possible
-mane_select
-mis.exp
-mis.mu
-mis.obs
-mis.oe
-mis.oe_ci.lower
-mis.oe_ci.upper
-mis.possible
-mis.z_raw
-mis.z_score
-mis_pphen.exp
-mis_pphen.obs
-mis_pphen.oe
-mis_pphen.possible
+gene_length
+gene_type
+lof_z
+max_af
+mis_z
+mu_lof
+mu_mis
+mu_syn
+n_sites
+no_lofs
num_coding_exons
-syn.exp
-syn.mu
-syn.obs
-syn.oe
-syn.oe_ci.lower
-syn.oe_ci.upper
-syn.possible
-syn.z_raw
-syn.z_score
+obs_het_lof
+obs_hom_lof
+obs_lof
+obs_mis
+obs_mis_pphen
+obs_syn
+oe_lof
+oe_lof_lower
+oe_lof_upper
+oe_lof_upper_bin
+oe_lof_upper_bin_6
+oe_lof_upper_rank
+oe_mis
+oe_mis_lower
+oe_mis_pphen
+oe_mis_upper
+oe_syn
+oe_syn_lower
+oe_syn_upper
+p
+pLI
+pNull
+pRec
+p_afr
+p_amr
+p_asj
+p_eas
+p_fin
+p_nfe
+p_oth
+p_sas
+possible_lof
+possible_mis
+possible_mis_pphen
+possible_syn
+start_position
+syn_z
transcript
+transcript_level
transcript_type
@jdhayhurst can you please take a look at this issue?
The following changes will be required:
Before I can start, I will need help from the @opentargets/data-team to know how to replace the headers in the ETL with the version 4 headers (the select statement is here).
This assumes that it's as straightforward as mapping headers from gnomad 2 to 4, but I don't know if that's the case. If it's not the case, we may need to make more substantial changes to the ETL.
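If the change does turn out to be a plain header rename, the translation could be sketched roughly as below. The function names `v2_to_v4_header` and `rename_header_line` are hypothetical, and the pairings are an assumption inferred from the header diff above (for example, that `syn_z` corresponds to `syn.z_score` rather than `syn.z_raw`); they have not been checked against the gnomAD documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the ETL): translate a gnomAD v2.1.1
# constraint column name into its ASSUMED v4.1 counterpart.
# Returns non-zero for names with no assumed equivalent.
v2_to_v4_header() {
  local metric
  case "$1" in
    exp_syn|exp_mis|exp_lof) echo "${1#exp_}.exp" ;;
    obs_syn|obs_mis|obs_lof) echo "${1#obs_}.obs" ;;
    oe_syn|oe_mis|oe_lof)    echo "${1#oe_}.oe" ;;
    oe_syn_lower|oe_mis_lower|oe_lof_lower)
      metric="${1#oe_}"; echo "${metric%_lower}.oe_ci.lower" ;;
    oe_syn_upper|oe_mis_upper|oe_lof_upper)
      metric="${1#oe_}"; echo "${metric%_upper}.oe_ci.upper" ;;
    syn_z|mis_z|lof_z)       echo "${1%_z}.z_score" ;;
    pLI|pNull|pRec)          echo "lof.$1" ;;
    oe_lof_upper_rank)       echo "lof.oe_ci.upper_rank" ;;
    oe_lof_upper_bin)        echo "lof.oe_ci.upper_bin_decile" ;;
    *) return 1 ;;
  esac
}

# Rewrite a tab-separated header line read from stdin.
# Names without an assumed mapping are passed through unchanged.
rename_header_line() {
  tr '\t' '\n' | while IFS= read -r name; do
    v2_to_v4_header "$name" || printf '%s\n' "$name"
  done | paste -s -
}
```

Names such as `oe_lof_upper_bin_6` deliberately fall through to the non-zero return, since the diff shows no v4.1 column that looks equivalent.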
Hi @jdhayhurst, these are the current mappings:
// Synonymous variants: score is the z-score; no rank or bin is available.
struct(
  lit("syn").as("constraintType"),
  col("syn.z_score").cast(FloatType).as("score"),
  col("syn.exp").cast(FloatType).as("exp"),
  col("syn.obs").cast(IntegerType).as("obs"),
  col("syn.oe").cast(FloatType).as("oe"),
  col("syn.oe_ci.lower").cast(FloatType).as("oeLower"),
  col("syn.oe_ci.upper").cast(FloatType).as("oeUpper"),
  lit(null).as("upperRank"),
  lit(null).as("upperBin"),
),
// Missense variants: same layout as synonymous.
struct(
  lit("mis").as("constraintType"),
  col("mis.z_score").cast(FloatType).as("score"),
  col("mis.exp").cast(FloatType).as("exp"),
  col("mis.obs").cast(IntegerType).as("obs"),
  col("mis.oe").cast(FloatType).as("oe"),
  col("mis.oe_ci.lower").cast(FloatType).as("oeLower"),
  col("mis.oe_ci.upper").cast(FloatType).as("oeUpper"),
  lit(null).as("upperRank"),
  lit(null).as("upperBin"),
),
// Loss of function: score is pLI (not a z-score); rank and decile bin are populated.
struct(
  lit("lof").as("constraintType"),
  col("lof.pLI").cast(FloatType).as("score"),
  col("lof.exp").cast(FloatType).as("exp"),
  col("lof.obs").cast(IntegerType).as("obs"),
  col("lof.oe").cast(FloatType).as("oe"),
  col("lof.oe_ci.lower").cast(FloatType).as("oeLower"),
  col("lof.oe_ci.upper").cast(FloatType).as("oeUpper"),
  col("lof.oe_ci.upper_rank").cast(IntegerType).as("upperRank"),
  col("lof.oe_ci.upper_bin_decile").cast(IntegerType).as("upperBin"),
)
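Before switching the input file, it may help to confirm that every column referenced in the mapping above is actually present in the v4.1 header. A minimal sketch, assuming a hypothetical helper named `missing_columns` that reads one tab-separated header line on stdin:

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of the ETL): print each required column
# name that is absent from the tab-separated header line on stdin.
missing_columns() {
  local header col
  header=$(tr '\t' '\n')
  for col in "$@"; do
    # -x: match the whole line, -F: treat dots literally, -q: no output
    printf '%s\n' "$header" | grep -qxF -- "$col" || printf '%s\n' "$col"
  done
}

# Example usage against the real file (requires gsutil and GNOMAD4_FILE):
# gsutil cat "${GNOMAD4_FILE}" | head -1 | missing_columns \
#   syn.z_score syn.exp syn.obs syn.oe syn.oe_ci.lower syn.oe_ci.upper \
#   lof.pLI lof.oe_ci.upper_rank lof.oe_ci.upper_bin_decile
```

An empty result would suggest that all listed names exist in the header; any name printed is either missing or misspelt.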
As discussed with @d0choa, we are dropping the "upperBin6" column from the schema, as the new dataset has no equivalent column. This poses some complexity on the frontend, because the five-star system currently in use requires sextiles.
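To illustrate why the decile column cannot simply stand in for sextiles, here is a rough sketch of rank-based binning. The helper `rank_to_bin` is hypothetical, and the formula (0-based rank, equal-width bins over the total number of ranked genes) is an assumption about how such bins could be derived, not a confirmed description of how gnomAD computes them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assign a 0-based rank to one of `bins` equal-width
# bins, given the total number of ranked genes.
# ASSUMPTIONS: rank starts at 0, and n_total counts exactly the ranked set.
rank_to_bin() {
  local rank=$1 n_total=$2 bins=$3
  echo $(( rank * bins / n_total ))   # integer division truncates
}

# With 18000 ranked genes (an illustrative number), the same rank lands in
# different decile and sextile bins whose boundaries do not line up:
rank_to_bin 2000 18000 10   # decile bin
rank_to_bin 2000 18000 6    # sextile bin
```

Because decile boundaries do not coincide with sextile boundaries, recovering sextiles would need the rank itself (and the size of the ranked set) rather than the decile bin.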
Keep in mind this is not a burning issue right now, and @prashantuniyal02 might de-prioritise it given our main focus on genetics integration.
The Open Targets Platform has been using gnomAD 2.1.1 as an input for the genetic constraint data. [PIS config].
It would be good to update it to use at least gnomAD 4.1, e.g. gs://gcp-public-data--gnomad/release/4.1/constraint/gnomad.v4.1.constraint_metrics.tsv. Without looking too much into the schemas, I wouldn't expect much work to get this done.
From a data perspective, I noticed that genetic constraint doesn't feature on their website for gnomAD v4.1, but it did for gnomAD 2.1. I suspect this is an intentional decision from the gnomAD team to avoid false positives in their assessments.
Assigning to @prashantuniyal02 in the first place pending further prioritisation of the task