Weight file allele annotation

jgockley62 commented 2 years ago

The function [make_df](https://github.com/hakyimlab/psychencode/blob/55ed82872238c7cad246fe3a61b8354cb45eb63e/analysis/generate_weights.Rmd#L72) incorrectly labels the alleles in V5 and V6 as the reference and alternate alleles, respectively. FUSION weights instead store the effect allele (the allele the weight corresponds to) in V5 and the non-effect allele in V6. To find out which of the two is the reference (REF) allele, you need to match the locus back to the reference genome and check whether the effect or the non-effect allele is the reference allele there.

An example can be found in: "psychencode/data/PEC_TWAS_weights/ENSG00000273492.wgt.RDat"

The third SNP is: 21 21:27043998 0 27043998 G T

Here the hg19 reference allele from the UCSC Genome Browser is T and the alternate allele is G.
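To make the orientation step concrete, here is a minimal Python sketch of labelling a weight-file allele pair against a reference base. The function name `orient_alleles` and the `HG19_REF` lookup are hypothetical; a real implementation would read the base from an hg19 FASTA or an annotation table instead of a hard-coded dictionary. The only data used is the SNP row and reference allele quoted above.

```python
def orient_alleles(effect_allele, other_allele, ref_allele):
    """Label an (effect, non-effect) allele pair relative to the reference genome.

    Returns (ref, alt, effect_is_ref). Raises ValueError when neither allele
    equals the reference base, because the pair cannot be oriented.
    """
    if effect_allele == ref_allele:
        return ref_allele, other_allele, True
    if other_allele == ref_allele:
        return ref_allele, effect_allele, False
    raise ValueError(
        f"neither {effect_allele} nor {other_allele} matches reference {ref_allele}"
    )


# Hypothetical lookup of hg19 reference bases keyed by (chromosome, position).
# The single entry is the value reported in this issue for 21:27043998.
HG19_REF = {("21", 27043998): "T"}

# SNP row from the weight file: chrom, id, cM, position, V5 (effect), V6 (non-effect)
row = "21 21:27043998 0 27043998 G T".split()
chrom, pos, effect, other = row[0], int(row[3]), row[4], row[5]

ref, alt, effect_is_ref = orient_alleles(effect, other, HG19_REF[(chrom, pos)])
print(ref, alt, effect_is_ref)  # T G False
```

For this SNP the effect allele (G) is the alternate allele, so labelling V5 as the reference allele would be wrong.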

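If you implement allele matching in the [annotation script](https://github.com/liangyy/misc-tools/blob/master/annotate_snp_by_position/annotate_snp_by_position.py) to keep rare multi-allelic sites from becoming an issue in other cohorts, this becomes an important feature.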
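As a rough illustration of what such allele matching could look like, the sketch below compares a weight's (effect, non-effect) pair with a cohort's REF/ALT pair at the same position. It is not taken from the annotation script: the names `match_alleles` and `effect_dosage` are hypothetical, the cohort dosage is assumed to count copies of the ALT allele, and strand flips are deliberately not handled.

```python
def match_alleles(effect, other, cohort_ref, cohort_alt):
    """Compare a weight's allele pair with a cohort's REF/ALT pair.

    Returns "same" when the cohort ALT allele is the effect allele,
    "flip" when the cohort REF allele is the effect allele, and None when
    the pairs differ (for example, another allele of a multi-allelic site).
    """
    if (effect, other) == (cohort_alt, cohort_ref):
        return "same"
    if (effect, other) == (cohort_ref, cohort_alt):
        return "flip"
    return None


def effect_dosage(alt_dosage, status):
    """Convert an ALT-allele dosage (0..2) into an effect-allele dosage."""
    if status == "same":
        return alt_dosage
    if status == "flip":
        return 2 - alt_dosage
    raise ValueError("alleles do not match; drop this SNP")


# Weight file says effect = G, non-effect = T at 21:27043998.
print(match_alleles("G", "T", cohort_ref="T", cohort_alt="G"))  # same
print(match_alleles("G", "T", cohort_ref="G", cohort_alt="T"))  # flip
print(match_alleles("G", "T", cohort_ref="T", cohort_alt="A"))  # None
```

Returning None instead of guessing lets the caller drop SNPs whose alleles do not line up, which is the case that multi-allelic sites produce.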

To that end, do you have a version of dbSNP150_list.txt.gz available for annotating the weight files with rsIDs?

hakyim commented 2 years ago

Here is Sabrina’s response:

Thank you for letting us know! We realized that we had mixed up the reference and alternate alleles when we cross-validated with PrediXcan. I must have forgotten to update the GitHub repository, but I will go back and double-check the FUSION weights. I will also add the link to the dbSNP file: https://uchicago.box.com/s/twr1igkhpfbnz7n2mjqhpyaon47w1hzm

On Wed, Jun 15, 2022 at 5:59 PM Jake Gockley wrote:


```r
make_df <- function(file) {
  load(file)
  weights <- data.frame(wgt.matrix)
  snps <- data.frame(snps)
  rownames(weights) <- c()
  weights$gene <- substr(file, 1, nchar(file) - 9)
  weights$chromosome <- snps$V1
  weights$position <- snps$V4
  weights$ref_allele <- snps$V5
  weights$eff_allele <- snps$V6
  weights %>%
    filter(enet != 0) %>%
    select(gene, chromosome, position, ref_allele, eff_allele, enet)
}
```


jgockley62 commented 2 years ago

Thank you so much! Apologies for the nitpicky nomenclature issues; it's just really easy to get turned around when applying weight models to different data sets. The dbSNP file is very helpful!!
