martinjzhang / scDRS

Single-cell disease relevance score (scDRS)
https://martinjzhang.github.io/scDRS/
MIT License
105 stars 13 forks source link

custom gene set? #2

Closed pvtodorov closed 2 years ago

pvtodorov commented 2 years ago

Hi!

I'm trying to figure out how to generate a custom geneset file using MAGMA that could plug into scDRS. I found the following text within the manuscript

We use MAGMA20 502 to compute gene-level association p-values from disease 503 GWAS summary statistics (Box 1, step 1). We use a reference panel based on individuals of European ancestry in the 1000 Genomes Project97 504 . We use a 10-kb window around the gene body to map SNPs to genes. We select the top 1,000 genes based 505 on MAGMA p-values as putative disease genes.

Could you provide an example of the MAGMA inputs and commands that performs this?

KangchengHou commented 2 years ago

Thanks for your interest!

I attach a guide to compute MAGMA gene sets from GWAS summary statistics:

# Step 1: download MAGMA software, gene location file, and reference data from
# https://ctg.cncr.nl/software/magma after this step, one should have a folder <MAGMA_DIR>
# with the following files:
# 1) <MAGMA_DIR>/magma 2) <MAGMA_DIR>/g1000_eur.(bed|bim|fam) 3) <MAGMA_DIR>/NCBI37.3.gene.loc

magma_dir=<MAGMA_DIR>

# Step 2: make gene annotation file for MAGMA using the following command, this only needs to be done
# once for different GWAS summary statistics, and the file will be saved to out/step1.genes.annot
mkdir -p out/step1
${magma_dir}/magma \
    --annotate window=10,10 \
    --snp-loc ${magma_dir}/g1000_eur.bim \
    --gene-loc ${magma_dir}/NCBI37.3.gene.loc \
    --out out/step1

# Step 3: run MAGMA using the following command, this takes a GWAS file ${trait}.pval,
# which at least has the columns: SNP, P, N, which corresponds to the SNP id
# (matched to the ${magma_dir}/g1000_eur.bim), p-value, sample size. For example,
# <trait>.pval file looks like
#
# CHR     BP      SNP             P           N
# 1       717587  rs144155419     0.453345    279949
# 1       740284  rs61770167      0.921906    282079
# 1       769223  rs60320384      0.059349    281744
#
# After this step, one should obtain a file out/step2/${trait}.gene.out, and the top genes with
# largest Z-scores can be input to scDRS.

mkdir -p out/step2
${magma_dir}/magma \
    --bfile ${magma_dir}/g1000_eur \
    --pval ${trait}.pval use='SNP,P' ncol='N' \
    --gene-annot out/step1.genes.annot \
    --out out/step2/${trait}

We will update our document correspondingly.

pvtodorov commented 2 years ago

Wow! Awesome & thank you for quick response 🚀

jaredslosberg commented 2 years ago

Hi Kangcheng. Do you have any documentation on your .sumstats files? Specifically, are the CHISQ and Z columns from the GWAS, or an output of a step within scDRS?

Here is the first few lines of PASS_IBD_deLange2017.sumstats, for example:

SNP A1 A2 N CHISQ Z rs7899632 A G 59957 3.4299 -1.852 rs3750595 A C 59957 3.3124 1.82 rs10786405 T C 59957 3.4559 1.859 rs3793692 A G 59957 0.119 -0.345 rs737657 A G 59957 0.0912 -0.302 rs17524355 T C 59957 4.02 -2.005 rs7086391 T C 59957 3.7249 -1.93

martinjzhang commented 2 years ago

Hi @jaredslosberg

Those are from GWAS and follow the LDSC format https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format

Let us know if you have further questions/suggestions.

Best, Martin

twoneu commented 2 years ago

Hi @martinjzhang! Thank you for this awesome package.

Do you have a recommended method for converting the Entrez IDs produced by MAGMA to gene symbols? It seems that the compute-score tool does not recognize the Entrez IDs (First 3 elements for 'TRAIT': [], []) and returns a segmentation fault.

This is what my geneset looks like after following the tutorial above:

image
KangchengHou commented 2 years ago

@twoneu MAGMA's NCBI37.3.gene.loc (e.g., see https://ctg.cncr.nl/software/MAGMA/aux_files/NCBI37.3.zip) has 1st column as entrez IDs and last column as gene symbol.

Also adding that the gene symbols in the .gs file should be consistent with var_names in AnnData.

twoneu commented 2 years ago

Hi Kangcheng. Do you have any documentation on your .sumstats files? Specifically, are the CHISQ and Z columns from the GWAS, or an output of a step within scDRS?

Here is the first few lines of PASS_IBD_deLange2017.sumstats, for example:

SNP A1 A2 N CHISQ Z rs7899632 A G 59957 3.4299 -1.852 rs3750595 A C 59957 3.3124 1.82 rs10786405 T C 59957 3.4559 1.859 rs3793692 A G 59957 0.119 -0.345 rs737657 A G 59957 0.0912 -0.302 rs17524355 T C 59957 4.02 -2.005 rs7086391 T C 59957 3.7249 -1.93

Sorry, another question - how would you recommend that we generate genesets using MAGMA for summary statistics like these with no P-value column?

Thank you so much for your time!

KangchengHou commented 2 years ago

@twoneu are there other columns containing information needed to derive a p-value? We can help if you provide us information about columns you have. Also see MungeSumstats which help standardize sumstats.

twoneu commented 2 years ago

The only columns are SNP, A1, A2, N, Z, and CHISQ (see PASS_IBD_deLange2017.sumstats as an example). Mungesumstats seems to require a P value to run as well.

KangchengHou commented 2 years ago

@twoneu You can use the following function to convert z-score to p-value.

import scipy.stats
import numpy as np
# zscore can be a numpy array
pval = 2 * (1 - scipy.stats.norm.cdf(np.abs(zscore)))

You need to manually add a column to the sumstats file.

UPDATE NOTE: My previous reply pval = scdrs.util.zsc2pval(zsc) is for computing one-sided p-value. But in GWAS we typically do two-sided p-value (as in the updated reply).

KangchengHou commented 2 years ago

@twoneu please see the updated reply above (i made a mistake earlier: we should compute two-sided p-value instead one-sided p-value to convert GWAS z-score to p-value)

pariaaliour commented 7 months ago

Thanks for your interest!

I attach a guide to compute MAGMA gene sets from GWAS summary statistics:

# Step 1: download MAGMA software, gene location file, and reference data from
# https://ctg.cncr.nl/software/magma after this step, one should have a folder <MAGMA_DIR>
# with the following files:
# 1) <MAGMA_DIR>/magma 2) <MAGMA_DIR>/g1000_eur.(bed|bim|fam) 3) <MAGMA_DIR>/NCBI37.3.gene.loc

magma_dir=<MAGMA_DIR>

# Step 2: make gene annotation file for MAGMA using the following command, this only needs to be done
# once for different GWAS summary statistics, and the file will be saved to out/step1.genes.annot
mkdir -p out/step1
${magma_dir}/magma \
    --annotate window=10,10 \
    --snp-loc ${magma_dir}/g1000_eur.bim \
    --gene-loc ${magma_dir}/NCBI37.3.gene.loc \
    --out out/step1

# Step 3: run MAGMA using the following command, this takes a GWAS file ${trait}.pval,
# which at least has the columns: SNP, P, N, which corresponds to the SNP id
# (matched to the ${magma_dir}/g1000_eur.bim), p-value, sample size. For example,
# <trait>.pval file looks like
#
# CHR     BP      SNP             P           N
# 1       717587  rs144155419     0.453345    279949
# 1       740284  rs61770167      0.921906    282079
# 1       769223  rs60320384      0.059349    281744
#
# After this step, one should obtain a file out/step2/${trait}.gene.out, and the top genes with
# largest Z-scores can be input to scDRS.

mkdir -p out/step2
${magma_dir}/magma \
    --bfile ${magma_dir}/g1000_eur \
    --pval ${trait}.pval use='SNP,P' ncol='N' \
    --gene-annot out/step1.genes.annot \
    --out out/step2/${trait}

We will update our document correspondingly.

Thank you so much for this post! I got a question and appreciate it if you could help me with this; I'm wondering in the second script as I have three files (bim, fam, bed) separately how do I put it here --bfile ${magma_dir}/g1000_eur. I tried to combined them in one file using cat but I keep getting this error: ERROR - parsing arguments: file '/home/paria/scratch/als-project/analysis/ocu_med_sc/rpca/scDRS/MAGMA/g1000_eur.bed' specified in --bfile does not exist or is not a file. I would like to know how to make the g1000_eur including all three .bed, .bim, .fam. Many thanks, Paria

KangchengHou commented 7 months ago

Hi,

You should keep the three files as is: g1000_eur.bed / g1000_eur.bim / g1000_eur.fam

--bfile corresponds to a prefix to the three files

pariaaliour commented 7 months ago

Hi,

You should keep the three files as is: g1000_eur.bed / g1000_eur.bim / g1000_eur.fam

--bfile corresponds to a prefix to the three files

Perfect, thanks for the swift response! I have one more questions if you don't mind. First, based on step 2, the --snp-loc is ${magma_dir}/g1000_eur.bim \ but in MAGMA tutorial (https://ctg.cncr.nl/software/MAGMA/doc/manual_v1.10.pdf) it says the --snp-loc could be provided by user. I was wondering if I have a gwas should the --snp-loc in the step2 --bfile in step 3 be the gwas? Thanks, Paria

KangchengHou commented 7 months ago

Hi,

--snp-loc should be --snp-loc ${magma_dir}/g1000_eur.bim

GWAS should be --pval in step 3

pariaaliour commented 7 months ago

Perfect, thank you!