rivas-lab / ukbb-tools

Tools for preprocessing, QC, and preliminary analyses from raw UK BioBank data

LDSC munge for UKB sumstats #26

Open yk-tanigawa opened 4 years ago

yk-tanigawa commented 4 years ago

We convert the UKB sumstats into LDSC munge format.

This will enable us to perform

yk-tanigawa commented 4 years ago

Focusing on the finalized summary statistic files, we started LDSC munge.

There are 20,940 such files across 7 populations, and we submitted the computation for all of them.

As of now,

Please see the analysis scripts for more info.

yk-tanigawa commented 4 years ago

It turned out that there was an issue in the filtering condition, so we are now running LDSC munge on all summary statistics in the gwas/current directory.

We now have 19,163+ munged sumstats (3,669 for WB).

Once GWAS is finalized, we can identify the updated sum stats (~1,100 in total; ~880 will be overwritten and ~230 will be added) and re-apply LDSC munge.

yk-tanigawa commented 4 years ago

We considered applying LDSC munge to the meta-analyzed summary statistics (to get a phenotype mapping for #25), but we decided to use the WB sum stats for the mapping between FinnGen and UKB.

yk-tanigawa commented 4 years ago

Files are in /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc

yk-tanigawa commented 4 years ago

With progress on #21, we should refresh this and update the #27 analysis

yk-tanigawa commented 4 years ago

In 1_remove-incomplete-20200713.sh, we fixed the previous error in the filtering condition.

In the original version of 1_generate_input_list.sh, we incorrectly specified `NR>1 || $NF == 1080969`, but it should have been `NR>1 && $NF == 1080969`. With `||`, every non-header row passes regardless of its last field. This error resulted in 909 extra munged files.
Those were NOT used in the heritability analysis. In this script, we remove those 909 files.
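The effect of the two conditions can be reproduced on a toy table. This is a minimal sketch: the file layout (a header line plus a count in the last column) and the file names are assumptions made for illustration, not the actual input of 1_generate_input_list.sh.

```shell
# Toy table (hypothetical layout): header plus three rows; the last column
# stands in for the value that the real script compares against 1080969.
toy=$(mktemp)
printf 'file\tn_lines\nA.gz\t1080969\nB.gz\t500\nC.gz\t1080969\n' > "$toy"

# Buggy condition: NR>1 is already true for every data row, so the
# comparison on $NF never filters anything out.
buggy=$(awk -F'\t' 'NR>1 || $NF == 1080969 {print $1}' "$toy")

# Fixed condition: skip the header AND require the expected value.
fixed=$(awk -F'\t' 'NR>1 && $NF == 1080969 {print $1}' "$toy")

echo "buggy keeps: $(echo $buggy)"
echo "fixed keeps: $(echo $fixed)"
rm -f "$toy"
```

In this toy case the buggy filter keeps B.gz even though its last column does not match, which is the same mechanism that let the extra files through.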
yk-tanigawa commented 4 years ago

With the finalized GWAS results (#21), we apply LDSC munge again.

1_LDSC_munge.20200717-210250.job.lst

has 2,714 files, which fit into 905 array tasks of up to 3 files each (905 * 3 = 2,715).

bash 1_generate_input_list.sh | tee 1_LDSC_munge.$(date +%Y%m%d-%H%M%S).job.lst | tee /dev/stderr | wc -l

ml load resbatch
ml R/3.6 gcc

sbatch -p mrivas,normal,owners --time=3:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge --output=logs/munge.%A_%a.out --error=logs/munge.%A_%a.err --array=1-905 $parallel_sbatch_sh 1_LDSC_munge.sh 1_LDSC_munge.20200717-210250.job.lst 3

Submitted batch job 4255901
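The array range in the sbatch call above can be derived from the length of the job list. A minimal sketch, assuming the trailing argument passed to $parallel_sbatch_sh (3 here) is the number of job-list lines handled per array task; that reading is inferred from the numbers in this thread, not taken from the resbatch documentation. `n_array_tasks` is a hypothetical helper name.

```shell
# Number of array tasks needed to cover N job-list lines at PER_TASK lines
# per task (integer ceiling division).
# Usage: n_array_tasks N PER_TASK
n_array_tasks() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

tasks=$(n_array_tasks 2714 3)
echo "$tasks"   # prints 905
```

In practice the first argument could come from `wc -l < JOB_LIST`, so the `--array=1-N` range does not have to be computed by hand.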

yk-tanigawa commented 4 years ago

find /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc -type f -name "*.gz" | wc -l
20295

ml load resbatch R/3.6 gcc

sbatch -p mrivas,normal,owners --time=3:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge --output=logs/munge.%A_%a.out --error=logs/munge.%A_%a.err --array=1-1000 $parallel_sbatch_sh 1_LDSC_munge.sh 1_LDSC_munge.20200717-231130.job.lst 1

# Submitted batch job 4260541

sbatch -p mrivas,normal,owners --time=3:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge --output=logs/munge.%A_%a.out --error=logs/munge.%A_%a.err --array=1-877 $parallel_sbatch_sh 1_LDSC_munge.sh 1_LDSC_munge.20200717-231130.job.part2.lst 1

# Submitted batch job 4260621

yk-tanigawa commented 4 years ago

We also apply LDSC munge on the meta-analyzed sumstats.


ml load R/3.6 gcc resbatch

sbatch -p mrivas,normal,owners --time=1:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge_meta --output=logs/munge_meta.%A_%a.out --error=logs/munge_meta.%A_%a.err --array=1-949 $parallel_sbatch_sh 1b_LDSC_munge.sh 1_LDSC_munge.20200718-134522.metal.job.lst 4
Submitted batch job 4279977

yk-tanigawa commented 4 years ago

Some files failed: the output directory has fewer .gz files than the job list has entries.

find /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc/metal/ -type f -name "*.gz" | wc
   3417    3417  399695

[ytanigaw@sh02-09n54 ~/repos/rivas-lab/ukbb-tools/07_LDSC/jobs/202007_LDSC]$ wc 1_LDSC_munge.20200718-134522.metal.job.lst
  3794   3794 340971 1_LDSC_munge.20200718-134522.metal.job.lst
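If each job-list line is expected to yield one .gz file, the counts above imply 3,794 - 3,417 = 377 missing outputs. One way to find which ones are missing is to compare a sorted list of expected output names against the names actually present. The sketch below uses toy lists; the `*.sumstats.gz` names are hypothetical, and how an expected output name is derived from a job-list line depends on 1b_LDSC_munge.sh, which is not shown here.

```shell
work=$(mktemp -d)

# Hypothetical expected outputs (one per job-list line) and the outputs found.
printf '%s\n' a.sumstats.gz b.sumstats.gz c.sumstats.gz | sort > "$work/expected.txt"
printf '%s\n' a.sumstats.gz c.sumstats.gz | sort > "$work/found.txt"

# comm -23 prints lines that appear only in the first (sorted) file,
# i.e. the jobs that produced no output.
missing=$(comm -23 "$work/expected.txt" "$work/found.txt")
echo "$missing"

rm -rf "$work"
```

On the real data, found.txt could be built from the `find ... -name "*.gz"` call above, reduced to base names.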
guhanrv commented 3 years ago

An update on this: to run the pairwise rg calculations across all traits, I needed to convert all of the summary statistics to the munged format. I've tabulated the phenotypes for which munging failed with an error similar to the following:

Traceback (most recent call last):
  File "/opt/ldsc/munge_sumstats.py", line 701, in munge_sumstats
    check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
  File "/opt/ldsc/munge_sumstats.py", line 373, in check_median
    raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of SIGNED_SUMSTATS is 0.11 (should be close to 0.0). This column may be mislabeled.

The affected phenotypes are listed at https://github.com/rivas-lab/ukbb-tools/blob/master/07_LDSC/helpers/affected_metal_traits.txt.

A quick check of the gwas.qc.tsv file for the array-combined dataset indicates that these summary statistics come from traits with low N overall.
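According to the traceback, check_median is called with a tolerance of 0.1 and the message expects a median close to 0.0, so a file fails when the median of its signed statistic is more than 0.1 away from zero. A rough pre-screen for this could look like the sketch below. It assumes a tab-delimited file with a header and a signed-effect column named BETA; both the column name and the helper `column_median` are hypothetical and would need to match the actual sumstats layout.

```shell
# Median of a named column in a tab-delimited file with a header line.
# Usage: column_median FILE COLUMN
column_median() {
    awk -F'\t' -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c && $c != "NA" { print $c }
    ' "$1" | sort -g | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# Toy input; the column name BETA is an assumption about the file layout.
toy=$(mktemp)
printf 'ID\tBETA\nrs1\t0.30\nrs2\t-0.10\nrs3\t0.11\n' > "$toy"
med=$(column_median "$toy" BETA)
rm -f "$toy"

# Flag the file if |median - 0| exceeds the 0.1 tolerance from the traceback.
awk -v m="$med" 'BEGIN { if (m < 0) m = -m; exit !(m > 0.1) }' \
    && echo "median $med: would fail the check" \
    || echo "median $med: would pass the check"
```

Running such a screen over the job list before submission would separate the median failures from other causes of missing outputs.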