Open yk-tanigawa opened 4 years ago
Focusing on the finalized summary statistic files (gwas-current-gz-wc.20200704-155715.annotated.tsv; please see #21 for details), we started LDSC munge. There are 20,940 such files across 7 populations, and we have submitted the computation.
As of now,
Please see the analysis scripts for more info.
It turned out that there was an issue in the filtering conditions, so we are now computing LDSC munge for all sum stats in the gwas/current directory.
We now have 19,163+ munged sumstats (3,669 for WB).
Once GWAS is finalized, we can identify the updated sum stats (~1,100 in total; ~880 will be overwritten and ~230 will be added) and re-apply LDSC munge.
We considered applying LDSC munge to the meta-analyzed summary statistics (to get a phenotype mapping for #25), but we decided to use the WB sum stats for the mapping between FinnGen and UKB.
Files are in /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc
With progress on #21, we should refresh this and update the #27 analysis
In 1_remove-incomplete-20200713.sh, we fixed the previous error in the filtering condition.
In the original version of 1_generate_input_list.sh, we incorrectly specified `NR>1 || $NF == 1080969`, but it should have been `NR>1 && $NF == 1080969`. This resulted in 909 extra munged files.
Those were NOT used in the heritability analysis. In this script, we remove those 909 files.
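For anyone reading later, here is a small sketch of why the `||` version let extra files through. The input is a made-up four-line table (the file names and the `n_lines` header are assumptions for illustration, not the real list); only the two awk conditions come from the scripts discussed above.

```shell
# Toy input: a header plus three rows; the last column stands in for the
# count that the real filter compares against 1080969 (values are made up).
toy_input=$(printf '%s\n' \
  'path n_lines' \
  'a.gz 1080969' \
  'b.gz 512' \
  'c.gz 1080969')

# Buggy condition: NR>1 is already true for every non-header row,
# so the count check never excludes anything.
buggy=$(printf '%s\n' "$toy_input" | awk 'NR>1 || $NF == 1080969 {print $1}' | paste -sd' ' -)

# Fixed condition: keep non-header rows only when the count matches.
fixed=$(printf '%s\n' "$toy_input" | awk 'NR>1 && $NF == 1080969 {print $1}' | paste -sd' ' -)

echo "buggy keeps: $buggy"
echo "fixed keeps: $fixed"
```

With `||`, the row for `b.gz` is kept even though its count does not match; with `&&`, it is dropped.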
With the finalized GWAS results (#21), we apply LDSC munge again.
1_LDSC_munge.20200717-210250.job.lst has 2714 files, which we run as 905 array tasks of up to 3 files each (905 * 3 = 2715).
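The array size can be derived from the list length with a ceiling division. This is a sketch under one assumption: that the trailing argument passed to `$parallel_sbatch_sh` (3 here) is the number of list entries each array task handles. That reading is consistent with the numbers in this thread but is not confirmed by the wrapper's documentation. The helper name `array_tasks` is ours, for illustration only.

```shell
# array_tasks N_FILES BATCH_SIZE
# Prints ceil(N_FILES / BATCH_SIZE) using integer arithmetic.
array_tasks() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# 2714 list entries, 3 per task (assumed meaning of the last argument)
array_tasks 2714 3

# 3794 list entries in the meta-analysis list, 4 per task
array_tasks 3794 4
```

In practice the first argument would come from something like `wc -l < job.lst`, and the result would go into `--array=1-N`.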
bash 1_generate_input_list.sh | tee 1_LDSC_munge.$(date +%Y%m%d-%H%M%S).job.lst | tee /dev/stderr | wc -l
ml load resbatch
ml R/3.6 gcc
sbatch -p mrivas,normal,owners --time=3:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge --output=logs/munge.%A_%a.out --error=logs/munge.%A_%a.err --array=1-905 $parallel_sbatch_sh 1_LDSC_munge.sh 1_LDSC_munge.20200717-210250.job.lst 3
Submitted batch job 4255901
find /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc -type f -name "*.gz" | wc -l
20295
ml load resbatch R/3.6 gcc
sbatch -p mrivas,normal,owners --time=3:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge --output=logs/munge.%A_%a.out --error=logs/munge.%A_%a.err --array=1-1000 $parallel_sbatch_sh 1_LDSC_munge.sh 1_LDSC_munge.20200717-231130.job.lst 1
# Submitted batch job 4260541
sbatch -p mrivas,normal,owners --time=3:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge --output=logs/munge.%A_%a.out --error=logs/munge.%A_%a.err --array=1-877 $parallel_sbatch_sh 1_LDSC_munge.sh 1_LDSC_munge.20200717-231130.job.part2.lst 1
# Submitted batch job 4260621
We also apply LDSC munge on the meta-analyzed sumstats.
ml load R/3.6 gcc resbatch
sbatch -p mrivas,normal,owners --time=1:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge_meta --output=logs/munge_meta.%A_%a.out --error=logs/munge_meta.%A_%a.err --array=1-949 $parallel_sbatch_sh 1b_LDSC_munge.sh 1_LDSC_munge.20200718-134522.metal.job.lst 4
Submitted batch job 4279977
There are some failed files...
find /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc/metal/ -type f -name "*.gz" | wc
3417 3417 399695
[ytanigaw@sh02-09n54 ~/repos/rivas-lab/ukbb-tools/07_LDSC/jobs/202007_LDSC]$ wc 1_LDSC_munge.20200718-134522.metal.job.lst
3794 3794 340971 1_LDSC_munge.20200718-134522.metal.job.lst
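The counts above (3794 list entries versus 3417 output files) leave 377 inputs unaccounted for. One way to list them is to compare the IDs in the job list against the IDs of the files that exist. The sketch below runs on a throwaway directory with made-up IDs; the `.tsv.gz` input suffix and `.sumstats.gz` output suffix are assumptions for illustration and would need to match the real naming.

```shell
workdir=$(mktemp -d)

# Toy job list (one input path per line) and toy output directory.
printf '%s\n' in/INI50.tsv.gz in/HC382.tsv.gz in/BIN1.tsv.gz > "$workdir/job.lst"
mkdir -p "$workdir/out"
touch "$workdir/out/INI50.sumstats.gz" "$workdir/out/BIN1.sumstats.gz"

# IDs we expected to munge: strip the directory and the assumed input suffix.
sed -e 's|.*/||' -e 's|\.tsv\.gz$||' "$workdir/job.lst" \
  | LC_ALL=C sort > "$workdir/expected.txt"

# IDs that actually produced an output file.
find "$workdir/out" -type f -name '*.sumstats.gz' \
  | sed -e 's|.*/||' -e 's|\.sumstats\.gz$||' \
  | LC_ALL=C sort > "$workdir/done.txt"

# comm -23 prints lines that appear only in the first (expected) file.
missing=$(LC_ALL=C comm -23 "$workdir/expected.txt" "$workdir/done.txt")
echo "missing: $missing"

rm -rf "$workdir"
```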
An update on this: because I needed to run the pairwise rg calculations across all traits, I needed to convert all of the summary statistics to the munged format. I've tabulated the phenotypes for which the sumstats munge failed with an error similar to the following:
Traceback (most recent call last):
File "/opt/ldsc/munge_sumstats.py", line 701, in munge_sumstats
check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
File "/opt/ldsc/munge_sumstats.py", line 373, in check_median
raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of SIGNED_SUMSTATS is 0.11 (should be close to 0.0). This column may be mislabeled.
These are at https://github.com/rivas-lab/ukbb-tools/blob/master/07_LDSC/helpers/affected_metal_traits.txt.
A quick check on the gwas.qc.tsv file for the array-combined dataset indicates that these are summary statistics for traits with low N overall.
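To see what the error above is checking, here is a rough stand-alone version of the test: take the median of the signed statistic and flag it when it is more than 0.1 away from 0. The 0.1 tolerance and the expected median of 0.0 are taken from the traceback; the five values are made up, and this is our approximation of the check, not the LDSC code itself.

```shell
# median: read one number per line on stdin, print the median.
median() {
  LC_ALL=C sort -g | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Toy signed statistics (made up), skewed toward positive values.
m=$(printf '%s\n' 0.30 -0.05 0.11 0.02 0.25 | median)

# Flag the median when |m - 0| exceeds the tolerance of 0.1.
status=$(awk -v m="$m" -v tol=0.1 'BEGIN {
  d = (m < 0) ? -m : m
  print (d > tol) ? "flagged" : "ok"
}')

echo "median=$m status=$status"
```

Running the same `median` helper on the effect-size column of a failing file (for example via `zcat file | cut -f N`, with the column index chosen by hand) should show how far from zero the median sits.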
We convert the UKB sumstats into LDSC munge format.
This will enable us to perform