Closed yk-tanigawa closed 4 years ago
wc_l | population | n |
---|---|---|
1080969 | african | 2696 |
1080969 | e_asian | 1917 |
1080969 | non_british_white | 3226 |
1080969 | others | 3294 |
1080969 | related | 3357 |
1080969 | s_asian | 2863 |
1080969 | white_british | 3587 |
1080600 | others | 1 |
1080600 | related | 3 |
1080278 | others | 144 |
1080278 | related | 83 |
wc_l | population | n |
---|---|---|
1059397 | african | 216 |
1059397 | e_asian | 252 |
1059397 | non_british_white | 148 |
1059397 | s_asian | 132 |
1059397 | white_british | 132 |
We are computing the sum stats for those in #19
wc_l | population | n |
---|---|---|
1080567 | white_british | 1 |
I thought #17 had fixed this file, but that was not the case.
Also, #20 contains some fixes.
There are 302 files that need to be generated and/or refreshed.
For more information, please check here
$ cat gwas-current-gz-wc.20200704-155715.combined.tsv | awk '$5 < 1059397' | wc
302 2114 43103
$ cat gwas-current-gz-wc.20200704-155715.combined.tsv | awk '$5 < 1059397' | cut -f2 | sort | uniq -c
20 african
96 e_asian
41 non_british_white
12 others
11 related
14 s_asian
108 white_british
So, in terms of the remaining jobs, we have
The patch (#19) generated 824 files with 1080278 lines, which is 691 fewer than 1080969 because the chrY variants were skipped, and one file with 1080969 lines.
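The two line counts above (1080969 for a file that includes chrY and 1080278 for one where chrY was skipped) suggest a quick way to tally files by class. The following is a minimal sketch, not part of the repository: the function name `classify_wc` is hypothetical, and it assumes the line counts arrive on stdin, one integer per line.

```shell
# Tally line counts (one integer per line on stdin) into three classes.
# The two reference counts are the ones reported in this thread;
# classify_wc itself is a hypothetical helper, not a script from the repo.
classify_wc() {
  awk -v full=1080969 -v nochry=1080278 '
    NF == 0      { next }                 # ignore blank lines
    $1 == full   { n_full++;   next }     # file includes chrY
    $1 == nochry { n_nochry++; next }     # chrY skipped (691 fewer lines)
                 { n_other++ }            # anything else needs a closer look
    END {
      printf "full\t%d\n",         n_full
      printf "chrY_skipped\t%d\n", n_nochry
      printf "other\t%d\n",        n_other
    }'
}
```

For example, `awk '(NR>1){print $3}' <wc table> | classify_wc` would summarize a table whose third column holds the line count.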
Skipping chrY in --glm regression on phenotype 'PHENO1'
Re-computing wc -l
cd ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish
find /oak/stanford/groups/mrivas/ukbb24983/array-combined/gwas/current -name "*.gz" -type l | sort > /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas-current-gz-list.$(date +%Y%m%d-%H%M%S).txt
# gwas-current-gz-list.20200711-232847.txt
# 22040 lines
ml load resbatch
ml load R/3.6 gcc
sbatch -p mrivas --qos=high_p --time=1:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=wc --output=logs/wc.%A_%a.out --error=logs/wc.%A_%a.err --array=1-959 $parallel_sbatch_sh gwas-current-gz-wc.sh gwas-current-gz-list.20200711-232847.txt 23
# Submitted batch job 3923943
bash check.missing_pop_GBE.sh
# missing_pop_GBE.minN100.20200711-233434.tsv
ToDo --> aggregate the wc -l, following the instructions here
Aggregate the wc -l results
find logs/ -name "wc.392*err" | parallel 'tail {}' | grep array-end | wc -l
959
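The check above confirms that all 959 array tasks wrote an `array-end` marker. When the count comes up short, it helps to know which task IDs are missing it. This is a minimal sketch under stated assumptions: the function name `unfinished_tasks` is hypothetical, the log names follow the `wc.%A_%a.err` pattern from the `sbatch` call, and the whole log is searched rather than only its tail.

```shell
# List array task IDs whose .err log lacks the "array-end" marker.
# unfinished_tasks is a hypothetical helper; the log naming
# (wc.<jobid>_<taskid>.err) follows the --error pattern used in this thread.
unfinished_tasks() {
  # $1: log directory, $2: Slurm job ID, $3: number of array tasks
  i=1
  while [ "$i" -le "$3" ]; do
    f="$1/wc.${2}_${i}.err"
    # A missing log file is also reported as unfinished.
    if ! grep -q 'array-end' "$f" 2>/dev/null; then
      echo "$i"
    fi
    i=$((i + 1))
  done
}
```

With this, `unfinished_tasks logs 3923943 959` would print nothing when every task has finished, and otherwise the IDs to resubmit.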
bash gwas-current-gz-wc-cat.sh
# gwas-current-gz-wc.20200712-071851.tsv
rm gwas-current-gz-list.20200711-232847.txt
$ cat gwas-current-gz-wc.20200712-071851.tsv | awk '(NR>1){print $3}' | sort -nr | uniq -c
20952 1080969
4 1080600
1 1080567
1051 1080278
1 1024521
1 1001756
1 891019
1 890538
1 837659
1 836299
1 779198
1 727614
1 726430
1 725231
1 672894
1 671362
1 454904
1 412155
1 402771
1 400782
1 399908
1 347741
1 345881
1 314970
1 238354
1 238278
1 184890
1 183541
1 163630
1 163593
1 130073
1 129516
1 106772
1 75190
1 75167
1 21171
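The histogram shows which counts occur but not which files have them. To list the rows behind the unusual counts, something like the following could be used. It is only a sketch: `list_unexpected` is a hypothetical name, and it assumes the table has a header line and the line count in column 3, as in the `awk` command above.

```shell
# Print rows of a wc table (header on line 1, count in column 3) whose
# count is not among the accepted values given on the command line.
# list_unexpected is a hypothetical helper, not a script from the repo.
list_unexpected() {
  # $1: table file; remaining arguments: accepted line counts
  file="$1"; shift
  awk -v ok="$*" '
    BEGIN { n = split(ok, a, " "); for (i = 1; i <= n; i++) accept[a[i]] = 1 }
    NR > 1 && !($3 in accept)     # skip the header, print unexpected rows
  ' "$file"
}
```

For instance, `list_unexpected gwas-current-gz-wc.20200712-071851.tsv 1080969 1080278` would print every row whose count is neither of the two counts discussed in this thread.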
wc -l refresh
cd ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish
find /oak/stanford/groups/mrivas/ukbb24983/array-combined/gwas/current -name "*.gz" -type l | sort > /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas-current-gz-list.$(date +%Y%m%d-%H%M%S).txt
# gwas-current-gz-list.20200717-000322.txt
# 22172 lines
ml load resbatch
ml load R/3.6 gcc
sbatch -p mrivas --qos=high_p --time=1:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=wc --output=logs/wc.%A_%a.out --error=logs/wc.%A_%a.err --array=1-964 $parallel_sbatch_sh gwas-current-gz-wc.sh gwas-current-gz-list.20200717-000322.txt 23
# Submitted batch job 4220848
bash check.missing_pop_GBE.sh
# missing_pop_GBE.minN100.20200717-001023.tsv
The line count looks good.
Based on these results, we started the following computation:
We can now jump on QC: #32
As a QC of the GWAS sum stats freeze, we perform line counts.
We identify the list of (pop, GBE_ID) pairs that satisfy the minimum N >= 100 criterion. We then ask whether we have the results in the array-combined/gwas/current directory.
For the files linked from the array-combined/gwas/current directory, we apply wc -l to see if the sum stats are complete.
Summary
missing sum stats
As of 2020/6/27, we have the following number of traits missing in the gwas/current dir.
The corresponding analysis notebook.
For others and related, the jobs were submitted.
incomplete sum stats
As of 2020/6/29, here is the summary of wc -l across populations.
The corresponding analysis notebook.
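The "missing sum stats" step described in this issue amounts to a set difference between the (pop, GBE_ID) pairs that should exist and the pairs that do. A minimal sketch of that idea follows. It is not the `check.missing_pop_GBE.sh` script used in this thread: the function name `missing_pairs` is hypothetical, and both inputs are assumed to be two-column TSV files (pop, then GBE_ID) prepared beforehand.

```shell
# Report (pop, GBE_ID) pairs that are expected but not available.
# missing_pairs is a hypothetical helper; both inputs are assumed to be
# TSV files with one "pop<TAB>GBE_ID" pair per line and no header.
missing_pairs() {
  # $1: expected pairs (e.g. those with N >= 100), $2: available pairs
  exp_sorted="$(mktemp)"
  got_sorted="$(mktemp)"
  # comm needs both inputs sorted in the same collation order.
  LC_ALL=C sort -u "$1" > "$exp_sorted"
  LC_ALL=C sort -u "$2" > "$got_sorted"
  # -23 keeps only the lines unique to the first file.
  LC_ALL=C comm -23 "$exp_sorted" "$got_sorted"
  rm -f "$exp_sorted" "$got_sorted"
}
```

Piping the output through `cut -f1 | sort | uniq -c` would give a per-population count of missing traits.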