yk-tanigawa commented 4 years ago

As a QC of the GWAS sum stats freeze, we perform line counts.

We identify the (pop, GBE_ID) pairs that satisfy the minimum N >= 100 criterion, then check whether the array-combined/gwas/current directory contains results for each of them.
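The check described above amounts to a set difference between the expected (pop, GBE_ID) pairs and the pairs that already have a result file. A minimal sketch, assuming both lists are available as two-column TSV files. The function name `list_missing_pairs`, the file names, and the toy trait IDs are hypothetical and are not taken from `check.missing_pop_GBE.sh`:

```shell
set -eu

# Print (population, GBE_ID) pairs listed in the first file but not the second.
# Assumption: both inputs are two-column TSV files (hypothetical layout).
list_missing_pairs() {
    lmp_tmp=$(mktemp -d)
    sort -u "$1" > "$lmp_tmp/expected.sorted"
    sort -u "$2" > "$lmp_tmp/found.sorted"
    # comm -23 keeps lines that appear only in the first (sorted) input.
    comm -23 "$lmp_tmp/expected.sorted" "$lmp_tmp/found.sorted"
    rm -rf "$lmp_tmp"
}

# Toy demonstration with made-up pairs.
demo_dir=$(mktemp -d)
printf 'african\tGBE_A\ne_asian\tGBE_A\nwhite_british\tGBE_A\n' > "$demo_dir/expected.tsv"
printf 'african\tGBE_A\nwhite_british\tGBE_A\n' > "$demo_dir/found.tsv"
list_missing_pairs "$demo_dir/expected.tsv" "$demo_dir/found.tsv"
rm -rf "$demo_dir"
```

In the toy data only the e_asian pair has no result file, so it is the only line printed.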

For the files linked from the array-combined/gwas/current directory, we apply wc -l to check whether the sum stats are complete.
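As an illustration of the line-count step, here is a small sketch that decompresses each file in a list and records its line count. It is not the content of `gwas-current-gz-wc.sh`; the function name `count_gz_lines` and the toy file are assumptions made for this example:

```shell
set -eu

# Print "<line count><TAB><path>" for every gzip-compressed file in a list.
# $1: text file with one path per line (hypothetical input).
count_gz_lines() {
    while IFS= read -r gz_path; do
        # gzip -dc streams the decompressed content; wc -l counts its lines.
        n_lines=$(gzip -dc "$gz_path" | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$n_lines" "$gz_path"
    done < "$1"
}

# Toy demonstration: a header plus three rows gives four lines.
demo_dir=$(mktemp -d)
printf 'header\nrow1\nrow2\nrow3\n' | gzip -c > "$demo_dir/toy.glm.gz"
printf '%s\n' "$demo_dir/toy.glm.gz" > "$demo_dir/list.txt"
count_gz_lines "$demo_dir/list.txt"
rm -rf "$demo_dir"
```

A complete file then shows the same count as every other complete file, which is what makes a simple tally of the counts useful as a QC step.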

Summary

missing sum stats

As of 2020/6/27, the following number of traits are missing from the gwas/current directory:

Screenshot 2020-07-04 13 22 55

The corresponding analysis notebook.

For the others and related populations, the jobs have been submitted.

incomplete sum stats

As of 2020/6/29, here is the summary of wc -l across populations.

The corresponding analysis notebook.

yk-tanigawa commented 4 years ago

Update on counts

1. Finalized summary statistics files

wc_l	population	n
1080969	african	2696
1080969	e_asian	1917
1080969	non_british_white	3226
1080969	others	3294
1080969	related	3357
1080969	s_asian	2863
1080969	white_british	3587
1080600	others	1
1080600	related	3
1080278	others	144
1080278	related	83

We investigated the following files in a separate ticket (#18):
- The 4 files with 1,080,600 lines
- The 227 files with 1,080,278 lines

2. Files that will be fixed with the on-going computation

wc_l	population	n
1059397	african	216
1059397	e_asian	252
1059397	non_british_white	148
1059397	s_asian	132
1059397	white_british	132

We are computing the sum stats for those in #19

3. File(s) that need attention

wc_l	population	n
1080567	white_british	1

I thought #17 had fixed this file, but that was not the case.

Also, #20 includes some fixes.

4. Other incomplete or missing files

There are 302 files that need to be generated and/or refreshed.

For more information, please check here

$ cat gwas-current-gz-wc.20200704-155715.combined.tsv | awk '$5 < 1059397' | wc
    302    2114   43103

$ cat gwas-current-gz-wc.20200704-155715.combined.tsv | awk '$5 < 1059397' | cut -f2 | sort | uniq -c
     20 african
     96 e_asian
     41 non_british_white
     12 others
     11 related
     14 s_asian
    108 white_british

yk-tanigawa commented 4 years ago

#20 is now finished.

yk-tanigawa commented 4 years ago

So, in terms of the remaining jobs, we have

20 binary traits in #19
27 QTs in #19
65 traits in #24

yk-tanigawa commented 4 years ago

The patch (#19) generated 824 files with 1,080,278 lines and one file with 1,080,969 lines. The 824 files are 691 lines short of the full count because the chrY variants were skipped, as reported in the log:

Skipping chrY in --glm regression on phenotype 'PHENO1'
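One way to confirm where the missing lines come from is to tally data rows per chromosome and compare a short file against a complete one. The sketch below assumes a tab-delimited file with one header line and the chromosome in the first column (as in the `#CHROM` column of PLINK 2 `--glm` output). The function name `count_rows_per_chrom` and the toy data are hypothetical:

```shell
set -eu

# Deficit relative to a complete file (both counts are from this thread).
echo "$((1080969 - 1080278))"

# Count data rows per chromosome in a gzip-compressed summary statistics file.
# Assumption: tab-delimited, one header line, chromosome in column 1.
count_rows_per_chrom() {
    gzip -dc "$1" |
        awk -F'\t' 'NR > 1 { n[$1]++ } END { for (c in n) print c "\t" n[c] }' |
        sort
}

# Toy demonstration with a made-up two-column file.
demo_dir=$(mktemp -d)
printf '#CHROM\tPOS\n1\t100\n1\t200\nX\t300\n' | gzip -c > "$demo_dir/toy.glm.gz"
count_rows_per_chrom "$demo_dir/toy.glm.gz"
rm -rf "$demo_dir"
```

If the skipped chrY variants account for the whole deficit, the chrY row of a complete file should show 691 and be absent from the short files. How chrY is labeled in the first column is not stated in this thread.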

yk-tanigawa commented 4 years ago

Re-computing wc -l

cd ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish

find /oak/stanford/groups/mrivas/ukbb24983/array-combined/gwas/current -name "*.gz" -type l | sort > /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas-current-gz-list.$(date +%Y%m%d-%H%M%S).txt

# gwas-current-gz-list.20200711-232847.txt
# 22040 lines

ml load resbatch
ml load R/3.6 gcc

sbatch -p mrivas --qos=high_p --time=1:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=wc --output=logs/wc.%A_%a.out --error=logs/wc.%A_%a.err --array=1-959 $parallel_sbatch_sh gwas-current-gz-wc.sh gwas-current-gz-list.20200711-232847.txt 23

# Submitted batch job 3923943

bash check.missing_pop_GBE.sh
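The `--array=1-959` range is consistent with splitting the 22040 listed files into batches of 23. This reading of the trailing `23` argument as a per-task batch size is an assumption, since the wrapper script is not shown here. Under that assumption, the array size is a ceiling division, sketched below with a hypothetical helper `n_array_tasks`:

```shell
set -eu

# Number of array tasks needed to cover n_files in batches of batch_size.
# $1: n_files, $2: batch_size (assumed meaning of the trailing "23").
n_array_tasks() {
    # Integer ceiling division: (n + b - 1) / b
    echo "$(( ($1 + $2 - 1) / $2 ))"
}

n_array_tasks 22040 23
```

The same formula gives 964 for the 22172-line list used in the later refresh, which matches the `--array=1-964` range there.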

yk-tanigawa commented 4 years ago

missing_pop_GBE.minN100.20200711-233434.tsv

To do: aggregate the wc -l results following the instructions here

yk-tanigawa commented 4 years ago

Aggregate the wc -l results

find logs/ -name "wc.392*err" | parallel 'tail {}' | grep array-end | wc -l
959

bash gwas-current-gz-wc-cat.sh
gwas-current-gz-wc.20200712-071851.tsv

rm gwas-current-gz-list.20200711-232847.txt
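The array-end count above confirms that all 959 tasks finished. If the count had come up short, a sketch like the following could list the logs that lack the marker. It searches each whole log rather than only its tail, and the function name `list_unfinished_logs` and the toy log contents are hypothetical:

```shell
set -eu

# Print log files under a directory that lack the "array-end" marker.
# $1: directory holding wc.*.err logs.
list_unfinished_logs() {
    for log_path in "$1"/wc.*.err; do
        [ -e "$log_path" ] || continue   # the glob matched nothing
        grep -q 'array-end' "$log_path" || printf '%s\n' "$log_path"
    done
}

# Toy demonstration: the second log has no end marker.
demo_dir=$(mktemp -d)
printf 'toy line\narray-end\n' > "$demo_dir/wc.1_1.err"
printf 'toy line\n' > "$demo_dir/wc.1_2.err"
list_unfinished_logs "$demo_dir"
rm -rf "$demo_dir"
```

The task IDs in the printed file names indicate which array elements would need to be resubmitted.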

yk-tanigawa commented 4 years ago

Update on counts

$ cat gwas-current-gz-wc.20200712-071851.tsv | awk '(NR>1){print $3}' | sort -nr | uniq -c
  20952 1080969
      4 1080600
      1 1080567
   1051 1080278
      1 1024521
      1 1001756
      1 891019
      1 890538
      1 837659
      1 836299
      1 779198
      1 727614
      1 726430
      1 725231
      1 672894
      1 671362
      1 454904
      1 412155
      1 402771
      1 400782
      1 399908
      1 347741
      1 345881
      1 314970
      1 238354
      1 238278
      1 184890
      1 183541
      1 163630
      1 163593
      1 130073
      1 129516
      1 106772
      1 75190
      1 75167
      1 21171

We have already investigated the following:

The 4 files with 1,080,600 lines (#18)
The 227 + 824 files with 1,080,278 lines (#18, #19)

Unknown error (?)

The 1 file with 1,080,567 lines (see #17)

on-going effort

#24
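To list the individual files behind the remaining counts, the tally can be filtered against the line counts that are already accounted for. A minimal sketch, assuming a tab-delimited file with one header line and the line count in column 3 (the column used in the awk call earlier in this comment). The other columns, the function name `list_unexplained_counts`, and the toy rows are hypothetical:

```shell
set -eu

# Print rows whose line count (column 3) is not one of the known values.
# Known values: complete files (1080969) and the two counts investigated
# in #18 and #19 (1080600, 1080278).
list_unexplained_counts() {
    awk -F'\t' -v known='1080969 1080600 1080278' '
        BEGIN { n = split(known, k, " "); for (i = 1; i <= n; i++) ok[k[i]] = 1 }
        NR > 1 && !($3 in ok)
    ' "$1"
}

# Toy demonstration with a made-up three-column table.
demo_dir=$(mktemp -d)
printf 'file\tpopulation\twc_l\n' > "$demo_dir/toy.tsv"
printf 'a.gz\tafrican\t1080969\nb.gz\te_asian\t21171\n' >> "$demo_dir/toy.tsv"
list_unexplained_counts "$demo_dir/toy.tsv"
rm -rf "$demo_dir"
```

With this choice of known values, the file with 1,080,567 lines is also reported, which fits its status as an unknown error.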

yk-tanigawa commented 4 years ago

wc -l refresh

cd ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish

find /oak/stanford/groups/mrivas/ukbb24983/array-combined/gwas/current -name "*.gz" -type l | sort > /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas-current-gz-list.$(date +%Y%m%d-%H%M%S).txt

# gwas-current-gz-list.20200717-000322.txt
# 22172 lines

ml load resbatch
ml load R/3.6 gcc

sbatch -p mrivas --qos=high_p --time=1:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=wc --output=logs/wc.%A_%a.out --error=logs/wc.%A_%a.err --array=1-964 $parallel_sbatch_sh gwas-current-gz-wc.sh gwas-current-gz-list.20200717-000322.txt 23

# Submitted batch job 4220848

bash check.missing_pop_GBE.sh
# missing_pop_GBE.minN100.20200717-001023.tsv

yk-tanigawa commented 4 years ago

Screenshot 2020-07-17 10 02 13

yk-tanigawa commented 4 years ago

The line count looks good.

Based on these results, we started the following computation:

LDSC munge: #26
UKB Metal: #22

We can now jump on QC: #32

rivas-lab / ukbb-tools

GWAS finishing effort - Simple line counts check #21
