rivas-lab / ukbb-tools

Tools for preprocessing, QC, and preliminary analyses from raw UK BioBank data

GWAS finishing effort - Simple line counts check #21

Closed by yk-tanigawa 4 years ago

yk-tanigawa commented 4 years ago

As a QC step for the GWAS sum stats freeze, we check the line counts of the summary statistics files.

We identify the list of (pop, GBE_ID) pairs that satisfy the minimum N >= 100 criterion. We then check whether the results for each pair exist in the array-combined/gwas/current directory.
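A minimal sketch of the missing-pair check, assuming the expected and the observed (pop, GBE_ID) pairs are each available as a two-column TSV. The function name, the file names, and the identifiers in the demonstration are hypothetical and are not taken from check.missing_pop_GBE.sh.

```shell
#!/usr/bin/env bash
# Sketch: list expected (pop, GBE_ID) pairs that have no result file.
# Both inputs are hypothetical two-column TSVs (population <TAB> GBE_ID).
set -eu

# Print the pairs that are in the expected list ($1) but not in the observed list ($2).
missing_pairs() {
    exp_sorted=$(mktemp)
    obs_sorted=$(mktemp)
    # comm requires sorted input; use the C locale so sort and comm agree.
    LC_ALL=C sort -u "$1" > "$exp_sorted"
    LC_ALL=C sort -u "$2" > "$obs_sorted"
    # -23 suppresses lines unique to the second file and lines common to both.
    LC_ALL=C comm -23 "$exp_sorted" "$obs_sorted"
    rm -f "$exp_sorted" "$obs_sorted"
}

# Tiny demonstration with made-up identifiers.
tmp=$(mktemp -d)
printf 'african\tINI50\ne_asian\tINI50\nwhite_british\tHC382\n' > "$tmp/expected.tsv"
printf 'african\tINI50\nwhite_british\tHC382\n' > "$tmp/observed.tsv"
missing_pairs "$tmp/expected.tsv" "$tmp/observed.tsv"
```

In the demonstration only the e_asian pair is printed, because it is the only expected pair without an observed counterpart.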

For the files linked from array-combined/gwas/current directory, we apply wc -l to see if the sum stats are complete.
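The line-count step can be sketched as follows. This is only an illustration under assumptions: the function name and the toy directory layout are hypothetical, and the actual gwas-current-gz-wc.sh script may work differently.

```shell
#!/usr/bin/env bash
# Sketch: count the lines of every symlinked *.gz file under a directory
# and flag the files whose count differs from an expected value.
set -eu

# Print "<line count><TAB><path>" for each symlinked *.gz file under $1.
count_gz_lines() {
    find "$1" -name '*.gz' -type l | LC_ALL=C sort | while IFS= read -r f; do
        # Decompress to stdout and count lines; tr strips the padding some wc versions add.
        n=$(gzip -dc < "$f" | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$n" "$f"
    done
}

# Toy layout: real files in one directory, symlinks in another.
tmp=$(mktemp -d)
mkdir "$tmp/real" "$tmp/current"
printf 'h\na\nb\n' | gzip -c > "$tmp/real/full.gz"
printf 'h\na\n'    | gzip -c > "$tmp/real/short.gz"
ln -s "$tmp/real/full.gz"  "$tmp/current/full.gz"
ln -s "$tmp/real/short.gz" "$tmp/current/short.gz"

# Report only the files that do not have the expected number of lines.
expected=3
count_gz_lines "$tmp/current" | awk -F'\t' -v e="$expected" '$1 != e'
```

For the real data, the expected value would be the full line count reported later in this thread (1080969).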

Summary

missing sum stats

As of 2020/6/27, we have the following number of traits missing in the gwas/current dir

Screenshot 2020-07-04 13 22 55

The corresponding analysis notebook.

For the others and related populations, the jobs have been submitted.

incomplete sum stats

As of 2020/6/29, here is the summary of wc -l across populations.

Screenshot 2020-07-04 13 15 23

The corresponding analysis notebook.

yk-tanigawa commented 4 years ago

Update on counts

1. Finalized summary statistics files

wc_l population n
1080969 african 2696
1080969 e_asian 1917
1080969 non_british_white 3226
1080969 others 3294
1080969 related 3357
1080969 s_asian 2863
1080969 white_british 3587
1080600 others 1
1080600 related 3
1080278 others 144
1080278 related 83

2. Files that will be fixed by the ongoing computation

wc_l population n
1059397 african 216
1059397 e_asian 252
1059397 non_british_white 148
1059397 s_asian 132
1059397 white_british 132

We are computing the sum stats for these files in #19.

3. File(s) that need attention

wc_l population n
1080567 white_british 1

I thought #17 had fixed this file, but that was not the case.

Also, #20 has some fixes.

4. Other incomplete or missing files

There are 302 files that need to be generated and/or refreshed.

For more information, please check here

$ cat gwas-current-gz-wc.20200704-155715.combined.tsv | awk '$5 < 1059397' | wc
    302    2114   43103

$ cat gwas-current-gz-wc.20200704-155715.combined.tsv | awk '$5 < 1059397' | cut -f2 | sort | uniq -c
     20 african
     96 e_asian
     41 non_british_white
     12 others
     11 related
     14 s_asian
    108 white_british
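The (wc_l, population, n) tables above can be produced from the combined TSV with a sketch like the one below. It relies on the column positions used in the commands above (population in column 2, line count in column 5); the function name and the placeholder values in the other columns are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: tabulate the number of files per (line count, population).
set -eu

# $1: tab-separated file with the population in column 2 and the line count in column 5.
tabulate_counts() {
    awk -F'\t' '{ n[$5 "\t" $2]++ }
        END { for (k in n) printf "%s\t%d\n", k, n[k] }' "$1" |
        # Largest line count first, then population name in ascending order.
        LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Toy input; columns 1, 3, and 4 are placeholders.
tmp=$(mktemp)
printf 'a\tafrican\tx\tx\t1080969\n'  > "$tmp"
printf 'b\tafrican\tx\tx\t1080969\n' >> "$tmp"
printf 'c\tothers\tx\tx\t1080278\n'  >> "$tmp"
printf 'd\te_asian\tx\tx\t1059397\n' >> "$tmp"
tabulate_counts "$tmp"
```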
yk-tanigawa commented 4 years ago

#20 is now finished.

yk-tanigawa commented 4 years ago

So, in terms of the remaining jobs, we have

yk-tanigawa commented 4 years ago

The patch (#19) generated 824 files with 1080278 lines and one file with 1080969 lines. The 824 files are 691 lines shorter than the full 1080969 lines because the chrY variants were skipped:

Skipping chrY in --glm regression on phenotype 'PHENO1'
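One way to confirm that the 691-line difference corresponds to chrY is to count the chrY rows in a complete file. The sketch below assumes that the chromosome is in the first tab-separated column and that the chromosome label is passed in by the caller; both are assumptions about the file format, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count the rows of a gzipped sum stats file that belong to one chromosome.
set -eu

# $1: gzipped tab-separated file with the chromosome in column 1 (assumed).
# $2: chromosome label to count (for example Y; the actual label is an assumption).
count_chrom_rows() {
    gzip -dc < "$1" | awk -F'\t' -v chr="$2" '$1 == chr' | wc -l | tr -d ' '
}

# Toy file: one header line plus four variants, two of them on chrY.
tmp=$(mktemp -d)
printf '#CHROM\tPOS\tID\n1\t100\tv1\nX\t200\tv2\nY\t300\tv3\nY\t400\tv4\n' |
    gzip -c > "$tmp/toy.gz"
count_chrom_rows "$tmp/toy.gz" Y
```

If the assumption holds, running this on a file with 1080969 lines should report 691 rows for chrY.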
yk-tanigawa commented 4 years ago

Re-computing wc -l

cd ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish

find /oak/stanford/groups/mrivas/ukbb24983/array-combined/gwas/current -name "*.gz" -type l | sort > /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas-current-gz-list.$(date +%Y%m%d-%H%M%S).txt

# gwas-current-gz-list.20200711-232847.txt
# 22040 lines

ml load resbatch
ml load R/3.6 gcc

sbatch -p mrivas --qos=high_p --time=1:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=wc --output=logs/wc.%A_%a.out --error=logs/wc.%A_%a.err --array=1-959 $parallel_sbatch_sh gwas-current-gz-wc.sh gwas-current-gz-list.20200711-232847.txt 23

# Submitted batch job 3923943

bash check.missing_pop_GBE.sh
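The --array upper bound follows from the length of the file list. The sketch below assumes that the trailing 23 passed to $parallel_sbatch_sh is the number of input lines handled per array task; the wrapper's interface is not shown in this thread, so this is an assumption, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: number of array tasks needed to cover a list of inputs.
set -eu

# $1: number of lines in the file list, $2: lines handled per array task.
n_array_tasks() {
    # Integer ceiling division.
    echo $(( ($1 + $2 - 1) / $2 ))
}

n_array_tasks 22040 23   # prints 959, matching --array=1-959 for the 20200711 list
n_array_tasks 22172 23   # prints 964, matching --array=1-964 for the 20200717 list
```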
yk-tanigawa commented 4 years ago

missing_pop_GBE.minN100.20200711-233434.tsv

To do: aggregate the wc -l results following the instructions here

yk-tanigawa commented 4 years ago

Aggregate the wc -l results

find logs/ -name "wc.392*err" | parallel 'tail {}' | grep array-end | wc -l
959

bash gwas-current-gz-wc-cat.sh
gwas-current-gz-wc.20200712-071851.tsv

rm gwas-current-gz-list.20200711-232847.txt
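If the count of finished tasks had come out below 959, the logs without the array-end marker would need to be identified. A minimal sketch, assuming every finished task writes a line containing array-end to its .err log; unlike the command above, it searches the whole log rather than only the tail, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list the array-task logs that never reached the "array-end" marker.
set -eu

# $1: log directory, $2: file name pattern for find (for example 'wc.392*err').
unfinished_logs() {
    find "$1" -name "$2" | LC_ALL=C sort | while IFS= read -r f; do
        if ! grep -q 'array-end' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Toy logs: task 1 finished, task 2 did not.
tmp=$(mktemp -d)
printf 'start\narray-end\n' > "$tmp/wc.392_1.err"
printf 'start\n'            > "$tmp/wc.392_2.err"
unfinished_logs "$tmp" 'wc.392*err'
```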
yk-tanigawa commented 4 years ago

Update on counts

$ cat gwas-current-gz-wc.20200712-071851.tsv | awk '(NR>1){print $3}' | sort -nr | uniq -c
  20952 1080969
      4 1080600
      1 1080567
   1051 1080278
      1 1024521
      1 1001756
      1 891019
      1 890538
      1 837659
      1 836299
      1 779198
      1 727614
      1 726430
      1 725231
      1 672894
      1 671362
      1 454904
      1 412155
      1 402771
      1 400782
      1 399908
      1 347741
      1 345881
      1 314970
      1 238354
      1 238278
      1 184890
      1 183541
      1 163630
      1 163593
      1 130073
      1 129516
      1 106772
      1 75190
      1 75167
      1 21171
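To see which files sit behind the unexpected counts, the rows can be filtered against a list of accepted values. The sketch below assumes a header row and the line count in column 3, as in the awk command above; treating 1080969, 1080600, and 1080278 as accepted follows the "Finalized summary statistics files" table earlier in this thread and is an assumption. The function name and the other columns in the toy input are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the rows whose line count is not among the accepted values.
set -eu

# $1: TSV with a header row and the line count in column 3.
# $2: space-separated list of accepted line counts.
flag_unexpected() {
    awk -F'\t' -v ok="$2" '
        BEGIN { n = split(ok, a, " "); for (i = 1; i <= n; i++) accepted[a[i]] = 1 }
        NR > 1 && !($3 in accepted)
    ' "$1"
}

# Toy input; the header names and columns 1 and 2 are placeholders.
tmp=$(mktemp)
printf 'path\tpopulation\twc_l\n'            > "$tmp"
printf 'a.gz\tafrican\t1080969\n'           >> "$tmp"
printf 'b.gz\tothers\t1080278\n'            >> "$tmp"
printf 'c.gz\twhite_british\t1080567\n'     >> "$tmp"
printf 'd.gz\te_asian\t21171\n'             >> "$tmp"
flag_unexpected "$tmp" '1080969 1080600 1080278'
```

In the toy input this prints the c.gz and d.gz rows only.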

We have already investigated the following:

Unknown error (?)

on-going effort

yk-tanigawa commented 4 years ago

wc -l refresh

cd ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish

find /oak/stanford/groups/mrivas/ukbb24983/array-combined/gwas/current -name "*.gz" -type l | sort > /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas-current-gz-list.$(date +%Y%m%d-%H%M%S).txt

# gwas-current-gz-list.20200717-000322.txt
# 22172 lines

ml load resbatch
ml load R/3.6 gcc

sbatch -p mrivas --qos=high_p --time=1:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=wc --output=logs/wc.%A_%a.out --error=logs/wc.%A_%a.err --array=1-964 $parallel_sbatch_sh gwas-current-gz-wc.sh gwas-current-gz-list.20200717-000322.txt 23

# Submitted batch job 4220848

bash check.missing_pop_GBE.sh
# missing_pop_GBE.minN100.20200717-001023.tsv
yk-tanigawa commented 4 years ago

Screenshot 2020-07-17 10 02 13

yk-tanigawa commented 4 years ago
Screenshot 2020-07-17 17 40 23
yk-tanigawa commented 4 years ago

The line count looks good.

Based on these results, we started the following computation:

We can now jump on QC: #32