Enumerate cohort-specific QC

It has become clear from working through the GWAS tutorial, the UKBB paper, and discussions with @marivascruz that QC can be separated into standard tasks that can be applied to most cohorts and cohort-specific tasks. Importantly, some cohort-specific QC must be discovered through exploratory analysis of each cohort.

It seems that PLINK handles the standard QC steps well, while a tool like Hail is most useful for discovering and implementing cohort-specific QC.

To better motivate the work we're doing in this repo, it would be useful to enumerate each category of QC task. It would be particularly helpful to have concrete examples of cohort-specific QC discovery, such as the ScatterShot-derived metric @marivascruz discussed on our call today.

As starting points, @cseed has pointed us to the gnomAD team's QC efforts, and @marivascruz mentioned that the BBJ did some cohort-specific QC.

For the BBJ, I've found "The Biobank Japan project genotype data" and the methods section of "Genome-wide association study identifies 112 new loci for body mass index in the Japanese population" (2017). Neither says much about cohort-specific QC; here's their description of their GWAS QC:

"For the quality control of GWAS, we excluded samples with a call rate ≤0.98. Closely related samples, which were estimated using identity by state (IBS), were excluded by visual inspection. We performed principal component analysis (PCA) for genotype using an in-house program based on the algorithm implemented by smartpca (ref. 49), and we excluded outliers from the East Asian cluster. Finally, we calculated the Z-score for height by linear regression using age, sex, status of 47 diseases, and the top 10 principal components (PCs) and excluded individuals outside of ±4 s.d. for the purpose of quality control of the phenotype data."
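To make two of the quoted steps concrete, here is a minimal Python sketch of the sample call-rate filter and the regression-based phenotype outlier filter. It is not the BBJ implementation: all function names are hypothetical, a missing genotype is assumed to be encoded as `None`, and the regression uses a single covariate (say, age) instead of the full set of age, sex, disease status, and 10 PCs described in the paper.

```python
from statistics import fmean, stdev

MISSING = None  # assumption: a no-call genotype is encoded as None


def call_rate(genotypes):
    """Fraction of non-missing genotype calls for one sample."""
    called = sum(g is not MISSING for g in genotypes)
    return called / len(genotypes)


def passes_call_rate(genotypes, threshold=0.98):
    """Keep a sample only if its call rate is strictly above the threshold.

    The quoted text excludes samples with call rate <= 0.98.
    """
    return call_rate(genotypes) > threshold


def residual_z_scores(covariate, phenotype):
    """Z-scores of residuals from a one-covariate least-squares fit.

    Simplification: the paper regresses on many covariates; one is used here.
    """
    mx, my = fmean(covariate), fmean(phenotype)
    sxx = sum((x - mx) ** 2 for x in covariate)
    sxy = sum((x - mx) * (y - my) for x, y in zip(covariate, phenotype))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(covariate, phenotype)]
    # Residuals have mean zero, so dividing by their sample s.d. gives z-scores.
    sd = stdev(residuals)
    return [r / sd for r in residuals]


def phenotype_inliers(covariate, phenotype, max_abs_z=4.0):
    """True for individuals whose residual z-score lies within +/- max_abs_z."""
    return [abs(z) <= max_abs_z for z in residual_z_scores(covariate, phenotype)]
```

The relatedness (IBS) and PCA outlier steps are left out on purpose: the paper describes both as relying on visual inspection or an in-house program, which is exactly the kind of cohort-specific judgment this issue is trying to enumerate.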
