jaamarks / jaamarks_notebooks

Collection of various projects and procedures documented with Jupyter notebooks.
0 stars 0 forks source link

Kreek EA—revisit #4

Closed jaamarks closed 5 years ago

jaamarks commented 5 years ago

The original GWAS was done in July of 2018 by me (Jesse Marks). The AA lambda value was good (1.011), but the EAs had a significant amount of inflation (1.223). For the EAs, two PCs were included in the model (1 & 6) and they explained ~78% of the variance. In an attempt to reduce the inflation we added the top 10% PCs to the model. This did reduce the inflation some, however the lambda value is still inflated as you can see below. Refer to the [GitHub issue #53] for more information.

EA

kreek ea 1000g oaall maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels manhattan kreek ea 1000g oaall maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels qq

jaamarks commented 5 years ago

In the overarching Bioinformatics Github Issue #53, on October 3, 2017 Eric Johnson describes that wants to combine the phenotype variables opioid_dep and opioid_abuse into one variable named opioid_abdp. I need to follow this description for the next GWAS I run. Before, I was only using the opioid_dep variable.

Recap of latest plan of action: Because the lambda value for the Kreek_EA group was still inflated after adding all 10 PCs, we are going back in to the quality control procedures and being more strict with the Structure plot ancestral thresholds. Before, the thresholds we used were the traditional---

Action Description Thresholding Criteria
For AA retainment (AFR > 25%) ∧ (EAS < 25%)
For EA retainment (AFR < 25%) ∧ (EAS < 25%)

These were strictly the ancestral thresholds (ignoring the self-report). I can apply a more stringent cutoff for the EAs

Action Description Thresholding Criteria
For AA retainment (AFR > 25%) ∧ (EAS < 25%)
For EA retainment (AFR < 10%) ∧ (EAS < 10%)

and see if this cleans up the plots.

This is what the filtered triangle plot looked like before: image

Currently (20181116) I am working on redoing the QC for Kreek and cleaning up the EA ancestry group. I will then submit the data to the Michigan Imputation Server (MIS) for imputation. Once that is finished, I will redo the GWAS for both AA and EA and include as a covariate the newly defined variable opioid_abdp. From there, I can assess whether I need to add the covariates cocaine_abdp and alcohol_abdp.

jaamarks commented 5 years ago

Sample R code

phenoD <- "/shared/data/studies/heroin/kreek/phenotype/"
in_fname <- "unprocessed/phs001109.v1.pht005447.v1.p1.c1.Addictive_Diseases_Subject_Phenotypes.DS-ADX-IRB-NPU.txt"
out_fname <- "processing/kreek.phenotype_table"

phenoF <- read.table(paste(phenoD, in_fname, sep=""), sep='\t', header = T)
filtered_phen <- phenoF[c('SUBJECT_ID', 'age', 'sex', 'opioid_dep', 'family_race', 'opioid_abuse')]

nrow(filtered_phen[filtered_phen$family_race=='White' & filtered_phen$opioid_dep==1 & filtered_phen$opioid_abuse==1,])
variable EA_cases EA_controls AA_cases AA_controls
opioid_dep 429 224 215 384
opioid_abuse 298 355 198 401
opioid_abdp 430 221 231 367
jaamarks commented 5 years ago

Thresholds used to generate the triangle plots for Kreek_EA and Kreek_AA.

Action Description Thresholding Criteria
For AA retainment (AFR > 25%) ∧ (EAS < 25%)
For EA retainment (AFR < 10%) ∧ (EAS < 10%)

Pre-filtered Post-filtered
afr_eas_eur_kreek_aa afr_eas_eur_filtered2_kreek_aa
afr_eas_eur_kreek_ea afr_eas_eur_filtered2_kreek_ea





A breakdown of the filtering results is in the table below. Note, these counts are different from before. In the previous QC processing I had included subjects with self-reported ancestry of AA, EA, or HA. Some of the HA subjects were reassigned to the AA or EA group after the Structure analysis. Then we proceeded with only the AA and EA ancestry groups. During this round of processing, I began only with subjects having self-reported ancestry of AA or EA. Therefore, the final counts are fewer after ancestry reassignments this time because there were no self-reported HA subjects reassigned to AAs or EAs, and also because the EA ancestry threshold is more stringent this round.

Ancestry Subjects Pre-Filtering Subjects Removed Subjects Added Subjects Post-Filtering
EA 651 36 4 619
AA 598 7 8 599

If this all sounds reasonable, I will continue with the QC processing of Kreek_AA and Kreek_EA.

jaamarks commented 5 years ago

Here are the triangle plots for the AAs and EAs. Below is also the pre-filtered self-reported HA triangle plot. I processed the self reported AAs, EAs, and HAs. Self-reported ancestry was then disregarded during the filtering so that any HA subject that was clearly AA or EA was reassigned. The filtering threshold used were:

Action Description Thresholding Criteria
For AA retainment (AFR > 25%) ∧ (EAS < 25%)
For EA retainment (AFR < 10%) ∧ (EAS < 10%)


AA

Pre-filtered Post-filtered
image image




EA

Pre-filtered Post-filtered
image image




HA (prefilter)

image




Here is a breakdown of the filtering results.

Ancestry Subjects Pre-Filtering Subjects Removed Subjects Added Subjects Post-Filtering
EA 651 35 18 634
AA 598 7 125 716
jaamarks commented 5 years ago

Genotype QC for the Kreek data set is finished. The quality control statistics are below. We will now proceed with the oaall GWAS of Kreek_EA in an attempt to lower the inflated lambda value seen earlier.

European Ancestry

The table below provides statistics on variants and subjects filtered during each step of the QC process.

Note: Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers may not match between the .fam files and the phenotype files

Autosome statistics

This table includes filtering statistics prior to merging with chrX.

QC procedure Variants removed Variants retained Subjects removed Subjects added Subjects retained
* EA subject filter w/initial procedures (all chr) 0 565,791 1,004 0 651
* Partitioning to only autosomes 13,926 551,865 0 0 651
* Remove subjects missing whole autosome data 0 551,865 0 0 651
* Duplicate rsID filtering 0 551,865 0 0 651
* Strand flipping and polymorphic SNP filter 7,173 544,692 0 0 651
* Remove/Reassign ancestral outliers 0 544,692 35 18 634
* Remove variants with missing call rate > 3% 10,907 533,785 0 0 634
* Remove variants with HWE p < 0.0001 914 532,871 0 0 634

ChrX statistics

This table includes filtering statistics prior to merging with autosomes.

QC procedure Variants removed Variants retained Subjects removed Subjects added Subjects retained
* EA subject filter w/initial procedures (all chr) 0 565,791 1,004 0 651
* Partitioning to only chrX 551,865 13,926 0 0 651
* Remove subjects missing whole chrX data 0 13,926 0 0 651
* Duplicate rsID filtering 0 13,926 0 0 651
* Remove/Reassign ancestral outliers 0 13,926 35 18 634
* Remove variants with missing call rate > 3% 194 13,732 0 0 634
* Remove variants with HWE p < 0.0001 1 13,731 0 0 634

Merged autosome and chrX statistics

QC procedure Variants removed Variants retained Subjects removed Subjects added Subjects retained
* Merge autosomes and chrX 0 546,602 0 0 634
* Remove subjects with IBD > 0.4 or IBS > 0.9 0 546,602 17 0 617
* Remove subjects with missing call rate > 3% 0 546,602 3 0 614
* Sex discordance filter 0 546,602 0 0 614
* Excessive homozygosity filter 0 546,602 0 0 614
* Duplicate variant ID filter after 1000G renaming 1 546,601 0 0 614
* Remove ambiguous SNPs 36,549 510,052 0 0 614




African Ancestry

The table below provides statistics on variants and subjects filtered during each step of the QC process.

Note: Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers may not match between the .fam files and the phenotype files

Autosome statistics

This table includes filtering statistics prior to merging with chrX.

QC procedure Variants removed Variants retained Subjects removed Subjects added Subjects retained
* AA subject filter w/initial procedures (all chr) 0 565,791 1,057 0 598
* Partitioning to only autosomes 13,926 551,865 0 0 598
* Remove subjects missing whole autosome data 0 551,865 0 0 598
* Duplicate rsID filtering 0 551,865 0 0 598
* Strand flipping and polymorphic SNP filter 7,173 544,692 0 0 598
* Reassign ancestral outliers 0 544,692 7 125 716
* Remove variants with missing call rate > 3% 17,129 527,563 0 0 716
* Remove variants with HWE p < 0.0001 2,453 525,110 0 0 716

ChrX statistics

This table includes filtering statistics prior to merging with autosomes.

QC procedure Variants removed Variants retained Subjects removed Subjects added Subjects retained
* AA subject filter w/initial procedures (all chr) 0 565,791 1,057 0 598
* Partitioning to only chrX 551,865 13,926 0 0 598
* Remove subjects missing whole chrX data 0 13,926 0 0 598
* Duplicate rsID filtering 0 13,926 0 0 598
* Remove/reassign ancestral outliers 0 13,926 7 125 716
* Remove variants with missing call rate > 3% 350 13,576 0 0 716
* Remove variants with HWE p < 0.0001 23 13,553 0 0 716

Merged autosome and chrX statistics

QC procedure Variants removed Variants retained Subjects removed Subjects added Subjects retained
* Merge autosomes and chrX 0 538,663 0 0 716
* Remove subjects with IBD > 0.4 or IBS > 0.9 0 538,663 23 0 693
* Remove subjects with missing call rate > 3% 0 538,663 3 0 690
* Sex discordance filter 0 538,663 0 0 690
* Excessive homozygosity filter 0 538,663 0 0 690
* Duplicate variant ID filter after 1000G renaming 1 538,662 0 0 690
* Remove ambiguous SNPs 35,542 503,120 0 0 690
jaamarks commented 5 years ago

Kreek QC, imputation, & baseline OAall GWAS now complete—results/details below. EAs still exhibit inflation (lambda=1.176). AAs are copacetic.

QC Updates: This round had updates for both EA and AA ancestry groups. In the first round of QC (finished June '18) self-reported EA, AA, and HA were to be processed. It was then decided that we should disregard self-reported ancestry and reassign subjects to the ancestry their genotypes most closely align with. Though this was the intention, I don't believe the self-reported HA subjects were actually reassigned—I think they were excluded altogether. This is evident upon examining the QC summary statistics from the first round. It is most apparent when comparing the final subject counts for the AA subjects. So for this latest round of QC, I have:


GWAS phenotype updates: In the previous Kreek OAall GWAS I used the variable opioid_dep as the phenotype of interest in the model. After investigating some of the earlier posts about the Kreek data, I learned that I should define a new variable opioid_abdp which combines the opioid_dep (dependence) and opioid_ab (abuse) variables. So,

First round of QC (June '18)

Ancestry Variant Count Subject Count
AA 501,495 576
EA 509,619 624

Latest round of QC

Ancestry Variant Count Subject Count
AA 503,120 690
EA 510,052 614



AA PCA Plots

I included PC1 & PC10 in the model. These PCs explain ~75% of the variance. image


EA PCA Plots

I included PC1, PC3, and PC5 in the model. These PCs explain ~77% of the variance. image




AA GWAS Plots

oaall ~ SNP + sex + age + PC1 + PC10

Manhattan QQ
image image



EA GWAS Plots

oaall ~ SNP + sex + age + PC1 + PC3 + PC5

Manhattan QQ
image image





I also ran a GWAS for Kreek_EA with all of the top 10 PCs. This did not reduce inflation, in fact, it actually increased the lambda value slightly to 1.183 from 1.176. Here are those plots

EA GWAS Plots (all PCs included)

oaall ~ SNP + sex + age + PC1 + PC2 + ... + PC9 + PC10

Manhattan QQ
image image
jaamarks commented 5 years ago

Since there is still inflation exhibited by the EAs, another possible strategy is to apply an even more stringent ancestral threshold---in particular, apply the following thresholds and push through QC and GWAS. Note, that I do not have to impute those subjects again. I simply need the IDs of the subjects to keep. Attached are those subjects IDs of the subjects to keep in the triangle plot.

Action Description Thresholding Criteria
For EA retainment (AFR < 5%) ∧ (EAS < 5%)

EA Triangle Plot

image

Ancestry Subjects Pre-Filtering Subjects Removed Subjects Added Subjects Post-Filtering
EA 651 85 9 575
# number of self-report EAs retained
awk '$3=="Kreek_EA"' ea_filtered3 | wc -l
566

# number of self-report HAs reassigned to EA 
awk '$3=="Kreek_HA"' ea_filtered3 | wc -l
5

# number of self-report HAs reassigned to EA 
awk '$3=="Kreek_AA"' ea_filtered3 | wc -l
4

ea_subject_ids_5_percent.keep.txt

jaamarks commented 5 years ago

After some experimenting, I was able to reduce the inflation somewhat, image, in the EAs by applying an even more stringent ancestral threshold during the Structure analysis:

Action Description Thresholding Criteria
For EA retainment (AFR < 5%) ∧ (EAS < 5%)
EA Triangle Plot
image




Structure Analysis Stats

Ancestry Subjects Pre-Filtering Subjects Removed Subjects Added Subjects Post-Filtering
EA 651 85 9 575




Manhattan QQ
image image

Note this is with all PCs included in the model as covariates. oaall ~ SNP + sex + age + PC1 + PC2 + ... + PC9 + PC10




We do include 57 fewer EA subjects in the GWAS though—556 compared to 614. So I don't know if we want to go this route.

jaamarks commented 5 years ago

Eric Johnson has indicated that we will go with these more stringent thresholds, and use these GWAS results in the NGC OAall meta-analysis. See GitHub issue #53.

jaamarks commented 5 years ago

The results+data are now on s3.

imputed data


GWAS results