Closed jaamarks closed 5 years ago
In the overarching Bioinformatics Github Issue #53, on October 3, 2017 Eric Johnson describes that wants to combine the phenotype variables opioid_dep
and opioid_abuse
into one variable named opioid_abdp
. I need to follow this description for the next GWAS I run. Before, I was only using the opioid_dep
variable.
Recap of latest plan of action: Because the lambda value for the Kreek_EA group was still inflated after adding all 10 PCs, we are going back in to the quality control procedures and being more strict with the Structure plot ancestral thresholds. Before, the thresholds we used were the traditional---
Action Description | Thresholding Criteria |
---|---|
For AA retainment | (AFR > 25%) ∧ (EAS < 25%) |
For EA retainment | (AFR < 25%) ∧ (EAS < 25%) |
These were strictly the ancestral thresholds (ignoring the self-report). I can apply a more stringent cutoff for the EAs
Action Description | Thresholding Criteria |
---|---|
For AA retainment | (AFR > 25%) ∧ (EAS < 25%) |
For EA retainment | (AFR < 10%) ∧ (EAS < 10%) |
and see if this cleans up the plots.
This is what the filtered triangle plot looked like before:
Currently (20181116) I am working on redoing the QC for Kreek and cleaning up the EA ancestry group. I will then submit the data to the Michigan Imputation Server (MIS) for imputation. Once that is finished, I will redo the GWAS for both AA and EA and include as a covariate the newly defined variable opioid_abdp
. From there, I can assess whether I need to add the covariates cocaine_abdp
and alcohol_abdp
.
phenoD <- "/shared/data/studies/heroin/kreek/phenotype/"
in_fname <- "unprocessed/phs001109.v1.pht005447.v1.p1.c1.Addictive_Diseases_Subject_Phenotypes.DS-ADX-IRB-NPU.txt"
out_fname <- "processing/kreek.phenotype_table"
phenoF <- read.table(paste(phenoD, in_fname, sep=""), sep='\t', header = T)
filtered_phen <- phenoF[c('SUBJECT_ID', 'age', 'sex', 'opioid_dep', 'family_race', 'opioid_abuse')]
nrow(filtered_phen[filtered_phen$family_race=='White' & filtered_phen$opioid_dep==1 & filtered_phen$opioid_abuse==1,])
variable | EA_cases | EA_controls | AA_cases | AA_controls |
---|---|---|---|---|
opioid_dep | 429 | 224 | 215 | 384 |
opioid_abuse | 298 | 355 | 198 | 401 |
opioid_abdp | 430 | 221 | 231 | 367 |
Thresholds used to generate the triangle plots for Kreek_EA and Kreek_AA.
Action Description | Thresholding Criteria |
---|---|
For AA retainment | (AFR > 25%) ∧ (EAS < 25%) |
For EA retainment | (AFR < 10%) ∧ (EAS < 10%) |
Pre-filtered | Post-filtered |
---|---|
A breakdown of the filtering results is in the table below. Note, these counts are different from before. In the previous QC processing I had included subjects with self-reported ancestry of AA, EA, or HA. Some of the HA subjects were reassigned to the AA or EA group after the Structure analysis. Then we proceeded with only the AA and EA ancestry groups. During this round of processing, I began only with subjects having self-reported ancestry of AA or EA. Therefore, the final counts are fewer after ancestry reassignments this time because there were no self-reported HA subjects reassigned to AAs or EAs, and also because the EA ancestry threshold is more stringent this round.
Ancestry | Subjects Pre-Filtering | Subjects Removed | Subjects Added | Subjects Post-Filtering |
---|---|---|---|---|
EA | 651 | 36 | 4 | 619 |
AA | 598 | 7 | 8 | 599 |
If this all sounds reasonable, I will continue with the QC processing of Kreek_AA and Kreek_EA.
Here are the triangle plots for the AAs and EAs. Below is also the pre-filtered self-reported HA triangle plot. I processed the self reported AAs, EAs, and HAs. Self-reported ancestry was then disregarded during the filtering so that any HA subject that was clearly AA or EA was reassigned. The filtering threshold used were:
Action Description | Thresholding Criteria |
---|---|
For AA retainment | (AFR > 25%) ∧ (EAS < 25%) |
For EA retainment | (AFR < 10%) ∧ (EAS < 10%) |
Pre-filtered | Post-filtered |
---|---|
Pre-filtered | Post-filtered |
---|---|
Here is a breakdown of the filtering results.
Ancestry | Subjects Pre-Filtering | Subjects Removed | Subjects Added | Subjects Post-Filtering |
---|---|---|---|---|
EA | 651 | 35 | 18 | 634 |
AA | 598 | 7 | 125 | 716 |
Genotype QC for the Kreek data set is finished. The quality control statistics are below. We will now proceed with the oaall GWAS of Kreek_EA in an attempt to lower the inflated lambda value seen earlier.
The table below provides statistics on variants and subjects filtered during each step of the QC process.
Note: Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers may not match between the .fam files and the phenotype files
This table includes filtering statistics prior to merging with chrX.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
* EA subject filter w/initial procedures (all chr) | 0 | 565,791 | 1,004 | 0 | 651 |
* Partitioning to only autosomes | 13,926 | 551,865 | 0 | 0 | 651 |
* Remove subjects missing whole autosome data | 0 | 551,865 | 0 | 0 | 651 |
* Duplicate rsID filtering | 0 | 551,865 | 0 | 0 | 651 |
* Strand flipping and polymorphic SNP filter | 7,173 | 544,692 | 0 | 0 | 651 |
* Remove/Reassign ancestral outliers | 0 | 544,692 | 35 | 18 | 634 |
* Remove variants with missing call rate > 3% | 10,907 | 533,785 | 0 | 0 | 634 |
* Remove variants with HWE p < 0.0001 | 914 | 532,871 | 0 | 0 | 634 |
This table includes filtering statistics prior to merging with autosomes.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
* EA subject filter w/initial procedures (all chr) | 0 | 565,791 | 1,004 | 0 | 651 |
* Partitioning to only chrX | 551,865 | 13,926 | 0 | 0 | 651 |
* Remove subjects missing whole chrX data | 0 | 13,926 | 0 | 0 | 651 |
* Duplicate rsID filtering | 0 | 13,926 | 0 | 0 | 651 |
* Remove/Reassign ancestral outliers | 0 | 13,926 | 35 | 18 | 634 |
* Remove variants with missing call rate > 3% | 194 | 13,732 | 0 | 0 | 634 |
* Remove variants with HWE p < 0.0001 | 1 | 13,731 | 0 | 0 | 634 |
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
* Merge autosomes and chrX | 0 | 546,602 | 0 | 0 | 634 |
* Remove subjects with IBD > 0.4 or IBS > 0.9 | 0 | 546,602 | 17 | 0 | 617 |
* Remove subjects with missing call rate > 3% | 0 | 546,602 | 3 | 0 | 614 |
* Sex discordance filter | 0 | 546,602 | 0 | 0 | 614 |
* Excessive homozygosity filter | 0 | 546,602 | 0 | 0 | 614 |
* Duplicate variant ID filter after 1000G renaming | 1 | 546,601 | 0 | 0 | 614 |
* Remove ambiguous SNPs | 36,549 | 510,052 | 0 | 0 | 614 |
The table below provides statistics on variants and subjects filtered during each step of the QC process.
Note: Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers may not match between the .fam files and the phenotype files
This table includes filtering statistics prior to merging with chrX.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
* AA subject filter w/initial procedures (all chr) | 0 | 565,791 | 1,057 | 0 | 598 |
* Partitioning to only autosomes | 13,926 | 551,865 | 0 | 0 | 598 |
* Remove subjects missing whole autosome data | 0 | 551,865 | 0 | 0 | 598 |
* Duplicate rsID filtering | 0 | 551,865 | 0 | 0 | 598 |
* Strand flipping and polymorphic SNP filter | 7,173 | 544,692 | 0 | 0 | 598 |
* Reassign ancestral outliers | 0 | 544,692 | 7 | 125 | 716 |
* Remove variants with missing call rate > 3% | 17,129 | 527,563 | 0 | 0 | 716 |
* Remove variants with HWE p < 0.0001 | 2,453 | 525,110 | 0 | 0 | 716 |
This table includes filtering statistics prior to merging with autosomes.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
* AA subject filter w/initial procedures (all chr) | 0 | 565,791 | 1,057 | 0 | 598 |
* Partitioning to only chrX | 551,865 | 13,926 | 0 | 0 | 598 |
* Remove subjects missing whole chrX data | 0 | 13,926 | 0 | 0 | 598 |
* Duplicate rsID filtering | 0 | 13,926 | 0 | 0 | 598 |
* Remove/reassign ancestral outliers | 0 | 13,926 | 7 | 125 | 716 |
* Remove variants with missing call rate > 3% | 350 | 13,576 | 0 | 0 | 716 |
* Remove variants with HWE p < 0.0001 | 23 | 13,553 | 0 | 0 | 716 |
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
* Merge autosomes and chrX | 0 | 538,663 | 0 | 0 | 716 |
* Remove subjects with IBD > 0.4 or IBS > 0.9 | 0 | 538,663 | 23 | 0 | 693 |
* Remove subjects with missing call rate > 3% | 0 | 538,663 | 3 | 0 | 690 |
* Sex discordance filter | 0 | 538,663 | 0 | 0 | 690 |
* Excessive homozygosity filter | 0 | 538,663 | 0 | 0 | 690 |
* Duplicate variant ID filter after 1000G renaming | 1 | 538,662 | 0 | 0 | 690 |
* Remove ambiguous SNPs | 35,542 | 503,120 | 0 | 0 | 690 |
Kreek QC, imputation, & baseline OAall GWAS now complete—results/details below. EAs still exhibit inflation (lambda=1.176). AAs are copacetic.
QC Updates: This round had updates for both EA and AA ancestry groups. In the first round of QC (finished June '18) self-reported EA, AA, and HA were to be processed. It was then decided that we should disregard self-reported ancestry and reassign subjects to the ancestry their genotypes most closely align with. Though this was the intention, I don't believe the self-reported HA subjects were actually reassigned—I think they were excluded altogether. This is evident upon examining the QC summary statistics from the first round. It is most apparent when comparing the final subject counts for the AA subjects. So for this latest round of QC, I have:
GWAS phenotype updates: In the previous Kreek OAall GWAS I used the variable opioid_dep
as the phenotype of interest in the model. After investigating some of the earlier posts about the Kreek data, I learned that I should define a new variable opioid_abdp
which combines the opioid_dep
(dependence) and opioid_ab
(abuse) variables. So,
opioid_abdp
variable is the phenotype of interest used in this latest GWAS roundAncestry | Variant Count | Subject Count |
---|---|---|
AA | 501,495 | 576 |
EA | 509,619 | 624 |
Ancestry | Variant Count | Subject Count |
---|---|---|
AA | 503,120 | 690 |
EA | 510,052 | 614 |
I included PC1 & PC10 in the model. These PCs explain ~75% of the variance.
I included PC1, PC3, and PC5 in the model. These PCs explain ~77% of the variance.
oaall ~ SNP + sex + age + PC1 + PC10
Manhattan | |
---|---|
oaall ~ SNP + sex + age + PC1 + PC3 + PC5
Manhattan | |
---|---|
I also ran a GWAS for Kreek_EA with all of the top 10 PCs. This did not reduce inflation, in fact, it actually increased the lambda value slightly to 1.183 from 1.176. Here are those plots
oaall ~ SNP + sex + age + PC1 + PC2 + ... + PC9 + PC10
Manhattan | |
---|---|
Since there is still inflation exhibited by the EAs, another possible strategy is to apply an even more stringent ancestral threshold---in particular, apply the following thresholds and push through QC and GWAS. Note, that I do not have to impute those subjects again. I simply need the IDs of the subjects to keep. Attached are those subjects IDs of the subjects to keep in the triangle plot.
Action Description | Thresholding Criteria |
---|---|
For EA retainment | (AFR < 5%) ∧ (EAS < 5%) |
Ancestry | Subjects Pre-Filtering | Subjects Removed | Subjects Added | Subjects Post-Filtering |
---|---|---|---|---|
EA | 651 | 85 | 9 | 575 |
# number of self-report EAs retained
awk '$3=="Kreek_EA"' ea_filtered3 | wc -l
566
# number of self-report HAs reassigned to EA
awk '$3=="Kreek_HA"' ea_filtered3 | wc -l
5
# number of self-report HAs reassigned to EA
awk '$3=="Kreek_AA"' ea_filtered3 | wc -l
4
After some experimenting, I was able to reduce the inflation somewhat, , in the EAs by applying an even more stringent ancestral threshold during the Structure analysis:
Action Description | Thresholding Criteria |
---|---|
For EA retainment | (AFR < 5%) ∧ (EAS < 5%) |
EA Triangle Plot |
---|
Ancestry | Subjects Pre-Filtering | Subjects Removed | Subjects Added | Subjects Post-Filtering |
---|---|---|---|---|
EA | 651 | 85 | 9 | 575 |
Manhattan | |
---|---|
Note this is with all PCs included in the model as covariates.
oaall ~ SNP + sex + age + PC1 + PC2 + ... + PC9 + PC10
We do include 57 fewer EA subjects in the GWAS though—556 compared to 614. So I don't know if we want to go this route.
Eric Johnson has indicated that we will go with these more stringent thresholds, and use these GWAS results in the NGC OAall meta-analysis. See GitHub issue #53.
The results+data are now on s3.
imputed data
s3://rti-heroin/kreek/data/genotype/imputed/20181128/{aa,ea}
GWAS results
s3://rti-heroin/kreek/results/ea/oaall/20181201
s3://rti-heroin/kreek/results/aa/oaall/20181128
The original GWAS was done in July of 2018 by me (Jesse Marks). The AA lambda value was good (1.011), but the EAs had a significant amount of inflation (1.223). For the EAs, two PCs were included in the model (1 & 6) and they explained ~78% of the variance. In an attempt to reduce the inflation we added the top 10% PCs to the model. This did reduce the inflation some, however the lambda value is still inflated as you can see below. Refer to the [GitHub issue #53] for more information.
EA