jaamarks commented 5 years ago

The original GWAS was done in July of 2018 by me (Jesse Marks). The AA lambda value was good (1.011), but the EAs had a significant amount of inflation (1.223). For the EAs, two PCs were included in the model (1 & 6) and they explained ~78% of the variance. In an attempt to reduce the inflation we added the top 10% PCs to the model. This did reduce the inflation some, however the lambda value is still inflated as you can see below. Refer to the [GitHub issue #53] for more information.

EA

kreek ea 1000g oaall maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels manhattan

jaamarks commented 5 years ago

In the overarching Bioinformatics Github Issue #53, on October 3, 2017 Eric Johnson describes that wants to combine the phenotype variables opioid_dep and opioid_abuse into one variable named opioid_abdp. I need to follow this description for the next GWAS I run. Before, I was only using the opioid_dep variable.

Recap of latest plan of action: Because the lambda value for the Kreek_EA group was still inflated after adding all 10 PCs, we are going back in to the quality control procedures and being more strict with the Structure plot ancestral thresholds. Before, the thresholds we used were the traditional---

Action Description	Thresholding Criteria
For AA retainment	(AFR > 25%) ∧ (EAS < 25%)
For EA retainment	(AFR < 25%) ∧ (EAS < 25%)

These were strictly the ancestral thresholds (ignoring the self-report). I can apply a more stringent cutoff for the EAs

Action Description	Thresholding Criteria
For AA retainment	(AFR > 25%) ∧ (EAS < 25%)
For EA retainment	(AFR < 10%) ∧ (EAS < 10%)

and see if this cleans up the plots.

This is what the filtered triangle plot looked like before:

Currently (20181116) I am working on redoing the QC for Kreek and cleaning up the EA ancestry group. I will then submit the data to the Michigan Imputation Server (MIS) for imputation. Once that is finished, I will redo the GWAS for both AA and EA and include as a covariate the newly defined variable opioid_abdp. From there, I can assess whether I need to add the covariates cocaine_abdp and alcohol_abdp.

jaamarks commented 5 years ago

Sample R code

phenoD <- "/shared/data/studies/heroin/kreek/phenotype/"
in_fname <- "unprocessed/phs001109.v1.pht005447.v1.p1.c1.Addictive_Diseases_Subject_Phenotypes.DS-ADX-IRB-NPU.txt"
out_fname <- "processing/kreek.phenotype_table"

phenoF <- read.table(paste(phenoD, in_fname, sep=""), sep='\t', header = T)
filtered_phen <- phenoF[c('SUBJECT_ID', 'age', 'sex', 'opioid_dep', 'family_race', 'opioid_abuse')]

nrow(filtered_phen[filtered_phen$family_race=='White' & filtered_phen$opioid_dep==1 & filtered_phen$opioid_abuse==1,])

variable	EA_cases	EA_controls	AA_cases	AA_controls
opioid_dep	429	224	215	384
opioid_abuse	298	355	198	401
opioid_abdp	430	221	231	367

jaamarks commented 5 years ago

Thresholds used to generate the triangle plots for Kreek_EA and Kreek_AA.

Action Description	Thresholding Criteria
For AA retainment	(AFR > 25%) ∧ (EAS < 25%)
For EA retainment	(AFR < 10%) ∧ (EAS < 10%)

Pre-filtered	Post-filtered

A breakdown of the filtering results is in the table below. Note, these counts are different from before. In the previous QC processing I had included subjects with self-reported ancestry of AA, EA, or HA. Some of the HA subjects were reassigned to the AA or EA group after the Structure analysis. Then we proceeded with only the AA and EA ancestry groups. During this round of processing, I began only with subjects having self-reported ancestry of AA or EA. Therefore, the final counts are fewer after ancestry reassignments this time because there were no self-reported HA subjects reassigned to AAs or EAs, and also because the EA ancestry threshold is more stringent this round.

Ancestry	Subjects Pre-Filtering	Subjects Removed	Subjects Added	Subjects Post-Filtering
EA	651	36	4	619
AA	598	7	8	599

If this all sounds reasonable, I will continue with the QC processing of Kreek_AA and Kreek_EA.

jaamarks commented 5 years ago

Here are the triangle plots for the AAs and EAs. Below is also the pre-filtered self-reported HA triangle plot. I processed the self reported AAs, EAs, and HAs. Self-reported ancestry was then disregarded during the filtering so that any HA subject that was clearly AA or EA was reassigned. The filtering threshold used were:

Action Description	Thresholding Criteria
For AA retainment	(AFR > 25%) ∧ (EAS < 25%)
For EA retainment	(AFR < 10%) ∧ (EAS < 10%)

AA

Pre-filtered	Post-filtered

EA

Pre-filtered	Post-filtered

HA (prefilter)

Here is a breakdown of the filtering results.

Ancestry	Subjects Pre-Filtering	Subjects Removed	Subjects Added	Subjects Post-Filtering
EA	651	35	18	634
AA	598	7	125	716

jaamarks commented 5 years ago

Genotype QC for the Kreek data set is finished. The quality control statistics are below. We will now proceed with the oaall GWAS of Kreek_EA in an attempt to lower the inflated lambda value seen earlier.

European Ancestry

The table below provides statistics on variants and subjects filtered during each step of the QC process.

Note: Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers may not match between the .fam files and the phenotype files

Autosome statistics

This table includes filtering statistics prior to merging with chrX.

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects added	Subjects retained
* EA subject filter w/initial procedures (all chr)	0	565,791	1,004	0	651
* Partitioning to only autosomes	13,926	551,865	0	0	651
* Remove subjects missing whole autosome data	0	551,865	0	0	651
* Duplicate rsID filtering	0	551,865	0	0	651
* Strand flipping and polymorphic SNP filter	7,173	544,692	0	0	651
* Remove/Reassign ancestral outliers	0	544,692	35	18	634
* Remove variants with missing call rate > 3%	10,907	533,785	0	0	634
* Remove variants with HWE p < 0.0001	914	532,871	0	0	634

ChrX statistics

This table includes filtering statistics prior to merging with autosomes.

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects added	Subjects retained
* EA subject filter w/initial procedures (all chr)	0	565,791	1,004	0	651
* Partitioning to only chrX	551,865	13,926	0	0	651
* Remove subjects missing whole chrX data	0	13,926	0	0	651
* Duplicate rsID filtering	0	13,926	0	0	651
* Remove/Reassign ancestral outliers	0	13,926	35	18	634
* Remove variants with missing call rate > 3%	194	13,732	0	0	634
* Remove variants with HWE p < 0.0001	1	13,731	0	0	634

Merged autosome and chrX statistics

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
* Merge autosomes and chrX	0	546,602	0	634
* Remove subjects with IBD > 0.4 or IBS > 0.9	0	546,602	17	617
* Remove subjects with missing call rate > 3%	0	546,602	3	614
* Sex discordance filter	0	546,602	0	614
* Excessive homozygosity filter	0	546,602	0	614
* Duplicate variant ID filter after 1000G renaming	1	546,601	0	614
* Remove ambiguous SNPs	36,549	510,052	0	614

African Ancestry

The table below provides statistics on variants and subjects filtered during each step of the QC process.

Note: Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers may not match between the .fam files and the phenotype files

Autosome statistics

This table includes filtering statistics prior to merging with chrX.

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects added	Subjects retained
* AA subject filter w/initial procedures (all chr)	0	565,791	1,057	0	598
* Partitioning to only autosomes	13,926	551,865	0	0	598
* Remove subjects missing whole autosome data	0	551,865	0	0	598
* Duplicate rsID filtering	0	551,865	0	0	598
* Strand flipping and polymorphic SNP filter	7,173	544,692	0	0	598
* Reassign ancestral outliers	0	544,692	7	125	716
* Remove variants with missing call rate > 3%	17,129	527,563	0	0	716
* Remove variants with HWE p < 0.0001	2,453	525,110	0	0	716

ChrX statistics

This table includes filtering statistics prior to merging with autosomes.

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects added	Subjects retained
* AA subject filter w/initial procedures (all chr)	0	565,791	1,057	0	598
* Partitioning to only chrX	551,865	13,926	0	0	598
* Remove subjects missing whole chrX data	0	13,926	0	0	598
* Duplicate rsID filtering	0	13,926	0	0	598
* Remove/reassign ancestral outliers	0	13,926	7	125	716
* Remove variants with missing call rate > 3%	350	13,576	0	0	716
* Remove variants with HWE p < 0.0001	23	13,553	0	0	716

Merged autosome and chrX statistics

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
* Merge autosomes and chrX	0	538,663	0	716
* Remove subjects with IBD > 0.4 or IBS > 0.9	0	538,663	23	693
* Remove subjects with missing call rate > 3%	0	538,663	3	690
* Sex discordance filter	0	538,663	0	690
* Excessive homozygosity filter	0	538,663	0	690
* Duplicate variant ID filter after 1000G renaming	1	538,662	0	690
* Remove ambiguous SNPs	35,542	503,120	0	690

jaamarks commented 5 years ago

Kreek QC, imputation, & baseline OAall GWAS now complete—results/details below. EAs still exhibit inflation (lambda=1.176). AAs are copacetic.

QC Updates: This round had updates for both EA and AA ancestry groups. In the first round of QC (finished June '18) self-reported EA, AA, and HA were to be processed. It was then decided that we should disregard self-reported ancestry and reassign subjects to the ancestry their genotypes most closely align with. Though this was the intention, I don't believe the self-reported HA subjects were actually reassigned—I think they were excluded altogether. This is evident upon examining the QC summary statistics from the first round. It is most apparent when comparing the final subject counts for the AA subjects. So for this latest round of QC, I have:

Disregarded self-report ancestry and reassigned subjects to there closest ancestry group (including self-report HAs this time)
imposed a more stringent ancestral threshold for the EA group

GWAS phenotype updates: In the previous Kreek OAall GWAS I used the variable opioid_dep as the phenotype of interest in the model. After investigating some of the earlier posts about the Kreek data, I learned that I should define a new variable opioid_abdp which combines the opioid_dep (dependence) and opioid_ab (abuse) variables. So,

the opioid_abdp variable is the phenotype of interest used in this latest GWAS round

First round of QC (June '18)

Ancestry	Variant Count	Subject Count
AA	501,495	576
EA	509,619	624

Latest round of QC

Ancestry	Variant Count	Subject Count
AA	503,120	690
EA	510,052	614

AA PCA Plots

I included PC1 & PC10 in the model. These PCs explain ~75% of the variance.

EA PCA Plots

I included PC1, PC3, and PC5 in the model. These PCs explain ~77% of the variance.

AA GWAS Plots

oaall ~ SNP + sex + age + PC1 + PC10

Manhattan	QQ

EA GWAS Plots

oaall ~ SNP + sex + age + PC1 + PC3 + PC5

Manhattan	QQ

I also ran a GWAS for Kreek_EA with all of the top 10 PCs. This did not reduce inflation, in fact, it actually increased the lambda value slightly to 1.183 from 1.176. Here are those plots

EA GWAS Plots (all PCs included)

oaall ~ SNP + sex + age + PC1 + PC2 + ... + PC9 + PC10

Manhattan	QQ

jaamarks commented 5 years ago

Since there is still inflation exhibited by the EAs, another possible strategy is to apply an even more stringent ancestral threshold---in particular, apply the following thresholds and push through QC and GWAS. Note, that I do not have to impute those subjects again. I simply need the IDs of the subjects to keep. Attached are those subjects IDs of the subjects to keep in the triangle plot.

Action Description	Thresholding Criteria
For EA retainment	(AFR < 5%) ∧ (EAS < 5%)

EA Triangle Plot

Ancestry	Subjects Pre-Filtering	Subjects Removed	Subjects Added	Subjects Post-Filtering
EA	651	85	9	575

# number of self-report EAs retained
awk '$3=="Kreek_EA"' ea_filtered3 | wc -l
566

# number of self-report HAs reassigned to EA 
awk '$3=="Kreek_HA"' ea_filtered3 | wc -l
5

# number of self-report HAs reassigned to EA 
awk '$3=="Kreek_AA"' ea_filtered3 | wc -l
4

ea_subject_ids_5_percent.keep.txt

jaamarks commented 5 years ago

After some experimenting, I was able to reduce the inflation somewhat, , in the EAs by applying an even more stringent ancestral threshold during the Structure analysis:

Action Description	Thresholding Criteria
For EA retainment	(AFR < 5%) ∧ (EAS < 5%)

EA Triangle Plot

Structure Analysis Stats

Ancestry	Subjects Pre-Filtering	Subjects Removed	Subjects Added	Subjects Post-Filtering
EA	651	85	9	575

Manhattan	QQ

Note this is with all PCs included in the model as covariates. oaall ~ SNP + sex + age + PC1 + PC2 + ... + PC9 + PC10

We do include 57 fewer EA subjects in the GWAS though—556 compared to 614. So I don't know if we want to go this route.

jaamarks commented 5 years ago

Eric Johnson has indicated that we will go with these more stringent thresholds, and use these GWAS results in the NGC OAall meta-analysis. See GitHub issue #53.

jaamarks commented 5 years ago

The results+data are now on s3.

imputed data

s3://rti-heroin/kreek/data/genotype/imputed/20181128/{aa,ea}

GWAS results

s3://rti-heroin/kreek/results/ea/oaall/20181201
s3://rti-heroin/kreek/results/aa/oaall/20181128

jaamarks / jaamarks_notebooks

Kreek EA—revisit #4

EA

Sample R code

AA

EA

HA (prefilter)

European Ancestry

Autosome statistics

ChrX statistics

Merged autosome and chrX statistics

African Ancestry

Autosome statistics

ChrX statistics

Merged autosome and chrX statistics

First round of QC (June '18)

Latest round of QC

AA PCA Plots

EA PCA Plots

AA GWAS Plots

EA GWAS Plots

EA GWAS Plots (all PCs included)

EA Triangle Plot

Structure Analysis Stats