jaamarks commented 5 years ago

We are starting from scratch with VIDUS and going to start with QC and push these data all the way through to a baseline GWAS as well as GxSex GWAS.

The totally unprocessed results are here:

s3://rti-midas-data/studies/vidus/observed/unprocessed/Data_Transfer_12SEP2017/Reports/Genotype_Reports/Eric_Johnson_OmniExpress_re-analysis_A1_Manifest_GENOTYPE-REPORT_AUG2017.txt

However, if you want to start with the bed/bim/fam files that I generated from the unprocessed data, they are here:

s3://rti-midas-data/studies/vidus/observed/processing/vidus.bed
s3://rti-midas-data/studies/vidus/observed/processing/vidus.bim
s3://rti-midas-data/studies/vidus/observed/processing/vidus.fam

jaamarks commented 5 years ago

Prefiltered (994 subjects)

Filtered

Ancestry	Subject Count	Retainment Thresholds
EA	971	(AFR < 25%) ∧ (EAS < 25%)
AA	19	(AFR > 25%) ∧ (EAS < 25%)
HA	3	(AFR < 25%) ∧ (EAS > 25%)

Note: 1 subject not assigned

ID	study	% AFR	% EAS	% EUR
8005964970_	VIDUS	0.349	0.272	0.379

jaamarks commented 5 years ago

Below are the QC summary stats report. The tables below provide summary statistics on variants and subjects filtered during each step of the QC process.

European Ancestry

Initial quality control sample tracking

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
Initial dbGaP dataset	0	713,599	0	1,014
Remove positive controls	0	713,599	20	994
Duplicate rsID filtering	0	713,599	0	994
Genome build 37 & b138 update	1,377	712,222	0	994

Autosome statistics

This table includes filtering statistics prior to merging with chrX.

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects added	Subjects retained
EA subject filter w/initial procedures (all chr)	0	712,222	0	0	994
Partitioning to only autosomes	19,227	692,995	0	0	994
Remove subjects missing whole autosome data	0	692,995	0	0	994
Add + reassigned subjects based on STRUCTURE	0	692,995	23	0	971
Strand flipping and polymorphic SNP filter	0	692,995	0	0	971
Remove variants with missing call rate > 3%	8,731	684,264	0	0	971
Remove variants with HWE p < 0.0001					971

ChrX statistics

This table includes filtering statistics prior to merging with autosomes.

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
EA subject filter w/initial procedures (all chr)	0	712,222	0	994
Partitioning to only chrX	694,249	17,973	0	994
Add + reassigned subjects based on STRUCTURE	0	17,973	23	971
Remove subjects missing whole chrX data	0	17,973	0	971
Duplicate rsID filtering	0	17,973	0	971
Remove variants with missing call rate > 3%	1,094	16,879	0	971
Remove variants with HWE p < 0.0001	13,989	2,890	0	971

Merged autosome and chrX statistics

QC procedure	Variants retained	Subjects removed	Subjects retained
Merge autosomes and chrX	686,744	0	971
Remove subjects with IBD > 0.4 or IBS > 0.9	686,744	21	950
Remove subjects with missing call rate > 3%	686,744	10	940
Sex discordance filter	686,744	0	940
Excessive homozygosity filter	686,744	0	940
Duplicate variant ID filter after 1000G renaming	686,744	0	940

Theses results have been uploaded to S3: s3://rti-hiv/vidus/data/genotype/original/final

jaamarks commented 5 years ago

Going to rerun QC, triangle plot looks quite different from Nathan's and my chrX needs reprocessing - HWE should only be run on females.

jaamarks commented 5 years ago

The plots below indicate that the data going into the STRUCTURE analysis are the same.

jaamarks commented 5 years ago

Projecting the high-dimensional genome-wide variant data into 3d subspace. 1000G vs VIDUS.

jaamarks commented 5 years ago

STRUCTURE analysis.

HapMap3

afr_eas_eur_vidus_hapmap3_20190131

1000g

HapMap3 comparison

Jesse's	Original

jaamarks commented 5 years ago

Comparing the A1 frequency of my data to Nathan's data.

jaamarks commented 5 years ago

Note that in the latest run of the STRUCTURE analysis, I include one more subject in the analysis that was indicated to be removed in earlier analyses.

The subject is:

num     ID      pop     AFR      EAS       EUR
2587    8005964586_     vidus_EA        0.240   0.097   0.663

In a previous analysis this EA subject was classified as an AA an was thus filtered out:

num     ID      pop     AFR      EAS       EUR
2587    8005964586_     vidus_AA        0.254   0.132   0.614

As you can see, this subject is right on the cusp. I am choosing to include this subject in the post-STRUCTURE QC.

The full ID is: 8005964586_NA0054435_93-0328_8005964586_NA0054435_93-0328

jaamarks commented 5 years ago

Note that I did not have the sex discrepancies that were an issue in the original QC ran by Nathan G.

jaamarks commented 5 years ago

Sample 8005964634_NA0054172_93-0164 was removed during the ibs_gt_0.9_ibd_gt_0.4 filter.

jaamarks commented 5 years ago

vidus ea 1000g hiv_acq maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels manhattan

jaamarks commented 5 years ago

QC Overview

Below I cover some of the issues I have encountered while QCing the VIDUS genotype data.

STRUCTURE analysis troubleshooting
Sex Discrepancies

STRUCTURE analysis troubleshooting

Because of the dubious looking STRUCTURE plots produced when comparing VIDUS to 1000G phase 3 subjects, we redid the STRUCTURE analysis by comparing VIDUS instead to HapMap3 subjects. Below is the resulting STRUCTURE plot. We also performed a PCA on the ld-pruned variants for VIDUS & 1000G, results below. Furthermore, we compared the MAF for chr22 of my processed data and the processed data of @ngaddis at the point directly before the STRUCTURE analysis. These data were in perfect agreement, as well as the A1 frequencies I also compared.

After validating that our data were the same going into the STRUCTURE analysis, I proceeded. Our STRUCTURE analyses identified the same individuals to be removed, except for one subject. I included one more subject 8005964586_NA0054435_93-0328. This subject was on the cusp of being removed but narrowly past the cutoff thresholds for being labeled EA, so they were retained.

Ancestry	Retainment Thresholds
EA	(AFR < 25%) ∧ (EAS < 25%)

num     ID      pop     AFR      EAS       EUR
2587    8005964586_     vidus_EA        0.240   0.097   0.663

Sex discrepancies

I did not experience any sex discrepancies whereas there were reported discrepancies before. After digging back through earlier comments in this GitHub Issue, I noted that @ngaddis had encountered sex discrepancies. In particular, he noted the following subjects had discrepant sexes

They were subsequently removed from any further analyses. I did not encounter this same issue. The thresholds used to verify that self-reported gender status matches with the genetic information are

female	male
F < 0.2	F > 0.8

Here are the results from my analysis of those 7 subjects:

Subject ID	Self Report	F
8005964966_NA0054149_93-0143	Female	-0.01959
8005964978_NA0054150_93-0144	Male	0.9893
8015124413_NA0055272_93-0623	Male	0.9827
8015124401_NA0055273_93-0624	Female	-0.03714
8015124354_NA0055285_93-0628	Male	0.9861
8005965380_NA0055556_93-0720	Male	0.9888
8005964955_NA0056463_93-1068	Female	0.0587

I obtained this information from the file on S3 at s3://rti-midas-data/studies/vidus/observed/processing/Eric_Johnson_OmniExpress_re-analysis_A1_Manifest_QC-REPORT_AUG2017.txt

@ngaddis I didn't see in your methods file where you had applied the sex filter. Could you point me to a methods file you used to perform the sex check? Then I could compare our processing steps.

jaamarks commented 5 years ago

A couple errors were found after the GWAS was performed on the VIDUS data, therefore the GWAS was repeated. One error was in the phenotype file (PED file) appended to the rvtests for the GWAS. It was not constructed accurately. In particular, the age column contained the wrong data. This was corrected. The other error found was in selecting the model for rvtest. The initial run I performed had linear model selected when it should have been a logistic model since the case/control status of HIV acquisition is a binary variable. I corrected this as well.

The chrX data was also analyzed during this run. Below you will see the latest results. There was once SNP whose P-value was beyond the significance threshold of 5E-8. This SNP is rs58194125, and is in the gene DSCAM. You can find more information on this SNP here.

CHROM   POS     REF     ALT     N_INFORMATIVE   AF      INFORMATIVE_ALT_AC      CALL_RATE       HWE_PVALUE      N_REF   N_HET   N_ALT   U_STAT  SQRT_V_STAT    ALT_EFFSIZE     PVALUE
21      42042450        G       A       938     0.0154051:0.0312787:0.00954234  28.9:15.827:13.073      1:1:1   1:1:1   920:239:681     18:14:4 0:0:0 8.62447  1.57832 3.46212 4.6465e-08

vidus ea 1000g oaall maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels manhattan

jaamarks commented 5 years ago

Results comparison: my GWAS results using rvtests and Fang's GWAS results using ProbAbel. Note that Fang's plots were retrieved from S3 at the paths:

s3://rti-hiv/vidas_data/results/vidus.ea.palogist.status~snp+gender+age+EV.plots.snps+indels.manhattan.png 
s3://rti-hiv/vidas_data/results/vidus.ea.palogist.status~snp+gender+age+EV.plots.snps+indels.qq.png

My results (RVTESTS)	Fang's results (ProbAbel)

jaamarks commented 5 years ago

Note that my males and females are coded the same as those of Fang's.

original data (GWAS-Cohort-n938_passed_g_qc_only_HIV\ status_baseline.csv)

gwas_code,female,hiv,ageatint
1,0,0,55
2,0,1,59
3,0,1,55
4,0,1,54
5,0,1,40
6,1,0,35
7,0,0,56
8,0,0,51

My phenotype file

IID HIV_ACQ AGE SEX PC9 PC4 PC1
8005965211_NA0055425_93-0001_8005965211_NA0055425_93-0001 0 55 0 -0.0181 -0.0017 -0.0073
8005964642_NA0053910_93-0002_8005964642_NA0053910_93-0002 1 59 0 0.0041 -0.0088 -0.0053
8005964654_NA0053915_93-0003_8005964654_NA0053915_93-0003 1 55 0 0.0191 -0.0025 4e-04
8005964666_NA0053918_93-0004_8005964666_NA0053918_93-0004 1 54 0 0.0103 -8e-04 0.0088
8005964583_NA0053919_93-0005_8005964583_NA0053919_93-0005 1 40 0 0.0068 -0.0033 0.0134
8005964584_NA0053917_93-0006_8005964584_NA0053917_93-0006 0 35 1 0.021 9e-04 -0.0084
8005964667_NA0053921_93-0007_8005964667_NA0053921_93-0007 0 56 0 -0.0479 -0.0039 0.1496
8005964619_NA0053923_93-0008_8005964619_NA0053923_93-0008 0 51 0 -0.0047 -8e-04 -0.0118

Fang's phenotype file

iid     hiv     gender  age     pc6     pc1     pc9     pc10
8005965211_NA0055425_93-0001_8005965211_NA0055425_93-0001       0       0       55      -0.0009 -0.0029 -0.0455 0.0017
8005964642_NA0053910_93-0002_8005964642_NA0053910_93-0002       1       0       59      -0.0252 -0.0038 -0.0610 -0.0029
8005964654_NA0053915_93-0003_8005964654_NA0053915_93-0003       1       0       55      -0.0031 0.0006  0.0259  -0.0023
8005964666_NA0053918_93-0004_8005964666_NA0053918_93-0004       1       0       54      -0.0216 -0.0165 -0.0121 -0.0089
8005964583_NA0053919_93-0005_8005964583_NA0053919_93-0005       1       0       40      0.0996  0.2773  0.1130  0.1389
8005964584_NA0053917_93-0006_8005964584_NA0053917_93-0006       0       1       35      0.0024  -0.0125 0.0133  0.0105
8005964667_NA0053921_93-0007_8005964667_NA0053921_93-0007       0       0       56      0.0046  0.0143  -0.0326 -0.0088
8005964619_NA0053923_93-0008_8005964619_NA0053923_93-0008       0       0       51      0.0149  -0.0021 -0.0100 -0.0151

The phenotype files are in agreement with regards to sex. Females are coded as 1 and males coded as 0.

jaamarks commented 5 years ago

Removing sex discrepant samples: 7 samples had discordant sex classifications between the genotype and phenotype data. See Nathan's post in GitHub issue #42. Because of this, I am rerunning the baseline GWAS and the 2df (GxSex) GWAS with those samples removed. This brings up an important question regarding the PCs included in the model as covariates: is it OK to perform the GWAS by simply excluding these individuals and using the same model (i.e. include the same PCs as covariates)? Wouldn't we technically need to exclude those subjects from the original genotype that was then filtered by LD-pruning and subsequently used in Eigenstrat smartpca to obtain the PCs to include in the model as covariates? I think the vector space would be different since we are excluding those 7 samples, however would it be enough to have a significant impact on the PCs used in the model. I will consult with colleagues about this.

ngaddis commented 5 years ago

We typically regenerate the EVs using just the individuals that are being analyzed.

From: jaamarks notifications@github.com Sent: Thursday, February 21, 2019 11:11 AM To: RTIInternational/jaamarks_notebooks Cc: Gaddis, Nathan; Mention Subject: Re: [RTIInternational/jaamarks_notebooks] VIDUS: QC + GWAS (#6)

Removing sex discrepant samples: 7 samples had discordant sex classifications between the genotype and phenotype data. See Nathan's post in GitHub issue #42https://github.com/RTIInternational/bioinformatics/issues/42#issuecomment-465608600. Because of this, I am rerunning the baseline GWAS and the 2df (GxSex) GWAS with those samples removed. This brings up an important question regarding the PCs included in the model as covariates: is it OK to perform the GWAS by simply excluding these individuals and using the same model (i.e. include the same PCs as covariates)? Wouldn't we technically need to exclude those subjects from the original genotype that was then filtered by LD-pruning and subsequently used in Eigenstrat smartpca to obtain the PCs to include in the model as covariates? I think the vector space would be different since we are excluding those 7 samples, however would it be enough to have a significant impact on the PCs used in the model. I will consult with colleagues about this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/RTIInternational/jaamarks_notebooks/issues/6#issuecomment-466059624, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AJskEf_R1p_CKXZJN8khVZW3b3f4b10Cks5vPsUwgaJpZM4Z75j1.

jaamarks commented 5 years ago

7 sex-discrepant samples were removed (N=931)
EVs were regenerated per Nathan's recommendation (new EVs: 1, 5, 9 with PVE~88)
1df & 2df HIV acquisition GWAS performed
- HIV_ACQ ~ SNP + AGE + SEX + PC1 + PC5 + PC9
- HIV_ACQ ~ SNP + AGE + SEX + PC1 + PC5 + PC9 + SNP*SEX
Top SNPs P-value comparison between Fang's & my data