Closed jaamarks closed 5 years ago
Ancestry | Subject Count | Retainment Thresholds |
---|---|---|
EA | 971 | (AFR < 25%) ∧ (EAS < 25%) |
AA | 19 | (AFR > 25%) ∧ (EAS < 25%) |
HA | 3 | (AFR < 25%) ∧ (EAS > 25%) |
Note: 1 subject not assigned
ID | study | % AFR | % EAS | % EUR |
---|---|---|---|---|
8005964970_ | VIDUS | 0.349 | 0.272 | 0.379 |
Below are the QC summary stats report. The tables below provide summary statistics on variants and subjects filtered during each step of the QC process.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects retained |
---|---|---|---|---|
Initial dbGaP dataset | 0 | 713,599 | 0 | 1,014 |
Remove positive controls | 0 | 713,599 | 20 | 994 |
Duplicate rsID filtering | 0 | 713,599 | 0 | 994 |
Genome build 37 & b138 update | 1,377 | 712,222 | 0 | 994 |
This table includes filtering statistics prior to merging with chrX.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
EA subject filter w/initial procedures (all chr) | 0 | 712,222 | 0 | 0 | 994 |
Partitioning to only autosomes | 19,227 | 692,995 | 0 | 0 | 994 |
Remove subjects missing whole autosome data | 0 | 692,995 | 0 | 0 | 994 |
Add + reassigned subjects based on STRUCTURE | 0 | 692,995 | 23 | 0 | 971 |
Strand flipping and polymorphic SNP filter | 0 | 692,995 | 0 | 0 | 971 |
Remove variants with missing call rate > 3% | 8,731 | 684,264 | 0 | 0 | 971 |
Remove variants with HWE p < 0.0001 | 971 |
This table includes filtering statistics prior to merging with autosomes.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
EA subject filter w/initial procedures (all chr) | 0 | 712,222 | 0 | 0 | 994 |
Partitioning to only chrX | 694,249 | 17,973 | 0 | 0 | 994 |
Add + reassigned subjects based on STRUCTURE | 0 | 17,973 | 23 | 0 | 971 |
Remove subjects missing whole chrX data | 0 | 17,973 | 0 | 0 | 971 |
Duplicate rsID filtering | 0 | 17,973 | 0 | 0 | 971 |
Remove variants with missing call rate > 3% | 1,094 | 16,879 | 0 | 0 | 971 |
Remove variants with HWE p < 0.0001 | 13,989 | 2,890 | 0 | 0 | 971 |
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
Merge autosomes and chrX | 0 | 686,744 | 0 | 0 | 971 |
Remove subjects with IBD > 0.4 or IBS > 0.9 | 0 | 686,744 | 21 | 0 | 950 |
Remove subjects with missing call rate > 3% | 0 | 686,744 | 10 | 0 | 940 |
Sex discordance filter | 0 | 686,744 | 0 | 0 | 940 |
Excessive homozygosity filter | 0 | 686,744 | 0 | 0 | 940 |
Duplicate variant ID filter after 1000G renaming | 0 | 686,744 | 0 | 0 | 940 |
Theses results have been uploaded to S3: s3://rti-hiv/vidus/data/genotype/original/final
Going to rerun QC, triangle plot looks quite different from Nathan's and my chrX needs reprocessing - HWE should only be run on females.
The plots below indicate that the data going into the STRUCTURE analysis are the same.
Projecting the high-dimensional genome-wide variant data into 3d subspace. 1000G vs VIDUS.
STRUCTURE analysis.
Jesse's | Original |
---|---|
Comparing the A1 frequency of my data to Nathan's data.
Note that in the latest run of the STRUCTURE analysis, I include one more subject in the analysis that was indicated to be removed in earlier analyses.
The subject is:
num ID pop AFR EAS EUR
2587 8005964586_ vidus_EA 0.240 0.097 0.663
In a previous analysis this EA subject was classified as an AA an was thus filtered out:
num ID pop AFR EAS EUR
2587 8005964586_ vidus_AA 0.254 0.132 0.614
As you can see, this subject is right on the cusp. I am choosing to include this subject in the post-STRUCTURE QC.
The full ID is:
8005964586_NA0054435_93-0328_8005964586_NA0054435_93-0328
Note that I did not have the sex discrepancies that were an issue in the original QC ran by Nathan G.
Sample 8005964634_NA0054172_93-0164
was removed during the ibs_gt_0.9_ibd_gt_0.4 filter.
Below I cover some of the issues I have encountered while QCing the VIDUS genotype data.
Because of the dubious looking STRUCTURE plots produced when comparing VIDUS to 1000G phase 3 subjects, we redid the STRUCTURE analysis by comparing VIDUS instead to HapMap3 subjects. Below is the resulting STRUCTURE plot. We also performed a PCA on the ld-pruned variants for VIDUS & 1000G, results below. Furthermore, we compared the MAF for chr22 of my processed data and the processed data of @ngaddis at the point directly before the STRUCTURE analysis. These data were in perfect agreement, as well as the A1 frequencies I also compared.
After validating that our data were the same going into the STRUCTURE analysis, I proceeded. Our STRUCTURE analyses identified the same individuals to be removed, except for one subject. I included one more subject 8005964586_NA0054435_93-0328
. This subject was on the cusp of being removed but narrowly past the cutoff thresholds for being labeled EA, so they were retained.
Ancestry | Retainment Thresholds |
---|---|
EA | (AFR < 25%) ∧ (EAS < 25%) |
num ID pop AFR EAS EUR
2587 8005964586_ vidus_EA 0.240 0.097 0.663
I did not experience any sex discrepancies whereas there were reported discrepancies before. After digging back through earlier comments in this GitHub Issue, I noted that @ngaddis had encountered sex discrepancies. In particular, he noted the following subjects had discrepant sexes
143
144
623
624
628
720
1068
They were subsequently removed from any further analyses. I did not encounter this same issue. The thresholds used to verify that self-reported gender status matches with the genetic information are
female | male |
---|---|
F < 0.2 | F > 0.8 |
Here are the results from my analysis of those 7 subjects:
Subject ID | Self Report | F |
---|---|---|
8005964966_NA0054149_93-0143 | Female | -0.01959 |
8005964978_NA0054150_93-0144 | Male | 0.9893 |
8015124413_NA0055272_93-0623 | Male | 0.9827 |
8015124401_NA0055273_93-0624 | Female | -0.03714 |
8015124354_NA0055285_93-0628 | Male | 0.9861 |
8005965380_NA0055556_93-0720 | Male | 0.9888 |
8005964955_NA0056463_93-1068 | Female | 0.0587 |
I obtained this information from the file on S3 at
s3://rti-midas-data/studies/vidus/observed/processing/Eric_Johnson_OmniExpress_re-analysis_A1_Manifest_QC-REPORT_AUG2017.txt
@ngaddis I didn't see in your methods file where you had applied the sex filter. Could you point me to a methods file you used to perform the sex check? Then I could compare our processing steps.
A couple errors were found after the GWAS was performed on the VIDUS data, therefore the GWAS was repeated. One error was in the phenotype file (PED file) appended to the rvtests for the GWAS. It was not constructed accurately. In particular, the age
column contained the wrong data. This was corrected. The other error found was in selecting the model for rvtest. The initial run I performed had linear model selected when it should have been a logistic model since the case/control status of HIV acquisition is a binary variable. I corrected this as well.
The chrX data was also analyzed during this run. Below you will see the latest results. There was once SNP whose P-value was beyond the significance threshold of 5E-8. This SNP is rs58194125, and is in the gene DSCAM. You can find more information on this SNP here.
CHROM POS REF ALT N_INFORMATIVE AF INFORMATIVE_ALT_AC CALL_RATE HWE_PVALUE N_REF N_HET N_ALT U_STAT SQRT_V_STAT ALT_EFFSIZE PVALUE
21 42042450 G A 938 0.0154051:0.0312787:0.00954234 28.9:15.827:13.073 1:1:1 1:1:1 920:239:681 18:14:4 0:0:0 8.62447 1.57832 3.46212 4.6465e-08
Results comparison: my GWAS results using rvtests and Fang's GWAS results using ProbAbel. Note that Fang's plots were retrieved from S3 at the paths:
s3://rti-hiv/vidas_data/results/vidus.ea.palogist.status~snp+gender+age+EV.plots.snps+indels.manhattan.png
s3://rti-hiv/vidas_data/results/vidus.ea.palogist.status~snp+gender+age+EV.plots.snps+indels.qq.png
My results (RVTESTS) | Fang's results (ProbAbel) |
---|---|
Note that my males and females are coded the same as those of Fang's.
gwas_code,female,hiv,ageatint
1,0,0,55
2,0,1,59
3,0,1,55
4,0,1,54
5,0,1,40
6,1,0,35
7,0,0,56
8,0,0,51
IID HIV_ACQ AGE SEX PC9 PC4 PC1
8005965211_NA0055425_93-0001_8005965211_NA0055425_93-0001 0 55 0 -0.0181 -0.0017 -0.0073
8005964642_NA0053910_93-0002_8005964642_NA0053910_93-0002 1 59 0 0.0041 -0.0088 -0.0053
8005964654_NA0053915_93-0003_8005964654_NA0053915_93-0003 1 55 0 0.0191 -0.0025 4e-04
8005964666_NA0053918_93-0004_8005964666_NA0053918_93-0004 1 54 0 0.0103 -8e-04 0.0088
8005964583_NA0053919_93-0005_8005964583_NA0053919_93-0005 1 40 0 0.0068 -0.0033 0.0134
8005964584_NA0053917_93-0006_8005964584_NA0053917_93-0006 0 35 1 0.021 9e-04 -0.0084
8005964667_NA0053921_93-0007_8005964667_NA0053921_93-0007 0 56 0 -0.0479 -0.0039 0.1496
8005964619_NA0053923_93-0008_8005964619_NA0053923_93-0008 0 51 0 -0.0047 -8e-04 -0.0118
iid hiv gender age pc6 pc1 pc9 pc10
8005965211_NA0055425_93-0001_8005965211_NA0055425_93-0001 0 0 55 -0.0009 -0.0029 -0.0455 0.0017
8005964642_NA0053910_93-0002_8005964642_NA0053910_93-0002 1 0 59 -0.0252 -0.0038 -0.0610 -0.0029
8005964654_NA0053915_93-0003_8005964654_NA0053915_93-0003 1 0 55 -0.0031 0.0006 0.0259 -0.0023
8005964666_NA0053918_93-0004_8005964666_NA0053918_93-0004 1 0 54 -0.0216 -0.0165 -0.0121 -0.0089
8005964583_NA0053919_93-0005_8005964583_NA0053919_93-0005 1 0 40 0.0996 0.2773 0.1130 0.1389
8005964584_NA0053917_93-0006_8005964584_NA0053917_93-0006 0 1 35 0.0024 -0.0125 0.0133 0.0105
8005964667_NA0053921_93-0007_8005964667_NA0053921_93-0007 0 0 56 0.0046 0.0143 -0.0326 -0.0088
8005964619_NA0053923_93-0008_8005964619_NA0053923_93-0008 0 0 51 0.0149 -0.0021 -0.0100 -0.0151
The phenotype files are in agreement with regards to sex. Females are coded as 1 and males coded as 0.
Removing sex discrepant samples: 7 samples had discordant sex classifications between the genotype and phenotype data. See Nathan's post in GitHub issue #42. Because of this, I am rerunning the baseline GWAS and the 2df (GxSex) GWAS with those samples removed. This brings up an important question regarding the PCs included in the model as covariates: is it OK to perform the GWAS by simply excluding these individuals and using the same model (i.e. include the same PCs as covariates)? Wouldn't we technically need to exclude those subjects from the original genotype that was then filtered by LD-pruning and subsequently used in Eigenstrat smartpca to obtain the PCs to include in the model as covariates? I think the vector space would be different since we are excluding those 7 samples, however would it be enough to have a significant impact on the PCs used in the model. I will consult with colleagues about this.
We typically regenerate the EVs using just the individuals that are being analyzed.
From: jaamarks notifications@github.com Sent: Thursday, February 21, 2019 11:11 AM To: RTIInternational/jaamarks_notebooks Cc: Gaddis, Nathan; Mention Subject: Re: [RTIInternational/jaamarks_notebooks] VIDUS: QC + GWAS (#6)
Removing sex discrepant samples: 7 samples had discordant sex classifications between the genotype and phenotype data. See Nathan's post in GitHub issue #42https://github.com/RTIInternational/bioinformatics/issues/42#issuecomment-465608600. Because of this, I am rerunning the baseline GWAS and the 2df (GxSex) GWAS with those samples removed. This brings up an important question regarding the PCs included in the model as covariates: is it OK to perform the GWAS by simply excluding these individuals and using the same model (i.e. include the same PCs as covariates)? Wouldn't we technically need to exclude those subjects from the original genotype that was then filtered by LD-pruning and subsequently used in Eigenstrat smartpca to obtain the PCs to include in the model as covariates? I think the vector space would be different since we are excluding those 7 samples, however would it be enough to have a significant impact on the PCs used in the model. I will consult with colleagues about this.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/RTIInternational/jaamarks_notebooks/issues/6#issuecomment-466059624, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AJskEf_R1p_CKXZJN8khVZW3b3f4b10Cks5vPsUwgaJpZM4Z75j1.
The EVs were generated per Nathan's recommendation above.
Based on these results we will use PC1, PC5, PC9. These PCs explain ~88% of the variance.
Here are the results for the baseline HIV acquisition GWAS of VIDUS (N=931).
Results stored on S3:
rti-hiv/vidus/results/1df/final
Here are the results for the GxSex joint 2df analysis of HIV acquisition of VIDUS (N=931).
Results stored on S3:
rti-hiv/vidus/results/1df/final
We are starting from scratch with VIDUS and going to start with QC and push these data all the way through to a baseline GWAS as well as GxSex GWAS.
The totally unprocessed results are here:
However, if you want to start with the bed/bim/fam files that I generated from the unprocessed data, they are here: