jaamarks / jaamarks_notebooks

Collection of various projects and procedures documented with Jupyter notebooks.
0 stars 0 forks source link

VIDUS: QC + GWAS (1df+2df) #6

Closed jaamarks closed 5 years ago

jaamarks commented 5 years ago

We are starting from scratch with VIDUS and going to start with QC and push these data all the way through to a baseline GWAS as well as GxSex GWAS.

The totally unprocessed results are here:

s3://rti-midas-data/studies/vidus/observed/unprocessed/Data_Transfer_12SEP2017/Reports/Genotype_Reports/Eric_Johnson_OmniExpress_re-analysis_A1_Manifest_GENOTYPE-REPORT_AUG2017.txt

However, if you want to start with the bed/bim/fam files that I generated from the unprocessed data, they are here:

s3://rti-midas-data/studies/vidus/observed/processing/vidus.bed
s3://rti-midas-data/studies/vidus/observed/processing/vidus.bim
s3://rti-midas-data/studies/vidus/observed/processing/vidus.fam
jaamarks commented 5 years ago

Prefiltered (994 subjects)

image

Filtered

image

image

image

Ancestry Subject Count Retainment Thresholds
EA 971 (AFR < 25%) ∧ (EAS < 25%)
AA 19 (AFR > 25%) ∧ (EAS < 25%)
HA 3 (AFR < 25%) ∧ (EAS > 25%)


Note: 1 subject not assigned

ID study % AFR % EAS % EUR
8005964970_ VIDUS 0.349 0.272 0.379
jaamarks commented 5 years ago

Below are the QC summary stats report. The tables below provide summary statistics on variants and subjects filtered during each step of the QC process.




European Ancestry

Initial quality control sample tracking

QC procedure Variants removed Variants retained Subjects removed Subjects retained
Initial dbGaP dataset 0 713,599 0 1,014
Remove positive controls 0 713,599 20 994
Duplicate rsID filtering 0 713,599 0 994
Genome build 37 & b138 update 1,377 712,222 0 994



Autosome statistics

This table includes filtering statistics prior to merging with chrX.

QC procedure Variants removed Variants retained Subjects removed Subjects added Subjects retained
EA subject filter w/initial procedures (all chr) 0 712,222 0 0 994
Partitioning to only autosomes 19,227 692,995 0 0 994
Remove subjects missing whole autosome data 0 692,995 0 0 994
Add + reassigned subjects based on STRUCTURE 0 692,995 23 0 971
Strand flipping and polymorphic SNP filter 0 692,995 0 0 971
Remove variants with missing call rate > 3% 8,731 684,264 0 0 971
Remove variants with HWE p < 0.0001 971



ChrX statistics

This table includes filtering statistics prior to merging with autosomes.

QC procedure Variants removed Variants retained Subjects removed Subjects added Subjects retained
EA subject filter w/initial procedures (all chr) 0 712,222 0 0 994
Partitioning to only chrX 694,249 17,973 0 0 994
Add + reassigned subjects based on STRUCTURE 0 17,973 23 0 971
Remove subjects missing whole chrX data 0 17,973 0 0 971
Duplicate rsID filtering 0 17,973 0 0 971
Remove variants with missing call rate > 3% 1,094 16,879 0 0 971
Remove variants with HWE p < 0.0001 13,989 2,890 0 0 971



Merged autosome and chrX statistics

QC procedure Variants removed Variants retained Subjects removed Subjects added Subjects retained
Merge autosomes and chrX 0 686,744 0 0 971
Remove subjects with IBD > 0.4 or IBS > 0.9 0 686,744 21 0 950
Remove subjects with missing call rate > 3% 0 686,744 10 0 940
Sex discordance filter 0 686,744 0 0 940
Excessive homozygosity filter 0 686,744 0 0 940
Duplicate variant ID filter after 1000G renaming 0 686,744 0 0 940



Theses results have been uploaded to S3: s3://rti-hiv/vidus/data/genotype/original/final

jaamarks commented 5 years ago

Going to rerun QC, triangle plot looks quite different from Nathan's and my chrX needs reprocessing - HWE should only be run on females.

jaamarks commented 5 years ago

The plots below indicate that the data going into the STRUCTURE analysis are the same.

image

jaamarks commented 5 years ago

Projecting the high-dimensional genome-wide variant data into 3d subspace. 1000G vs VIDUS.

image

image

image

image

jaamarks commented 5 years ago

STRUCTURE analysis.

HapMap3

afr_eas_eur_vidus_hapmap3_20190131

1000g

image






HapMap3 comparison

Jesse's Original
afr_eas_eur_vidus_hapmap3_20190131 vidus structure plot nathan gaddis
jaamarks commented 5 years ago

Comparing the A1 frequency of my data to Nathan's data.

image

jaamarks commented 5 years ago

Note that in the latest run of the STRUCTURE analysis, I include one more subject in the analysis that was indicated to be removed in earlier analyses.

The subject is:

num     ID      pop     AFR      EAS       EUR
2587    8005964586_     vidus_EA        0.240   0.097   0.663

In a previous analysis this EA subject was classified as an AA an was thus filtered out:

num     ID      pop     AFR      EAS       EUR
2587    8005964586_     vidus_AA        0.254   0.132   0.614

As you can see, this subject is right on the cusp. I am choosing to include this subject in the post-STRUCTURE QC.

The full ID is: 8005964586_NA0054435_93-0328_8005964586_NA0054435_93-0328

jaamarks commented 5 years ago

Note that I did not have the sex discrepancies that were an issue in the original QC ran by Nathan G.

jaamarks commented 5 years ago

Sample 8005964634_NA0054172_93-0164 was removed during the ibs_gt_0.9_ibd_gt_0.4 filter.

jaamarks commented 5 years ago

vidus ea 1000g hiv_acq maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels manhattan vidus ea 1000g hiv_acq maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels qq

jaamarks commented 5 years ago

QC Overview

Below I cover some of the issues I have encountered while QCing the VIDUS genotype data.



STRUCTURE analysis troubleshooting

Because of the dubious looking STRUCTURE plots produced when comparing VIDUS to 1000G phase 3 subjects, we redid the STRUCTURE analysis by comparing VIDUS instead to HapMap3 subjects. Below is the resulting STRUCTURE plot. We also performed a PCA on the ld-pruned variants for VIDUS & 1000G, results below. Furthermore, we compared the MAF for chr22 of my processed data and the processed data of @ngaddis at the point directly before the STRUCTURE analysis. These data were in perfect agreement, as well as the A1 frequencies I also compared.

After validating that our data were the same going into the STRUCTURE analysis, I proceeded. Our STRUCTURE analyses identified the same individuals to be removed, except for one subject. I included one more subject 8005964586_NA0054435_93-0328. This subject was on the cusp of being removed but narrowly past the cutoff thresholds for being labeled EA, so they were retained.

Ancestry Retainment Thresholds
EA (AFR < 25%) ∧ (EAS < 25%)
num     ID      pop     AFR      EAS       EUR
2587    8005964586_     vidus_EA        0.240   0.097   0.663

image

image

image

image

image




Sex discrepancies

I did not experience any sex discrepancies whereas there were reported discrepancies before. After digging back through earlier comments in this GitHub Issue, I noted that @ngaddis had encountered sex discrepancies. In particular, he noted the following subjects had discrepant sexes

143
144
623
624
628
720
1068

They were subsequently removed from any further analyses. I did not encounter this same issue. The thresholds used to verify that self-reported gender status matches with the genetic information are

female male
F < 0.2 F > 0.8

Here are the results from my analysis of those 7 subjects:

Subject ID Self Report F
8005964966_NA0054149_93-0143 Female -0.01959
8005964978_NA0054150_93-0144 Male 0.9893
8015124413_NA0055272_93-0623 Male 0.9827
8015124401_NA0055273_93-0624 Female -0.03714
8015124354_NA0055285_93-0628 Male 0.9861
8005965380_NA0055556_93-0720 Male 0.9888
8005964955_NA0056463_93-1068 Female 0.0587

I obtained this information from the file on S3 at s3://rti-midas-data/studies/vidus/observed/processing/Eric_Johnson_OmniExpress_re-analysis_A1_Manifest_QC-REPORT_AUG2017.txt

@ngaddis I didn't see in your methods file where you had applied the sex filter. Could you point me to a methods file you used to perform the sex check? Then I could compare our processing steps.

jaamarks commented 5 years ago

A couple errors were found after the GWAS was performed on the VIDUS data, therefore the GWAS was repeated. One error was in the phenotype file (PED file) appended to the rvtests for the GWAS. It was not constructed accurately. In particular, the age column contained the wrong data. This was corrected. The other error found was in selecting the model for rvtest. The initial run I performed had linear model selected when it should have been a logistic model since the case/control status of HIV acquisition is a binary variable. I corrected this as well.

The chrX data was also analyzed during this run. Below you will see the latest results. There was once SNP whose P-value was beyond the significance threshold of 5E-8. This SNP is rs58194125, and is in the gene DSCAM. You can find more information on this SNP here.

CHROM   POS     REF     ALT     N_INFORMATIVE   AF      INFORMATIVE_ALT_AC      CALL_RATE       HWE_PVALUE      N_REF   N_HET   N_ALT   U_STAT  SQRT_V_STAT    ALT_EFFSIZE     PVALUE
21      42042450        G       A       938     0.0154051:0.0312787:0.00954234  28.9:15.827:13.073      1:1:1   1:1:1   920:239:681     18:14:4 0:0:0 8.62447  1.57832 3.46212 4.6465e-08

vidus ea 1000g oaall maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels manhattan vidus ea 1000g oaall maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels qq

jaamarks commented 5 years ago

Results comparison: my GWAS results using rvtests and Fang's GWAS results using ProbAbel. Note that Fang's plots were retrieved from S3 at the paths:

s3://rti-hiv/vidas_data/results/vidus.ea.palogist.status~snp+gender+age+EV.plots.snps+indels.manhattan.png 
s3://rti-hiv/vidas_data/results/vidus.ea.palogist.status~snp+gender+age+EV.plots.snps+indels.qq.png



My results (RVTESTS) Fang's results (ProbAbel)
vidus ea 1000g oaall maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels manhattan image
vidus ea 1000g oaall maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels qq image
jaamarks commented 5 years ago

Note that my males and females are coded the same as those of Fang's.

original data (GWAS-Cohort-n938_passed_g_qc_only_HIV\ status_baseline.csv)

gwas_code,female,hiv,ageatint
1,0,0,55
2,0,1,59
3,0,1,55
4,0,1,54
5,0,1,40
6,1,0,35
7,0,0,56
8,0,0,51

My phenotype file

IID HIV_ACQ AGE SEX PC9 PC4 PC1
8005965211_NA0055425_93-0001_8005965211_NA0055425_93-0001 0 55 0 -0.0181 -0.0017 -0.0073
8005964642_NA0053910_93-0002_8005964642_NA0053910_93-0002 1 59 0 0.0041 -0.0088 -0.0053
8005964654_NA0053915_93-0003_8005964654_NA0053915_93-0003 1 55 0 0.0191 -0.0025 4e-04
8005964666_NA0053918_93-0004_8005964666_NA0053918_93-0004 1 54 0 0.0103 -8e-04 0.0088
8005964583_NA0053919_93-0005_8005964583_NA0053919_93-0005 1 40 0 0.0068 -0.0033 0.0134
8005964584_NA0053917_93-0006_8005964584_NA0053917_93-0006 0 35 1 0.021 9e-04 -0.0084
8005964667_NA0053921_93-0007_8005964667_NA0053921_93-0007 0 56 0 -0.0479 -0.0039 0.1496
8005964619_NA0053923_93-0008_8005964619_NA0053923_93-0008 0 51 0 -0.0047 -8e-04 -0.0118

Fang's phenotype file

iid     hiv     gender  age     pc6     pc1     pc9     pc10
8005965211_NA0055425_93-0001_8005965211_NA0055425_93-0001       0       0       55      -0.0009 -0.0029 -0.0455 0.0017
8005964642_NA0053910_93-0002_8005964642_NA0053910_93-0002       1       0       59      -0.0252 -0.0038 -0.0610 -0.0029
8005964654_NA0053915_93-0003_8005964654_NA0053915_93-0003       1       0       55      -0.0031 0.0006  0.0259  -0.0023
8005964666_NA0053918_93-0004_8005964666_NA0053918_93-0004       1       0       54      -0.0216 -0.0165 -0.0121 -0.0089
8005964583_NA0053919_93-0005_8005964583_NA0053919_93-0005       1       0       40      0.0996  0.2773  0.1130  0.1389
8005964584_NA0053917_93-0006_8005964584_NA0053917_93-0006       0       1       35      0.0024  -0.0125 0.0133  0.0105
8005964667_NA0053921_93-0007_8005964667_NA0053921_93-0007       0       0       56      0.0046  0.0143  -0.0326 -0.0088
8005964619_NA0053923_93-0008_8005964619_NA0053923_93-0008       0       0       51      0.0149  -0.0021 -0.0100 -0.0151



The phenotype files are in agreement with regards to sex. Females are coded as 1 and males coded as 0.

jaamarks commented 5 years ago

Removing sex discrepant samples: 7 samples had discordant sex classifications between the genotype and phenotype data. See Nathan's post in GitHub issue #42. Because of this, I am rerunning the baseline GWAS and the 2df (GxSex) GWAS with those samples removed. This brings up an important question regarding the PCs included in the model as covariates: is it OK to perform the GWAS by simply excluding these individuals and using the same model (i.e. include the same PCs as covariates)? Wouldn't we technically need to exclude those subjects from the original genotype that was then filtered by LD-pruning and subsequently used in Eigenstrat smartpca to obtain the PCs to include in the model as covariates? I think the vector space would be different since we are excluding those 7 samples, however would it be enough to have a significant impact on the PCs used in the model. I will consult with colleagues about this.

ngaddis commented 5 years ago

We typically regenerate the EVs using just the individuals that are being analyzed.


From: jaamarks notifications@github.com Sent: Thursday, February 21, 2019 11:11 AM To: RTIInternational/jaamarks_notebooks Cc: Gaddis, Nathan; Mention Subject: Re: [RTIInternational/jaamarks_notebooks] VIDUS: QC + GWAS (#6)

Removing sex discrepant samples: 7 samples had discordant sex classifications between the genotype and phenotype data. See Nathan's post in GitHub issue #42https://github.com/RTIInternational/bioinformatics/issues/42#issuecomment-465608600. Because of this, I am rerunning the baseline GWAS and the 2df (GxSex) GWAS with those samples removed. This brings up an important question regarding the PCs included in the model as covariates: is it OK to perform the GWAS by simply excluding these individuals and using the same model (i.e. include the same PCs as covariates)? Wouldn't we technically need to exclude those subjects from the original genotype that was then filtered by LD-pruning and subsequently used in Eigenstrat smartpca to obtain the PCs to include in the model as covariates? I think the vector space would be different since we are excluding those 7 samples, however would it be enough to have a significant impact on the PCs used in the model. I will consult with colleagues about this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/RTIInternational/jaamarks_notebooks/issues/6#issuecomment-466059624, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AJskEf_R1p_CKXZJN8khVZW3b3f4b10Cks5vPsUwgaJpZM4Z75j1.

jaamarks commented 5 years ago




image



1df

vidus ea 1000g hiv_acq maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels manhattan vidus ea 1000g hiv_acq maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels qq



2df

vidus ea 2df 1000g hiv_acq maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels manhattan vidus ea 2df 1000g hiv_acq maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps indels qq



jaamarks commented 5 years ago

The EVs were generated per Nathan's recommendation above.

image

Based on these results we will use PC1, PC5, PC9. These PCs explain ~88% of the variance.

jaamarks commented 5 years ago

Here are the results for the baseline HIV acquisition GWAS of VIDUS (N=931).

image image



Results stored on S3: rti-hiv/vidus/results/1df/final

jaamarks commented 5 years ago

Here are the results for the GxSex joint 2df analysis of HIV acquisition of VIDUS (N=931).

image

image

Results stored on S3: rti-hiv/vidus/results/1df/final