Closed jaamarks closed 5 years ago
Refer to GitHub issue #117
These data were genotyped on the Illumina InfiniumOmni2-5-8v1-4_A1 chip.
These SNPs are in build GRCh37 and the variant names are from b150.
There are 1938 initial subjects genotyped (not including positive controls).
Nathan Gaddis pointed me to a methods file that was used to process UHS2 here. This file has an example of how to convert raw genotype data to tfam/tped format, and subsequently to bed/bim/fam format.
Nathan also suggested that I don't need to worry about converting from b150 to b138. Slack message on Dec 4, 2018:
I think it's actually fine to leave it in b150 actually.
It's not really necessary - when we run the imputation on MIS, the IDs get dropped anyway. We then add back in the IDs using the convert_to_1000g_p3_ids.pl script
It's more accurate to just leave it in b150 prior to the imputation
Because there might be mismappings from b138 that were corrected in b150
Because we do not have the phenotype data at the moment, we will skip the ancestry partitioning section during the genotype processing. We will perform the Structure analysis based on no self-report information and employ the typical thresholds of:
Action Description | Thresholding Criteria |
---|---|
For AA retainment | (AFR > 25%) ∧ (EAS < 25%) |
For EA retainment | (AFR < 25%) ∧ (EAS < 25%) |
For HA retainment | (AFR < 25%) ∧ (EAS > 25%) |
Ancestry | Subject Count | Retainment Thresholds |
---|---|---|
EA | 450 | (AFR < 25%) ∧ (EAS < 25%) |
AA | 1059 | (AFR > 25%) ∧ (EAS < 25%) |
HA | 8 | (AFR < 25%) ∧ (EAS > 25%) |
I am experiencing issues with these UHS4 data during two of the filtering steps, thus far. I am needing advice on how to best proceed.
With the typical variant call rate threshold of 0.03 applied, we lost 874,832 variants in the EA group and 1,741,821 variants in the AA group. I subsequently applied a less stringent threshold of 0.10, but we still lost ~275K for the EAs and ~half for the AAs.
Ancestry | Call-rate filter | Pre-filtered variant count | Post-filtered variant count |
---|---|---|---|
EA | 0.03 | 2,081,455 | 1,206,623 |
EA | 0.10 | 2,081,455 | 1,803,194 |
AA | 0.03 | 2,081,455 | 339,634 |
AA | 0.10 | 2,081,455 | 1,006,288 |
Losing all of the AA subjects, for some reason. I presume it could be due to downstream effects from the large amount of variants filtered out after the genotype call-rate filter was applied.
Ancestry | Pre-IBS subject count | Post-IBS subject count |
---|---|---|
EA | 415 | 399 |
AA | 940 | 1 |
We decided to apply some additional filters pre-STRUCTURE analysis in an attempt to clean up some of the data. See main GitHub issue #117 for more information. In essence, we apply two additional filters before the STRUCTURE analysis
We then select a 10,000 SNP subset for the STRUCTURE analysis. This appeared to clean up the data quite a bit (see plots below). Now we will move forward with the QC-pipeline post-STRUCTURE analysis.
Note that after speaking with Bryan Quach this morning, it was decided that for the post-STRUCTURE QC processing I should use the data that does not have the filters applied applied already. In particular, I will split up the subjects into their respective ancestral groups according to this latest round of STRUCTURE analysis. I will split up the data that has not had the call rate filter or the HWE filter applied. The argument for using these data instead of the filtered data that went into the STRUCTURE analysis is that maybe one ancestry has better quality data than another. It could be the case that the EA ancestry group had better call rates, for some reason—maybe processed on a separate chip—and so we don't necessarily want to use the pre-filtered data for them. Maybe some of the variant call rates were very low for the AA subjects and that was what was driving the low average call rate in the combined ancestry group data. So if we use the pre-filtered data, we might be able keep from filtering out some of the variants in one ancestry group that would have been filtered out in the combined ancestry group data.
It could also be argued that more stringent thresholds should be applied to the EA and AA ancestry groups because there are a few stray subjects that seem disparate from the others in their respective group.
Another thing to note is that it might be a good idea to consider implementing these two additional filters pre-STRUCTURE analysis in our QC-pipeline for every data set. By using cleaner data to extract a subset from for the ancestral analysis we might be able to more accurately assign subjects to their most appropriate ancestral group—considering how markedly these triangle plots were cleaned up after using the filtered data to subset the SNPs from for the STRUCTURE analysis.
UHS4 genotype data QC is now finished. Below is a breakdown of the filtering steps. In short, the final counts are:
Ancestry | Variant Count | Subject Count |
---|---|---|
AA | 946,932 | 657 |
EA | 1,154,359 | 622 |
The table below provides statistics on variants and subjects filtered during each step of the QC process before partitioning into ancestry groups.
Note: Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers may not match between the .fam files and the phenotype files.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects retained |
---|---|---|---|---|
Initial dbGaP dataset | 0 | 2,146,519 | 0 | 1,532 |
Remove positive controls | 0 | 2,146,519 | 15 | 1,517 |
Duplicate rsID filtering | 8,620 | 2,137,899 | 0 | 1,517 |
Genome build 37 & b138 update | 6,953 | 2,130,946 | 0 | 1,517 |
The table below provides statistics on variants and subjects filtered during each step of the QC process.
Note: Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers may not match between the .fam files and the phenotype files
This table includes filtering statistics prior to merging with chrX.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
AA subject filter w/initial procedures (all chr) | 0 | 2,130,946 | 0 | 0 | 1,517 |
Partitioning to only autosomes | 46,889 | 2,084,057 | 0 | 0 | 1,517 |
Remove subjects missing whole autosome data | 0 | 2,084,057 | 0 | 0 | 1,517 |
Add + reassigned subjects based on STRUCTURE | 0 | 2,084,057 | 802 | 0 | 793 |
Strand flipping and polymorphic SNP filter | 2,602 | 2,081,455 | 0 | 0 | 793 |
Remove variants with missing call rate > 10% | 1,012,289 | 1,069,166 | 0 | 0 | 793 |
Remove variants with HWE p < 0.0001 | 101,443 | 967,723 | 0 | 0 | 793 |
This table includes filtering statistics prior to merging with autosomes.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
AA subject filter w/initial procedures (all chr) | 0 | 2,130,946 | 0 | 0 | 793 |
Partitioning to only chrX | 2,085,645 | 45,301 | 0 | 0 | 793 |
Remove subjects missing whole chrX data | 0 | 45,301 | 0 | 0 | 793 |
Duplicate rsID filtering | 0 | 45,301 | 0 | 0 | 793 |
Remove variants with missing call rate > 10% | 30,068 | 15,233 | 0 | 0 | 793 |
Remove variants with HWE p < 0.0001 | 10,989 | 4,244 | 0 | 0 | 793 |
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
Merge autosomes and chrX | 0 | 971,967 | 0 | 0 | 793 |
Remove subjects with IBD > 0.4 or IBS > 0.9 | 0 | 971,967 | 42 | 0 | 751 |
Relationship inference using KING (AA only) | 0 | 971,967 | 0 | 0 | 751 |
Remove subjects with missing call rate > 10% | 0 | 971,967 | 94 | 0 | 657 |
Sex discordance filter | 0 | 971,967 | 0 | 0 | 657 |
Excessive homozygosity filter | 0 | 971,967 | 0 | 0 | 657 |
Duplicate variant ID filter after 1000G renaming | 1 | 971,966 | 0 | 0 | 657 |
Remove ambiguous SNPs | 25,034 | 946,932 | 0 | 0 | 657 |
The table below provides statistics on variants and subjects filtered during each step of the QC process.
Note: Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers may not match between the .fam files and the phenotype files
This table includes filtering statistics prior to merging with chrX.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
EA subject filter w/initial procedures (all chr) | 0 | 2,130,946 | 0 | 0 | 1,517 |
Partitioning to only autosomes | 46,889 | 2,084,057 | 0 | 0 | 1,517 |
Remove subjects missing whole autosome data | 0 | 2,084,057 | 0 | 0 | 1,517 |
Add + reassigned subjects based on STRUCTURE | 0 | 2,084,057 | 802 | 0 | 715 |
Strand flipping and polymorphic SNP filter | 2,602 | 2,081,455 | 0 | 0 | 715 |
Remove variants with missing call rate > 10% | 788,520 | 1,292,935 | 0 | 0 | 715 |
Remove variants with HWE p < 0.0001 | 118,936 | 1,173,999 | 0 | 0 | 715 |
This table includes filtering statistics prior to merging with autosomes.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
EA subject filter w/initial procedures (all chr) | 0 | 2,130,946 | 0 | 0 | 715 |
Partitioning to only chrX | 2,084,057 | 45,301 | 0 | 0 | 715 |
Remove subjects missing whole chrX data | 0 | 45,301 | 0 | 0 | 715 |
Duplicate rsID filtering | 0 | 45,301 | 0 | 0 | 715 |
Remove variants with missing call rate > 10% | 24,463 | 20,838 | 0 | 0 | 715 |
Remove variants with HWE p < 0.0001 | 10,383 | 10,455 | 0 | 0 | 715 |
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
---|---|---|---|---|---|
Merge autosomes and chrX | 0 | 1,184,454 | 0 | 0 | 715 |
Remove subjects with IBD > 0.4 or IBS > 0.9 | 0 | 1,184,454 | 39 | 0 | 676 |
Remove subjects with missing call rate > 10% | 0 | 1,184,454 | 54 | 0 | 622 |
Sex discordance filter | 0 | 1,184,454 | 0 | 0 | 622 |
Excessive homozygosity filter | 0 | 1,184,454 | 0 | 0 | 622 |
Duplicate variant ID filter after 1000G renaming | 1 | 1,184,453 | 0 | 0 | 622 |
Remove ambiguous SNPs | 30,094 | 1,154,359 | 0 | 0 | 622 |
These observed genotype data have been copied to S3 at:
s3://rti-heroin/uhs4/data/genotype/observed/final/*
Eric Johnson has requested that I characterize the variants that were removed from the analysis for the call rate & HWE filters; here I will do that.
Lost 788,520 variants due to this filter (reduced to 1,292,935).
Variant Count on Variants Retained
RETAINED
chr1 variant count: 104601
chr2 variant count: 111204
chr3 variant count: 90553
chr4 variant count: 86361
chr5 variant count: 83921
chr6 variant count: 80128
chr7 variant count: 73270
chr8 variant count: 72076
chr9 variant count: 58100
chr10 variant count: 68307
chr11 variant count: 66248
chr12 variant count: 63789
chr13 variant count: 43043
chr14 variant count: 43690
chr15 variant count: 41256
chr16 variant count: 43294
chr17 variant count: 35685
chr18 variant count: 37765
chr19 variant count: 24821
chr20 variant count: 29751
chr21 variant count: 17839
chr22 variant count: 17233
Lost 118,936 variants due to this filter (reduced to 1,173,999).
RETAINED
chr1 variant count: 94422
chr2 variant count: 101326
chr3 variant count: 82795
chr4 variant count: 79280
chr5 variant count: 76619
chr6 variant count: 72827
chr7 variant count: 66474
chr8 variant count: 65695
chr9 variant count: 52926
chr10 variant count: 61840
chr11 variant count: 60148
chr12 variant count: 57987
chr13 variant count: 39343
chr14 variant count: 39728
chr15 variant count: 37472
chr16 variant count: 39065
chr17 variant count: 31760
chr18 variant count: 34437
chr19 variant count: 21790
chr20 variant count: 26633
chr21 variant count: 16180
chr22 variant count: 15252
Lost 1,012,289 variants due to this filter (reduced to 1,069,166).
Variant Count on Variants Retained
RETAINED
chr1 variant count: 86500
chr2 variant count: 92155
chr3 variant count: 75123
chr4 variant count: 70364
chr5 variant count: 69506
chr6 variant count: 66365
chr7 variant count: 60246
chr8 variant count: 59561
chr9 variant count: 47982
chr10 variant count: 57061
chr11 variant count: 54779
chr12 variant count: 52677
chr13 variant count: 35333
chr14 variant count: 36001
chr15 variant count: 34199
chr16 variant count: 36434
chr17 variant count: 29645
chr18 variant count: 31341
chr19 variant count: 20287
chr20 variant count: 24760
chr21 variant count: 14599
chr22 variant count: 14248
Lost 101,443 variants due to this filter (reduced to 967,723).
RETAINED
chr1 variant count: 77768
chr2 variant count: 83738
chr3 variant count: 68386
chr4 variant count: 64271
chr5 variant count: 63286
chr6 variant count: 60067
chr7 variant count: 54616
chr8 variant count: 54239
chr9 variant count: 43620
chr10 variant count: 51308
chr11 variant count: 49512
chr12 variant count: 47770
chr13 variant count: 32102
chr14 variant count: 32615
chr15 variant count: 30978
chr16 variant count: 32660
chr17 variant count: 26324
chr18 variant count: 28564
chr19 variant count: 17818
chr20 variant count: 22206
chr21 variant count: 13197
chr22 variant count: 12678
Here are bar plots, for each ancestry, that show the percent of SNPs filtered out after the call rate & HWE filters were applied (filtered-SNPs / Initial SNPs).
par(mfrow = c(2,1))
barplot(ea_percent, beside = TRUE,, main = "EA: Percent of SNPs Filtered per Chromosome",
ylab= "Percent Filtered", col=c("red", "blue"),
ylim = c(0,0.6), names.arg = ea_filtered$V1, las=2)
barplot(aa_percent, beside = TRUE,, main = "AA: Percent of SNPs Filtered per Chromosome",
ylab= "Percent Filtered", col=c("red", "blue"),
ylim = c(0,0.6), names.arg = aa_filtered$V1, las=2)
Here are the MAF distributions for the retained variants, by ancestry. Working on doing the same for the lost variants.
setwd("C:/Users/jmarks/Desktop/Projects/heroin/ngc/uhs4/filtered_snp_exploration/")
ea_maf <- read.table("ea/20181211-ea-maf_report-retained_snps.frq", header = TRUE)
aa_maf <- read.table("aa/20181211-aa-maf_report-retained_snps.frq", header = TRUE)
par(mfrow = c(2,1))
hist(ea_maf$MAF, col = "green", main = "EA MAF Distribution, Genome-Wide:\n Retain SNPs",
ylim = c(0,6e+05), xlab = "MAF")
hist(aa_maf$MAF, col = "green", main = "AA MAF Distribution, Genome-Wide:\n Retain SNPs",
ylim = c(0,6e+05), xlab = "MAF")
par(mfrow = c(2,1))
hist(ea_maf_fil$MAF, col = "blue", main = "EA MAF Distribution, Genome-Wide:\n Filtered SNPs",
ylim = c(0,1e+05), xlab = "MAF")
hist(aa_maf_fil$MAF, col = "blue", main = "AA MAF Distribution, Genome-Wide:\n Filtered SNPs",
ylim = c(0,1e+05), xlab = "MAF")
360,871 total HWE violation SNPs 76% gain SNPs 24% loss SNPs
275829
85042
2081455
436,389 total HWE violation SNPs 61.5% gain SNPs 38.5% loss SNPs
in_file = "20181211-hwe-report.hwe"
"""
CHR SNP TEST A1 A2 GENO O(HET) E(HET) P
1 rs4477212 ALL(NP) G A 4/159/502 0.2391 0.2196 0.02072
"""
with open(in_file) as inF:
head = inF.readline()
line = inF.readline()
#print(head)
#print(line)
gained = 0
loss = 0
total = 0
while line:
total += 1
sl = line.split()
pval = float(sl[8])
expected = float(sl[7])
observed = float(sl[6])
if pval <= 0.0001:
if observed > expected:
gained += 1
else:
loss += 1
line = inF.readline()
print(gained)
print(loss)
print(total)
268504
167885
2081455
TO-DO: Run STRUCTURE analysis with only the 10K subset from the filtered SNPs. Post the triangle plots.
Note I am having issues with getting the triangle plots generated. I have executed the script to submit the job to the scheduler, however the job is not being finished. Looking at the log files I see that it is executing up to:
Finished initialization; starting MCMC
1000 iterations + 1000 burnin
--------------------------------------------
Proportion of membership of each pre-defined
population in each of the 3 clusters
Given Inferred Clusters Number of
Pop 1 2 3 Individuals
1: 0.967 0.001 0.032 661
2: 0.000 1.000 0.000 504
3: 0.000 0.000 1.000 503
5: 0.192 0.079 0.729 715
--------------------------------------------
Rep#: Lambda Alpha F1 F2 F3 D1,2 D1,3 D2,3 Ln Like
100: 1.00 0.603 0.113 0.274 0.170 0.043 0.035 0.023 --
200: 1.00 0.405 0.119 0.279 0.170 0.043 0.035 0.023 --
300: 1.00 0.346 0.116 0.277 0.176 0.043 0.035 0.023 --
400: 1.00 0.329 0.118 0.278 0.173 0.043 0.035 0.023 --
500: 1.00 0.325 0.124 0.277 0.173 0.043 0.035 0.023 --
600: 1.00 0.300 0.116 0.275 0.175 0.043 0.035 0.023 --
700: 1.00 0.314 0.116 0.281 0.173 0.043 0.035 0.023 --
But not getting past this point. So, there are a couple things to note. Firstly, in this log file you will see that it is noticing the population 1,2,3, & 5. It is skipping 4. The reason for this is because of a small blunder when creating the input file. I went back and created the input file again from the ped/map files before submitting the job to the scheduler again. The other thing to note is that when I look in the input file, I see for the UHS4 subjects there are many entries with -9—after an brief exploration of the data, the last four subjects were missing ~25% of the genotype entries. Now note that these variants I am using for the STRUCTURE plot are a 10K subset from the SNPs which were I identified for removal due to missing call rate >= 10% as well as HWE filter. So with all of this in mind, it may be that the high number of missing genotype entries is what is causing the STRUCTURE issues. I may be that I have to resample from the filtered SNPs in an attempt to get 10K sample with fewer missing genotype entries. I might even have to code so that the sample only contains SNPs in the 10K subset such that no more than 5% of the SNPs have missing information. STRUCTURE assert that 10K SNPs is a large enough number of genotype calls for reasonable results. Well, my samples only 75% of that 10K (~7,500) so that I likely won't get reasonable results. This might what is causing the issues.
The UHS4 data were genotyped again because of errors at Rutgers. I have downloaded the new data and uploaded it to S3 at:
s3://rti-heroin/uhs4/data/genotype/observed/20190214/unprocessed
Results from STRUCTURE analysis of UHS4 genotype data. These data were analyzed using the 1000G reference panel.
Ancestry | Subject Count | Retainment Thresholds |
---|---|---|
EA | 896 | (AFR < 25%) ∧ (EAS < 25%) |
AA | 603 | (AFR > 25%) ∧ (EAS < 25%) |
HA | 430 | (AFR < 25%) ∧ (EAS > 25%) |
total: 1929
Subjects not assigned to an ancestry group:
num ID.y pop cluster1(AFR) cluster2(EAS) cluster3(EUR)
1683 AS88-2240_8002023110_HHG10458_11_G02 UHS4 0.377 0.548 0.075
1706 AS87-4287_8002023687_HHG10370_11_F05 UHS4 0.013 0.25 0.737
2045 AS00-00631_8002022318_HHG10080_15_D03 UHS4 0.173 0.25 0.576
2141 AS99-11673_8002688076_HHG5163_16_E03 UHS4 0.414 0.429 0.157
2352 AS97-02108_8002688748_HHG3236_18_B07 UHS4 0.341 0.431 0.228
2793 AS95-04523_8002023388_HHG9427_23_H02 UHS4 0.195 0.25 0.555
2844 AS97-08067_8002025595_HHG9707_23_C09 UHS4 0.092 0.25 0.658
2888 AS88-5289_8002023160_HHG9328_24_H02 UHS4 0.128 0.25 0.622
3146 AS98-12324_8002022889_HHG9214_26_D11 UHS4 0.069 0.25 0.68
total: 9
The above processing was done on the UHS4 smokescreen set of data. What we actually want is to merge the two sets of UHS4 data, smokescreen & heroin.
Upon processing the two datasets, I found that the smokescreen group is missing 32 samples in the genotype report that were listed in the QC report. Those samples are:
AS00-02677_8002022296_HHG10094_14_F11
AS00-02681_8002022332_HHG10097_14_G11
AS00-09056_8002022964_HHG10158_14_H11
AS00-09166_8002022941_HHG10164_14_A12
AS01-02377_8002022385_HHG10230_14_B12
AS01-05575_8002022411_HHG10248_14_A11
AS01-07353_8002022353_HHG10259_14_B11
AS01-07372_8002022365_HHG10260_14_C11
AS01-09148_8002022427_HHG10281_14_D11
AS01-12312_8002022345_HHG10290_14_C12
AS87-2834_8002022407_HHG10311_14_F08
AS87-2872_8002022384_HHG10317_14_E12
AS87-3038_8002023682_HHG10330_14_C08
AS87-3080_8002023718_HHG10333_14_D08
AS87-3151_8002023766_HHG10337_14_E08
AS87-4250_8002023744_HHG10351_14_G08
AS87-4269_8002023686_HHG10362_14_A08
AS87-4296_8002023723_HHG10373_14_H08
AS88-2263_8002023134_HHG10460_14_F12
AS88-2269_8002023146_HHG10461_14_B08
AS90-2791_8002022840_HHG10028_14_D12
AS90-3680_8002022277_HHG10037_14_G12
AS98-08783_8002687449_HHG4707_17_A03
AS98-09858_8002687473_HHG4709_17_B03
AS98-09866_8002687497_HHG4711_17_C03
AS98-09868_8002687509_HHG4712_17_D03
AS98-09926_8002687462_HHG4717_17_E03
AS98-09931_8002687486_HHG4719_17_G03
AS99-02289_8002689777_HHG4380_17_F03
AS99-02293_8002689789_HHG4381_17_H03
AS99-10832_8002022326_HHG10049_14_E11
This was known beforehand though, because the folks at Rutgers emailed us about this.
From: Christian Bixby <bixby@dls.rutgers.edu>
Date: Wednesday, February 13, 2019 at 2:09 PM
To: Eric Johnson <ejohnson@rti.org>
Subject: RE: Delay in data delivery
Good Afternoon Eric,
Your Smokescreen data is now ready.
I was able to save 16 of the 48 samples that had chip scan failures.
There are still 32 samples that failed due to these chip failures.
Rescans usually salvage these, but it didn’t work for these.
We are going to repeat these samples from the beginning and send you the data once completed. It will take a couple of weeks most likely.
Results from the UHS4 merged heroin & smokescreen data sets.
Ancestry | Subject Count | Retainment Thresholds |
---|---|---|
EA | 1,395 | (AFR < 25%) ∧ (EAS < 25%) |
AA | 1650 | (AFR > 25%) ∧ (EAS < 25%) |
HA | 412 | (AFR < 25%) ∧ (EAS > 25%) |
total assigned: 3,457
12 Subjects not assigned to an ancestry group:
1767 AS00-05162_8002022286_HHG10109_13_D08_AS00-05162_8002022286_HHG10109_13_D08 UHS4 0.059 0.25 0.691
1803 AS00-06421_8002022299_HHG10118_11_G10_AS00-06421_8002022299_HHG10118_11_G10 UHS4 0.038 0.25 0.711
2239 AS01-08826_8002022414_HHG10272_12_E08_AS01-08826_8002022414_HHG10272_12_E08 UHS4 0.451 0.253 0.296
2857 AS88-0734_8002220848_HHG5989_31_F10_AS88-0734_8002220848_HHG5989_31_F10 UHS4 0.445 0.442 0.113
2930 AS88-2240_8002023110_HHG10458_11_G02_AS88-2240_8002023110_HHG10458_11_G02 UHS4 0.453 0.515 0.031
3332 AS90-7347_8002025180_HHG9839_21_A09_AS90-7347_8002025180_HHG9839_21_A09 UHS4 0.071 0.25 0.679
3521 AS92-3984_8002023083_HHG9482_24_D09_AS92-3984_8002023083_HHG9482_24_D09 UHS4 0.46 0.26 0.281
3754 AS93-5421_8002023334_HHG9375_26_C04_AS93-5421_8002023334_HHG9375_26_C04 UHS4 0.031 0.25 0.719
4521 AS97-02108_8002688748_HHG3236_18_B07_AS97-02108_8002688748_HHG3236_18_B07 UHS4 0.326 0.444 0.23
4595 AS97-08174_8002025537_HHG9718_24_B08_AS97-08174_8002025537_HHG9718_24_B08 UHS4 0.05 0.25 0.7
4599 AS97-09289_8002025585_HHG9722_30_D04_AS97-09289_8002025585_HHG9722_30_D04 UHS4 0.195 0.25 0.556
5067 AS99-11673_8002688076_HHG5163_16_E03_AS99-11673_8002688076_HHG5163_16_E03 UHS4 0.4 0.478 0.123
Given that there are many fewer subjects assigned to the HAs than we were expecting, are there different thresholds I should apply to the data so that more of the EA classified subjects will be classified as HA for the STRUCTURE analysis?
Bryan Q. pointed me to the UHS4 phenotype data at the S3 location: s3://rti-heroin/hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv.gz
All but six of the genotype IDs I was able to match to the phenotype file cell_line
column. Those IDs were:
genotype_id phenotype_column ancestry_selfreport
AS00-06677_8002022926_HHG10139_37_G08 serum 3
AS01-14267_8002022359_HHG10308_36_F12 serum 3
AS87-4277_8002690150_HHG1604_6_D09 serum 1
AS88-5278_8002690653_HHG1793_3_F06 serum 2
AS93-1569_8002690655_HHG1828_3_G07 gwasserum 2
AS93-1876_8002690620_HHG1842_3_A08 serum 1
Perl code that might come in handy later for printing the AS# followed by the cell_line ID when given the entire sample ID:
perl -lne '/(AS.+)(_800.+)(HH.+)(_\d*_)/; print $1."\t".$3' ids_kept_N3457_clean.txt > genotype.ids.AS.HH
Self-report AA, then EA, followed by HA.
STRUCTURE analysis triangle plots: UHS against 1000G
Table summarizing additional HIV status variables as well as viral load for the classified HAs. These results corroborate that the HIV case sample size is indeed small.
Subject Count | Retainment Thresholds | hiv | gwashiv | hivstat | hiv_status | viralload_cperml.x | viralload_cperml.y |
---|---|---|---|---|---|---|---|
418 | (AFR < 25%) ∧ (EAS > 25%) | 23 | 23 | 23 | 23 | 23 | 23 |
871 | (AFR < 25%) ∧ (EAS >= 8%) | 71 | 0 | 71 | 71 | 70 | 66 |
UHS4 HA GWAS model:
viralload_log10.y ~ SNP + aage + gwassex + syear_rec + class
ID | CHROM | POS | REF | ALT | N_INFORMATIVE | AF | INFORMATIVE_ALT_AC | CALL_RATE | HWE_PVALUE | N_REF | N_HET | N_ALT | U_STAT | SQRT_V_STAT | ALT_EFFSIZE | PVALUE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rs146647111:31322224:A:AC | 6 | 31322224 | A | AC | 40 | 0.03075 | 2.46 | 1 | 1 | 38 | 2 | 0 | -2.44565 | 1.63315 | -0.916945 | 0.134262 |
rs1131446:31323116:C:T | 6 | 31323116 | C | T | 40 | 0.0385125 | 3.081 | 1 | 1 | 37 | 3 | 0 | -0.632343 | 1.95446 | -0.165539 | 0.746287 |
rs7047677:23084922:G:C | 9 | 23084922 | G | C | 40 | 0.898075 | 71.846 | 1 | 0.320825 | 1 | 6 | 33 | -1.5175 | 3.01976 | -0.166412 | 0.615299 |
rs143120449:28812768:G:T | 11 | 28812768 | G | T | 40 | 0.028425 | 2.274 | 1 | 1 | 38 | 2 | 0 | -0.610624 | 1.60853 | -0.236003 | 0.70423 |
rs73080309:25781909:G:A | 12 | 25781909 | G | A | 40 | 0.0624875 | 4.999 | 1 | 1 | 35 | 5 | 0 | 0.247328 | 2.38132 | 0.0436152 | 0.917279 |
rs7308030:116244327:T:C | 12 | 116244327 | T | C | 40 | 0.112512 | 9.001 | 1 | 0.396289 | 32 | 7 | 1 | 2.91931 | 3.59525 | 0.225851 | 0.416797 |
Note that there were only 40 subjects in this analysis.
Filtering Criterion | Subjects Removed | Total |
---|---|---|
HA classified subjects with viral load data | 0 | 65 |
Missing genotype data | 22 | 43 |
Missing missing age or sex | 3 | 40 |
We are investigating why so many subjects were removed due to IBS filtering. Using the standard thresholds for the STRUCTURE analysis, we remove too many subjects with the IBS filter. See table below.
Ancestry | Subject Count | Retainment Thresholds |
---|---|---|
EA | 1,650 | (AFR < 25%) ∧ (EAS < 25%) |
AA | 1,395 | (AFR > 25%) ∧ (EAS < 25%) |
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects retained |
---|---|---|---|---|
Merge autosomes and chrX | 0 | 2,056,398 | 0 | 1,650 |
Remove subjects with IBD > 0.4 or IBS > 0.9 | 0 | 2,056,398 | 886 | 764 |
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects retained |
---|---|---|---|---|
Merge autosomes and chrX | 0 | 1,669,722 | 0 | 1,395 |
Remove subjects with IBD > 0.4 or IBS > 0.9 | 0 | 1,669,722 | 116 | 1,279 |
We will investigate:
pmmn
personid
presid
date_acq
gwasdate_acq
gwaspresid
hivstat
hiv_status
wave.x
wave.y
ind_id
Note that I am not quite sure what information mapping the 886 EA classified subjects to the phenotype file will give us.
I repeated some of the the QC steps for UHS4—from post-STRUCTURE through IBS/IBD filtering—in an attempt to recapitulate the tables from above. The results were the same, however I discovered that the earlier posted table for the EAs had some entry errors. I address these entry errors in the tables below to more accurately reflect the QC statistics results.
This table includes filtering statistics prior to merging with chrX.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects retained |
---|---|---|---|---|
subject filter w/initial procedures (all chr) | 0 | 2,367,429 | 0 | 3,469 |
Partitioning to only autosomes | 55,829 | 2,311,600 | 0 | 3,469 |
Remove subjects missing whole autosome data | 0 | 2,311,600 | 0 | 3,469 |
Add + reassigned subjects based on STRUCTURE | 0 | 2,311,600 | 802 | 1,650 |
Remove variants with missing call rate > 10% | 246,846 | 2,064,754 | 0 | 1,650 |
Remove variants with HWE p < 0.0001 | 44,918 | 2,019,836 | 0 | 1,650 |
Remove subjects with missing call rate > 10% | 0 | 2,019,836 | 78 | 1,572 |
Remove subjects with IBS > 0.9 | 0 | 2,019,836 | 886 | 686 |
Remove subjects with IBD > 0.4 | 0 | 2,019,836 | 0 | 686 |
This table includes filtering statistics prior to merging with chrX.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects retained |
---|---|---|---|---|
subject filter w/initial procedures (all chr) | 0 | 2,367,429 | 0 | 3,469 |
Partitioning to only autosomes | 55,829 | 2,311,600 | 0 | 3,469 |
Remove subjects missing whole autosome data | 0 | 2,311,600 | 0 | 3,469 |
Add + reassigned subjects based on STRUCTURE | 0 | 2,311,600 | 2,074 | 1,395 |
Remove variants with missing call rate > 10% | 416,578 | 1,895,022 | 0 | 1,395 |
Remove variants with HWE p < 0.0001 | 252,879 | 1,642,143 | 0 | 1,395 |
Remove subjects with missing call rate > 10% | 0 | 1,669,722 | 149 | 1,246 |
Remove subjects with IBS > 0.9 | 0 | 1,669,722 | 116 | 1,130 |
Remove subjects with IBD > 0.4 | 0 | 1,669,722 | 0 | 1,130 |
Remove subjects with KING > 0.177 | 0 | 1,669,722 | 0 | 1,130 |
This table includes filtering statistics prior to merging with chrX.
QC procedure | Variants removed | Variants retained | Subjects removed | Subjects retained |
---|---|---|---|---|
subject filter w/initial procedures (all chr) | 0 | 2,367,429 | 0 | 3,469 |
Partitioning to only autosomes | 55,829 | 2,311,600 | 0 | 3,469 |
Remove subjects missing whole autosome data | 0 | 2,311,600 | 0 | 3,469 |
Add + reassigned subjects based on STRUCTURE | 0 | 2,311,600 | 3,051 | 418 |
Remove variants with missing call rate > 10% | 276,447 | 2,035,153 | 0 | 418 |
Remove variants with HWE p < 0.0001 | 12,330 | 2,022,823 | 0 | 418 |
Remove subjects with missing call rate > 10% | 0 | 2,022,823 | 12 | 406 |
Remove subjects with IBS > 0.9 | 0 | 2,022,823 | 152 | 254 |
Remove subjects with IBD > 0.4 | 0 | 2,022,823 | 0 | 254 |
We will proceed with UHS1.2.3. imputation + GWAS, since the UHS4 data seem to be corrupt in some way—what with the high number of IBS filtered subjects and other anomalous characteristics.
TO-DO