jaamarks commented 5 years ago

TO-DO

jaamarks commented 5 years ago

These data were genotyped on the Illumina InfiniumOmni2-5-8v1-4_A1 chip.

These SNPs are in build GRCh37 and the variant names are from b150.

There are 1938 initial subjects genotyped (not including positive controls).

Nathan Gaddis pointed me to a methods file that was used to process UHS2 here. This file has an example of how to convert raw genotype data to tfam/tped format, and subsequently to bed/bim/fam format.

Nathan also suggested that I don't need to worry about converting from b150 to b138. Slack message on Dec 4, 2018:

I think it's actually fine to leave it in b150 actually.
It's not really necessary - when we run the imputation on MIS, the IDs get dropped anyway. We then add back in the IDs using the convert_to_1000g_p3_ids.pl script
It's more accurate to just leave it in b150 prior to the imputation
Because there might be mismappings from b138 that were corrected in b150

jaamarks commented 5 years ago

Because we do not have the phenotype data at the moment, we will skip the ancestry partitioning section during the genotype processing. We will perform the Structure analysis based on no self-report information and employ the typical thresholds of:

Action Description	Thresholding Criteria
For AA retainment	(AFR > 25%) ∧ (EAS < 25%)
For EA retainment	(AFR < 25%) ∧ (EAS < 25%)
For HA retainment	(AFR < 25%) ∧ (EAS > 25%)

jaamarks commented 5 years ago

Pre-filtered (1,517 subjects)

Filtered

Ancestry	Subject Count	Retainment Thresholds
EA	450	(AFR < 25%) ∧ (EAS < 25%)
AA	1059	(AFR > 25%) ∧ (EAS < 25%)
HA	8	(AFR < 25%) ∧ (EAS > 25%)

jaamarks commented 5 years ago

I am experiencing issues with these UHS4 data during two of the filtering steps, thus far. I am needing advice on how to best proceed.

variant call rate—losing a large amount post-filtering
IBS filtering—losing all AAs

Variant Call Rate Filtering

With the typical variant call rate threshold of 0.03 applied, we lost 874,832 variants in the EA group and 1,741,821 variants in the AA group. I subsequently applied a less stringent threshold of 0.10, but we still lost ~275K for the EAs and ~half for the AAs.

Ancestry	Call-rate filter	Pre-filtered variant count	Post-filtered variant count
EA	0.03	2,081,455	1,206,623
EA	0.10	2,081,455	1,803,194
AA	0.03	2,081,455	339,634
AA	0.10	2,081,455	1,006,288

IBS Filtering

Losing all of the AA subjects, for some reason. I presume it could be due to downstream effects from the large amount of variants filtered out after the genotype call-rate filter was applied.

Ancestry	Pre-IBS subject count	Post-IBS subject count
EA	415	399
AA	940	1

jaamarks commented 5 years ago

We decided to apply some additional filters pre-STRUCTURE analysis in an attempt to clean up some of the data. See main GitHub issue #117 for more information. In essence, we apply two additional filters before the STRUCTURE analysis

genotype call rate filter of 10%
HWE filter of 0.0001

We then select a 10,000 SNP subset for the STRUCTURE analysis. This appeared to clean up the data quite a bit (see plots below). Now we will move forward with the QC-pipeline post-STRUCTURE analysis.

Note that after speaking with Bryan Quach this morning, it was decided that for the post-STRUCTURE QC processing I should use the data that does not have the filters applied applied already. In particular, I will split up the subjects into their respective ancestral groups according to this latest round of STRUCTURE analysis. I will split up the data that has not had the call rate filter or the HWE filter applied. The argument for using these data instead of the filtered data that went into the STRUCTURE analysis is that maybe one ancestry has better quality data than another. It could be the case that the EA ancestry group had better call rates, for some reason—maybe processed on a separate chip—and so we don't necessarily want to use the pre-filtered data for them. Maybe some of the variant call rates were very low for the AA subjects and that was what was driving the low average call rate in the combined ancestry group data. So if we use the pre-filtered data, we might be able keep from filtering out some of the variants in one ancestry group that would have been filtered out in the combined ancestry group data.

It could also be argued that more stringent thresholds should be applied to the EA and AA ancestry groups because there are a few stray subjects that seem disparate from the others in their respective group.

Another thing to note is that it might be a good idea to consider implementing these two additional filters pre-STRUCTURE analysis in our QC-pipeline for every data set. By using cleaner data to extract a subset from for the ancestral analysis we might be able to more accurately assign subjects to their most appropriate ancestral group—considering how markedly these triangle plots were cleaned up after using the filtered data to subset the SNPs from for the STRUCTURE analysis.

jaamarks commented 5 years ago

UHS4 genotype data QC is now finished. Below is a breakdown of the filtering steps. In short, the final counts are:

Ancestry	Variant Count	Subject Count
AA	946,932	657
EA	1,154,359	622

Pre-ancestry partitioning quality control sample tracking

The table below provides statistics on variants and subjects filtered during each step of the QC process before partitioning into ancestry groups.

Note: Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers may not match between the .fam files and the phenotype files.

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
Initial dbGaP dataset	0	2,146,519	0	1,532
Remove positive controls	0	2,146,519	15	1,517
Duplicate rsID filtering	8,620	2,137,899	0	1,517
Genome build 37 & b138 update	6,953	2,130,946	0	1,517

African Ancestry

The table below provides statistics on variants and subjects filtered during each step of the QC process.

Note: Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers may not match between the .fam files and the phenotype files

Autosome statistics

This table includes filtering statistics prior to merging with chrX.

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
AA subject filter w/initial procedures (all chr)	0	2,130,946	0	1,517
Partitioning to only autosomes	46,889	2,084,057	0	1,517
Remove subjects missing whole autosome data	0	2,084,057	0	1,517
Add + reassigned subjects based on STRUCTURE	0	2,084,057	802	793
Strand flipping and polymorphic SNP filter	2,602	2,081,455	0	793
Remove variants with missing call rate > 10%	1,012,289	1,069,166	0	793
Remove variants with HWE p < 0.0001	101,443	967,723	0	793

ChrX statistics

This table includes filtering statistics prior to merging with autosomes.

QC procedure	Variants removed	Variants retained	Subjects retained
AA subject filter w/initial procedures (all chr)	0	2,130,946	793
Partitioning to only chrX	2,085,645	45,301	793
Remove subjects missing whole chrX data	0	45,301	793
Duplicate rsID filtering	0	45,301	793
Remove variants with missing call rate > 10%	30,068	15,233	793
Remove variants with HWE p < 0.0001	10,989	4,244	793

Merged autosome and chrX statistics

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
Merge autosomes and chrX	0	971,967	0	793
Remove subjects with IBD > 0.4 or IBS > 0.9	0	971,967	42	751
Relationship inference using KING (AA only)	0	971,967	0	751
Remove subjects with missing call rate > 10%	0	971,967	94	657
Sex discordance filter	0	971,967	0	657
Excessive homozygosity filter	0	971,967	0	657
Duplicate variant ID filter after 1000G renaming	1	971,966	0	657
Remove ambiguous SNPs	25,034	946,932	0	657

European Ancestry

The table below provides statistics on variants and subjects filtered during each step of the QC process.

Note: Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers may not match between the .fam files and the phenotype files

Autosome statistics

This table includes filtering statistics prior to merging with chrX.

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
EA subject filter w/initial procedures (all chr)	0	2,130,946	0	1,517
Partitioning to only autosomes	46,889	2,084,057	0	1,517
Remove subjects missing whole autosome data	0	2,084,057	0	1,517
Add + reassigned subjects based on STRUCTURE	0	2,084,057	802	715
Strand flipping and polymorphic SNP filter	2,602	2,081,455	0	715
Remove variants with missing call rate > 10%	788,520	1,292,935	0	715
Remove variants with HWE p < 0.0001	118,936	1,173,999	0	715

ChrX statistics

This table includes filtering statistics prior to merging with autosomes.

QC procedure	Variants removed	Variants retained	Subjects retained
EA subject filter w/initial procedures (all chr)	0	2,130,946	715
Partitioning to only chrX	2,084,057	45,301	715
Remove subjects missing whole chrX data	0	45,301	715
Duplicate rsID filtering	0	45,301	715
Remove variants with missing call rate > 10%	24,463	20,838	715
Remove variants with HWE p < 0.0001	10,383	10,455	715

Merged autosome and chrX statistics

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
Merge autosomes and chrX	0	1,184,454	0	715
Remove subjects with IBD > 0.4 or IBS > 0.9	0	1,184,454	39	676
Remove subjects with missing call rate > 10%	0	1,184,454	54	622
Sex discordance filter	0	1,184,454	0	622
Excessive homozygosity filter	0	1,184,454	0	622
Duplicate variant ID filter after 1000G renaming	1	1,184,453	0	622
Remove ambiguous SNPs	30,094	1,154,359	0	622

These observed genotype data have been copied to S3 at: s3://rti-heroin/uhs4/data/genotype/observed/final/*

jaamarks commented 5 years ago

Eric Johnson has requested that I characterize the variants that were removed from the analysis for the call rate & HWE filters; here I will do that.

EA

10% Call Rate Filter

Lost 788,520 variants due to this filter (reduced to 1,292,935).

Variant Count on Variants Retained

RETAINED
chr1 variant count: 104601
chr2 variant count: 111204
chr3 variant count: 90553
chr4 variant count: 86361
chr5 variant count: 83921
chr6 variant count: 80128
chr7 variant count: 73270
chr8 variant count: 72076
chr9 variant count: 58100
chr10 variant count: 68307
chr11 variant count: 66248
chr12 variant count: 63789
chr13 variant count: 43043
chr14 variant count: 43690
chr15 variant count: 41256
chr16 variant count: 43294
chr17 variant count: 35685
chr18 variant count: 37765
chr19 variant count: 24821
chr20 variant count: 29751
chr21 variant count: 17839
chr22 variant count: 17233

HWE Filter

Lost 118,936 variants due to this filter (reduced to 1,173,999).

RETAINED
chr1 variant count: 94422
chr2 variant count: 101326
chr3 variant count: 82795
chr4 variant count: 79280
chr5 variant count: 76619
chr6 variant count: 72827
chr7 variant count: 66474
chr8 variant count: 65695
chr9 variant count: 52926
chr10 variant count: 61840
chr11 variant count: 60148
chr12 variant count: 57987
chr13 variant count: 39343
chr14 variant count: 39728
chr15 variant count: 37472
chr16 variant count: 39065
chr17 variant count: 31760
chr18 variant count: 34437
chr19 variant count: 21790
chr20 variant count: 26633
chr21 variant count: 16180
chr22 variant count: 15252

AA

10% Call Rate Filter

Lost 1,012,289 variants due to this filter (reduced to 1,069,166).

Variant Count on Variants Retained

RETAINED
chr1 variant count: 86500
chr2 variant count: 92155
chr3 variant count: 75123
chr4 variant count: 70364
chr5 variant count: 69506
chr6 variant count: 66365
chr7 variant count: 60246
chr8 variant count: 59561
chr9 variant count: 47982
chr10 variant count: 57061
chr11 variant count: 54779
chr12 variant count: 52677
chr13 variant count: 35333
chr14 variant count: 36001
chr15 variant count: 34199
chr16 variant count: 36434
chr17 variant count: 29645
chr18 variant count: 31341
chr19 variant count: 20287
chr20 variant count: 24760
chr21 variant count: 14599
chr22 variant count: 14248

HWE Filter

Lost 101,443 variants due to this filter (reduced to 967,723).

RETAINED
chr1 variant count: 77768
chr2 variant count: 83738
chr3 variant count: 68386
chr4 variant count: 64271
chr5 variant count: 63286
chr6 variant count: 60067
chr7 variant count: 54616
chr8 variant count: 54239
chr9 variant count: 43620
chr10 variant count: 51308
chr11 variant count: 49512
chr12 variant count: 47770
chr13 variant count: 32102
chr14 variant count: 32615
chr15 variant count: 30978
chr16 variant count: 32660
chr17 variant count: 26324
chr18 variant count: 28564
chr19 variant count: 17818
chr20 variant count: 22206
chr21 variant count: 13197
chr22 variant count: 12678

jaamarks commented 5 years ago

Here are bar plots, for each ancestry, that show the percent of SNPs filtered out after the call rate & HWE filters were applied (filtered-SNPs / Initial SNPs).

par(mfrow = c(2,1))
barplot(ea_percent, beside = TRUE,, main = "EA: Percent of SNPs Filtered per Chromosome",
         ylab= "Percent Filtered", col=c("red", "blue"),
       ylim = c(0,0.6), names.arg = ea_filtered$V1, las=2)
barplot(aa_percent, beside = TRUE,, main = "AA: Percent of SNPs Filtered per Chromosome",
         ylab= "Percent Filtered", col=c("red", "blue"),
       ylim = c(0,0.6), names.arg = aa_filtered$V1, las=2)

jaamarks commented 5 years ago

Here are the MAF distributions for the retained variants, by ancestry. Working on doing the same for the lost variants.

setwd("C:/Users/jmarks/Desktop/Projects/heroin/ngc/uhs4/filtered_snp_exploration/")

ea_maf <- read.table("ea/20181211-ea-maf_report-retained_snps.frq", header = TRUE)
aa_maf <- read.table("aa/20181211-aa-maf_report-retained_snps.frq", header = TRUE)

par(mfrow = c(2,1))
hist(ea_maf$MAF, col = "green", main = "EA MAF Distribution, Genome-Wide:\n Retain SNPs",
     ylim = c(0,6e+05), xlab = "MAF")
hist(aa_maf$MAF, col = "green", main = "AA MAF Distribution, Genome-Wide:\n Retain SNPs",
     ylim = c(0,6e+05), xlab = "MAF")

jaamarks commented 5 years ago

par(mfrow = c(2,1))
hist(ea_maf_fil$MAF, col = "blue", main = "EA MAF Distribution, Genome-Wide:\n Filtered SNPs",
     ylim = c(0,1e+05), xlab = "MAF")
hist(aa_maf_fil$MAF, col = "blue", main = "AA MAF Distribution, Genome-Wide:\n Filtered SNPs",
     ylim = c(0,1e+05), xlab = "MAF")

jaamarks commented 5 years ago

EA

360,871 total HWE violation SNPs 76% gain SNPs 24% loss SNPs

275829
85042
2081455

AA

436,389 total HWE violation SNPs 61.5% gain SNPs 38.5% loss SNPs

in_file = "20181211-hwe-report.hwe"
"""
 CHR         SNP     TEST   A1   A2                 GENO   O(HET)   E(HET)            P

   1   rs4477212  ALL(NP)    G    A            4/159/502   0.2391   0.2196      0.02072

"""

with open(in_file) as inF:
    head = inF.readline()
    line = inF.readline()
    #print(head)
    #print(line)
    gained = 0
    loss = 0
    total = 0
    while line:
        total += 1
        sl = line.split()
        pval = float(sl[8])
        expected = float(sl[7])
        observed = float(sl[6])

        if pval <= 0.0001:
            if observed > expected:
                gained += 1
            else:
                loss += 1
        line = inF.readline()

print(gained)
print(loss)
print(total)

268504
167885
2081455

jaamarks commented 5 years ago

TO-DO: Run STRUCTURE analysis with only the 10K subset from the filtered SNPs. Post the triangle plots.

Note I am having issues with getting the triangle plots generated. I have executed the script to submit the job to the scheduler, however the job is not being finished. Looking at the log files I see that it is executing up to:

Finished initialization; starting MCMC
1000 iterations + 1000 burnin

--------------------------------------------
Proportion of membership of each pre-defined
 population in each of the 3 clusters

Given    Inferred Clusters       Number of
 Pop       1      2      3      Individuals

  1:     0.967  0.001  0.032      661
  2:     0.000  1.000  0.000      504
  3:     0.000  0.000  1.000      503
  5:     0.192  0.079  0.729      715
--------------------------------------------

 Rep#:   Lambda   Alpha      F1    F2    F3     D1,2  D1,3  D2,3    Ln Like
  100:    1.00    0.603    0.113 0.274 0.170    0.043 0.035 0.023   --
  200:    1.00    0.405    0.119 0.279 0.170    0.043 0.035 0.023   --
  300:    1.00    0.346    0.116 0.277 0.176    0.043 0.035 0.023   --
  400:    1.00    0.329    0.118 0.278 0.173    0.043 0.035 0.023   --
  500:    1.00    0.325    0.124 0.277 0.173    0.043 0.035 0.023   --
  600:    1.00    0.300    0.116 0.275 0.175    0.043 0.035 0.023   --
  700:    1.00    0.314    0.116 0.281 0.173    0.043 0.035 0.023   --

But not getting past this point. So, there are a couple things to note. Firstly, in this log file you will see that it is noticing the population 1,2,3, & 5. It is skipping 4. The reason for this is because of a small blunder when creating the input file. I went back and created the input file again from the ped/map files before submitting the job to the scheduler again. The other thing to note is that when I look in the input file, I see for the UHS4 subjects there are many entries with -9—after an brief exploration of the data, the last four subjects were missing ~25% of the genotype entries. Now note that these variants I am using for the STRUCTURE plot are a 10K subset from the SNPs which were I identified for removal due to missing call rate >= 10% as well as HWE filter. So with all of this in mind, it may be that the high number of missing genotype entries is what is causing the STRUCTURE issues. I may be that I have to resample from the filtered SNPs in an attempt to get 10K sample with fewer missing genotype entries. I might even have to code so that the sample only contains SNPs in the 10K subset such that no more than 5% of the SNPs have missing information. STRUCTURE assert that 10K SNPs is a large enough number of genotype calls for reasonable results. Well, my samples only 75% of that 10K (~7,500) so that I likely won't get reasonable results. This might what is causing the issues.

jaamarks commented 5 years ago

The UHS4 data were genotyped again because of errors at Rutgers. I have downloaded the new data and uploaded it to S3 at:

s3://rti-heroin/uhs4/data/genotype/observed/20190214/unprocessed

jaamarks commented 5 years ago

Results from STRUCTURE analysis of UHS4 genotype data. These data were analyzed using the 1000G reference panel.

Pre-filtered (1,938 subjects)

Filtered

Ancestry	Subject Count	Retainment Thresholds
EA	896	(AFR < 25%) ∧ (EAS < 25%)
AA	603	(AFR > 25%) ∧ (EAS < 25%)
HA	430	(AFR < 25%) ∧ (EAS > 25%)

total: 1929

Subjects not assigned to an ancestry group:

num ID.y pop cluster1(AFR) cluster2(EAS) cluster3(EUR)
1683 AS88-2240_8002023110_HHG10458_11_G02 UHS4 0.377 0.548 0.075
1706 AS87-4287_8002023687_HHG10370_11_F05 UHS4 0.013 0.25 0.737
2045 AS00-00631_8002022318_HHG10080_15_D03 UHS4 0.173 0.25 0.576
2141 AS99-11673_8002688076_HHG5163_16_E03 UHS4 0.414 0.429 0.157
2352 AS97-02108_8002688748_HHG3236_18_B07 UHS4 0.341 0.431 0.228
2793 AS95-04523_8002023388_HHG9427_23_H02 UHS4 0.195 0.25 0.555
2844 AS97-08067_8002025595_HHG9707_23_C09 UHS4 0.092 0.25 0.658
2888 AS88-5289_8002023160_HHG9328_24_H02 UHS4 0.128 0.25 0.622
3146 AS98-12324_8002022889_HHG9214_26_D11 UHS4 0.069 0.25 0.68

total: 9

jaamarks commented 5 years ago

The above processing was done on the UHS4 smokescreen set of data. What we actually want is to merge the two sets of UHS4 data, smokescreen & heroin.

jaamarks commented 5 years ago

Upon processing the two datasets, I found that the smokescreen group is missing 32 samples in the genotype report that were listed in the QC report. Those samples are:

AS00-02677_8002022296_HHG10094_14_F11
AS00-02681_8002022332_HHG10097_14_G11
AS00-09056_8002022964_HHG10158_14_H11
AS00-09166_8002022941_HHG10164_14_A12
AS01-02377_8002022385_HHG10230_14_B12
AS01-05575_8002022411_HHG10248_14_A11
AS01-07353_8002022353_HHG10259_14_B11
AS01-07372_8002022365_HHG10260_14_C11
AS01-09148_8002022427_HHG10281_14_D11
AS01-12312_8002022345_HHG10290_14_C12
AS87-2834_8002022407_HHG10311_14_F08
AS87-2872_8002022384_HHG10317_14_E12
AS87-3038_8002023682_HHG10330_14_C08
AS87-3080_8002023718_HHG10333_14_D08
AS87-3151_8002023766_HHG10337_14_E08
AS87-4250_8002023744_HHG10351_14_G08
AS87-4269_8002023686_HHG10362_14_A08
AS87-4296_8002023723_HHG10373_14_H08
AS88-2263_8002023134_HHG10460_14_F12
AS88-2269_8002023146_HHG10461_14_B08
AS90-2791_8002022840_HHG10028_14_D12
AS90-3680_8002022277_HHG10037_14_G12
AS98-08783_8002687449_HHG4707_17_A03
AS98-09858_8002687473_HHG4709_17_B03
AS98-09866_8002687497_HHG4711_17_C03
AS98-09868_8002687509_HHG4712_17_D03
AS98-09926_8002687462_HHG4717_17_E03
AS98-09931_8002687486_HHG4719_17_G03
AS99-02289_8002689777_HHG4380_17_F03
AS99-02293_8002689789_HHG4381_17_H03
AS99-10832_8002022326_HHG10049_14_E11

This was known beforehand though, because the folks at Rutgers emailed us about this.

From: Christian Bixby <bixby@dls.rutgers.edu>
Date: Wednesday, February 13, 2019 at 2:09 PM
To: Eric Johnson <ejohnson@rti.org>
Subject: RE: Delay in data delivery

Good Afternoon Eric,

Your Smokescreen data is now ready.

I was able to save 16 of the 48 samples that had chip scan failures.

There are still 32 samples that failed due to these chip failures.

Rescans usually salvage these, but it didn’t work for these.

We are going to repeat these samples from the beginning and send you the data once completed.  It will take a couple of weeks most likely.

jaamarks commented 5 years ago

Results from the UHS4 merged heroin & smokescreen data sets.

Pre-filtered (3469 subjects)

Filtered

Ancestry	Subject Count	Retainment Thresholds
EA	1,395	(AFR < 25%) ∧ (EAS < 25%)
AA	1650	(AFR > 25%) ∧ (EAS < 25%)
HA	412	(AFR < 25%) ∧ (EAS > 25%)

total assigned: 3,457

12 Subjects not assigned to an ancestry group:

1767    AS00-05162_8002022286_HHG10109_13_D08_AS00-05162_8002022286_HHG10109_13_D08     UHS4    0.059   0.25    0.691
1803    AS00-06421_8002022299_HHG10118_11_G10_AS00-06421_8002022299_HHG10118_11_G10     UHS4    0.038   0.25    0.711
2239    AS01-08826_8002022414_HHG10272_12_E08_AS01-08826_8002022414_HHG10272_12_E08     UHS4    0.451   0.253   0.296
2857    AS88-0734_8002220848_HHG5989_31_F10_AS88-0734_8002220848_HHG5989_31_F10 UHS4    0.445   0.442   0.113
2930    AS88-2240_8002023110_HHG10458_11_G02_AS88-2240_8002023110_HHG10458_11_G02       UHS4    0.453   0.515   0.031
3332    AS90-7347_8002025180_HHG9839_21_A09_AS90-7347_8002025180_HHG9839_21_A09 UHS4    0.071   0.25    0.679
3521    AS92-3984_8002023083_HHG9482_24_D09_AS92-3984_8002023083_HHG9482_24_D09 UHS4    0.46    0.26    0.281
3754    AS93-5421_8002023334_HHG9375_26_C04_AS93-5421_8002023334_HHG9375_26_C04 UHS4    0.031   0.25    0.719
4521    AS97-02108_8002688748_HHG3236_18_B07_AS97-02108_8002688748_HHG3236_18_B07       UHS4    0.326   0.444   0.23
4595    AS97-08174_8002025537_HHG9718_24_B08_AS97-08174_8002025537_HHG9718_24_B08       UHS4    0.05    0.25    0.7
4599    AS97-09289_8002025585_HHG9722_30_D04_AS97-09289_8002025585_HHG9722_30_D04       UHS4    0.195   0.25    0.556
5067    AS99-11673_8002688076_HHG5163_16_E03_AS99-11673_8002688076_HHG5163_16_E03       UHS4    0.4     0.478   0.123

Given that there are many fewer subjects assigned to the HAs than we were expecting, are there different thresholds I should apply to the data so that more of the EA classified subjects will be classified as HA for the STRUCTURE analysis?

jaamarks commented 5 years ago

phenotype data information

Bryan Q. pointed me to the UHS4 phenotype data at the S3 location: s3://rti-heroin/hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv.gz

jaamarks commented 5 years ago

All but six of the genotype IDs I was able to match to the phenotype file cell_line column. Those IDs were:

genotype_id     phenotype_column        ancestry_selfreport
AS00-06677_8002022926_HHG10139_37_G08   serum   3
AS01-14267_8002022359_HHG10308_36_F12   serum   3
AS87-4277_8002690150_HHG1604_6_D09      serum   1
AS88-5278_8002690653_HHG1793_3_F06      serum   2
AS93-1569_8002690655_HHG1828_3_G07      gwasserum       2
AS93-1876_8002690620_HHG1842_3_A08      serum   1

jaamarks commented 5 years ago

Code

Perl code that might come in handy later for printing the AS# followed by the cell_line ID when given the entire sample ID:

perl -lne '/(AS.+)(_800.+)(HH.+)(_\d*_)/; print $1."\t".$3' ids_kept_N3457_clean.txt > genotype.ids.AS.HH

jaamarks commented 5 years ago

ha plot_uhs4

jaamarks commented 5 years ago

Self-report AA, then EA, followed by HA.

aa plot_uhs4 ea plot_uhs4 ha plot_uhs4

jaamarks commented 5 years ago

STRUCTURE analysis triangle plots: UHS against 1000G

UHS1

EA

AA

UHS2

EA

AA

UHS3

V1-2

AA

EA

V1-3

AA

EA

jaamarks commented 5 years ago

With liberal thresholding, we classify 871 subjects as HA (812 of those were self-report HA).
With traditional thresholding, we classify 418 subjects as HA (409 of those were self-report HA).

Table summarizing additional HIV status variables as well as viral load for the classified HAs. These results corroborate that the HIV case sample size is indeed small.

Subject Count	Retainment Thresholds	hiv	gwashiv	hivstat	hiv_status	viralload_cperml.x	viralload_cperml.y
418	(AFR < 25%) ∧ (EAS > 25%)	23	23	23	23	23	23
871	(AFR < 25%) ∧ (EAS >= 8%)	71	0	71	71	70	66

jaamarks commented 5 years ago

UHS4 HA GWAS model: viralload_log10.y ~ SNP + aage + gwassex + syear_rec + class

ID	CHROM	POS	REF	ALT	N_INFORMATIVE	AF	INFORMATIVE_ALT_AC	CALL_RATE	HWE_PVALUE	N_REF	N_HET	N_ALT	U_STAT	SQRT_V_STAT	ALT_EFFSIZE	PVALUE
rs146647111:31322224:A:AC	6	31322224	A	AC	40	0.03075	2.46	1	1	38	2	0	-2.44565	1.63315	-0.916945	0.134262
rs1131446:31323116:C:T	6	31323116	C	T	40	0.0385125	3.081	1	1	37	3	0	-0.632343	1.95446	-0.165539	0.746287
rs7047677:23084922:G:C	9	23084922	G	C	40	0.898075	71.846	1	0.320825	1	6	33	-1.5175	3.01976	-0.166412	0.615299
rs143120449:28812768:G:T	11	28812768	G	T	40	0.028425	2.274	1	1	38	2	0	-0.610624	1.60853	-0.236003	0.70423
rs73080309:25781909:G:A	12	25781909	G	A	40	0.0624875	4.999	1	1	35	5	0	0.247328	2.38132	0.0436152	0.917279
rs7308030:116244327:T:C	12	116244327	T	C	40	0.112512	9.001	1	0.396289	32	7	1	2.91931	3.59525	0.225851	0.416797

uhs4 ha 1000G viral-load maf_gt_0 01 rsq_gt_0 3 assoc plot all_chr snps+indels manhattan

Note that there were only 40 subjects in this analysis.

Filtering Criterion	Subjects Removed	Total
HA classified subjects with viral load data	0	65
Missing genotype data	22	43
Missing missing age or sex	3	40

jaamarks commented 5 years ago

We are investigating why so many subjects were removed due to IBS filtering. Using the standard thresholds for the STRUCTURE analysis, we remove too many subjects with the IBS filter. See table below.

Ancestry	Subject Count	Retainment Thresholds
EA	1,650	(AFR < 25%) ∧ (EAS < 25%)
AA	1,395	(AFR > 25%) ∧ (EAS < 25%)

EA

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
Merge autosomes and chrX	0	2,056,398	0	1,650
Remove subjects with IBD > 0.4 or IBS > 0.9	0	2,056,398	886	764

AA

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
Merge autosomes and chrX	0	1,669,722	0	1,395
Remove subjects with IBD > 0.4 or IBS > 0.9	0	1,669,722	116	1,279

We will investigate:

[x] Double check that the threshold used for the IBS filtering was indeed 0.9
[x] make sure there are no duplicated IDs in the genotype data
[ ] map 886 EA classified subjects to the phenotype file. EOJ will provide the variables to map to
```
pmmn
personid
presid
date_acq
gwasdate_acq
gwaspresid
hivstat
hiv_status
wave.x
wave.y
ind_id
```
[ ] Look at the distribution of the IBS values
[ ] compare IDs of old vs new
- old genotype data no longer available. was deleted after learning that those data were incorrect and after receiving the new data. Did this in an attempt to save money buy storing data that was problematic

Note that I am not quite sure what information mapping the 886 EA classified subjects to the phenotype file will give us.

jaamarks commented 5 years ago

I repeated some of the the QC steps for UHS4—from post-STRUCTURE through IBS/IBD filtering—in an attempt to recapitulate the tables from above. The results were the same, however I discovered that the earlier posted table for the EAs had some entry errors. I address these entry errors in the tables below to more accurately reflect the QC statistics results.

EA

Autosome statistics

This table includes filtering statistics prior to merging with chrX.

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
subject filter w/initial procedures (all chr)	0	2,367,429	0	3,469
Partitioning to only autosomes	55,829	2,311,600	0	3,469
Remove subjects missing whole autosome data	0	2,311,600	0	3,469
Add + reassigned subjects based on STRUCTURE	0	2,311,600	802	1,650
Remove variants with missing call rate > 10%	246,846	2,064,754	0	1,650
Remove variants with HWE p < 0.0001	44,918	2,019,836	0	1,650
Remove subjects with missing call rate > 10%	0	2,019,836	78	1,572
Remove subjects with IBS > 0.9	0	2,019,836	886	686
Remove subjects with IBD > 0.4	0	2,019,836	0	686

AA

Autosome statistics

This table includes filtering statistics prior to merging with chrX.

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
subject filter w/initial procedures (all chr)	0	2,367,429	0	3,469
Partitioning to only autosomes	55,829	2,311,600	0	3,469
Remove subjects missing whole autosome data	0	2,311,600	0	3,469
Add + reassigned subjects based on STRUCTURE	0	2,311,600	2,074	1,395
Remove variants with missing call rate > 10%	416,578	1,895,022	0	1,395
Remove variants with HWE p < 0.0001	252,879	1,642,143	0	1,395
Remove subjects with missing call rate > 10%	0	1,669,722	149	1,246
Remove subjects with IBS > 0.9	0	1,669,722	116	1,130
Remove subjects with IBD > 0.4	0	1,669,722	0	1,130
Remove subjects with KING > 0.177	0	1,669,722	0	1,130

HA

Autosome statistics

This table includes filtering statistics prior to merging with chrX.

QC procedure	Variants removed	Variants retained	Subjects removed	Subjects retained
subject filter w/initial procedures (all chr)	0	2,367,429	0	3,469
Partitioning to only autosomes	55,829	2,311,600	0	3,469
Remove subjects missing whole autosome data	0	2,311,600	0	3,469
Add + reassigned subjects based on STRUCTURE	0	2,311,600	3,051	418
Remove variants with missing call rate > 10%	276,447	2,035,153	0	418
Remove variants with HWE p < 0.0001	12,330	2,022,823	0	418
Remove subjects with missing call rate > 10%	0	2,022,823	12	406
Remove subjects with IBS > 0.9	0	2,022,823	152	254
Remove subjects with IBD > 0.4	0	2,022,823	0	254

jaamarks commented 5 years ago

We will proceed with UHS1.2.3. imputation + GWAS, since the UHS4 data seem to be corrupt in some way—what with the high number of IBS filtered subjects and other anomalous characteristics.

jaamarks / jaamarks_notebooks

UHS4 data: QC + Imputation + GWAS #5

Pre-filtered (1,517 subjects)

Filtered

Variant Call Rate Filtering

IBS Filtering

Pre-ancestry partitioning quality control sample tracking

African Ancestry

Autosome statistics

ChrX statistics

Merged autosome and chrX statistics

European Ancestry

Autosome statistics

ChrX statistics

Merged autosome and chrX statistics

EA

10% Call Rate Filter

HWE Filter

AA

10% Call Rate Filter

HWE Filter

EA

AA

Pre-filtered (1,938 subjects)

Filtered

Pre-filtered (3469 subjects)

Filtered

phenotype data information

Code

UHS1

EA

AA

UHS2

EA

AA

UHS3

V1-2

AA

EA

V1-3

AA

EA

EA

AA

EA

Autosome statistics

AA

Autosome statistics

HA

Autosome statistics