speciationgenomics / scripts

Scripts for analysis used during the course
78 stars 61 forks source link

Altering axes for checkHetsIndvVCF.sh #5

Closed sjfleck closed 3 years ago

sjfleck commented 4 years ago

When I use checkHetsIndvVCF.sh, my plots look similar to the examples on this tutorial (https://speciationgenomics.github.io/allelicBalance/). The only major difference is that I have some background noise in low genotype count that is expanding my x and y axis much further than where the majority of my data is clumped. I tried altering the script myself to limit the x and y axes on the left graph to get a better look at the majority of my data, but I'm not skilled in R and only came up with errors. How would I be able to alter my script so I could limit my x and Y axes to zoom in on my data? Thank you for any help you can provide

sjfleck commented 4 years ago

I want to make sure that I'm using your tool properly. I'm using a VCF file that contains data for 9 plants all from the same small geographical region and within the same genus. All individuals were mapped against the same reference also within the genus (MaSuRCA assembly, ~93% complete BUSCOs) and I used MarkDuplicates to remove PCR duplicates. I used GATK4 Haplotype Caller, followed by GenomicsDBimport, genotypeGVCFs, and gatherGVCFs to create the vcf file I'm using now. My ultimate goal is to get this data into TreeMix to look at admixture events between the populations. I'm relatively new to doing research with population structure, but I felt like your script would be useful for a quality control on my data.

I also want to mention that it seems like I have some allelic unbalance, so I used vcftools to extract a vcf file with only singletons and inputted that file into checkHetsIndvVCF.sh as well. I read that after doing this, "if the sites with highly unbalanced read counts disappear, this is evidence for PCR errors and against cross-contamination." Plotting only singletons did not make the patter disappear. Does this mean that my file contains contamination?

My collaborators have already done one round of filtering out untrustworthy contigs from the reference using Diamond (a BLASTing tool). My group decided on the less stringent filter, but if I have significant contamination (which I believe I do to varying degrees), I will propose to them that we need to go for the more stringent filter for choosing the contig subset for our reference. Thank you again for any insight you can provide. I'm attaching the checkHetsIndvVCF.sh outputs for a subsample of my VCF as well as a VCF with only singletons included.

checkHetsIndvVCF.sh (random subsample of VCF): Calophyllum_cGVCFs_subset.hetIndStats.pdf

checkHetsIndvVCF.sh (singletons only): Calophyllum_cGVCFs.mac1.hetIndStats.pdf

joanam commented 3 years ago

Hi sjfleck

Apologies for the late reply, I seem to have missed this issue. Judging from your plots it seems that you have mostly PCR errors leading to a second peak of heterozygote genotypes with highly unbalanced read proportions among your singletons. You could check if the second peak disappears if you only use SNPs with higher allele frequency. If that is the case, you can be sure that it is not contamination. For your further analyses, I would recommend that you remove singletons (e.g. ignore them if performing demographic modeling). For most analyses you will use a minor allele frequency cutoff anyways which will remove singletons. However, for phylogenetic tree inference such wrong singletons can lead to terminal branches and they may slightly inflate measures of genetic variance (e.g. pi). You may thus want to use my script to set heterozygote genotypes with bad allele balance to homozygotes for the major allele to remove that issue: https://github.com/joanam/scripts/blob/master/allelicBalance.py

It is a good idea to modify the checkHetsIndvVCF.sh script allowing the user to specify a custom maximum of the x axis. I have now added that as an optional parameter. To run it with the x axis extending only to 100 reads:

checkHetsIndvVCF.sh vcffile.vcf.gz 100

Best, Joana