vcftools / vcftools

A set of tools written in Perl and C++ for working with VCF files, such as those generated by the 1000 Genomes Project.
https://vcftools.github.io/
GNU Lesser General Public License v3.0
500 stars 148 forks source link

how to calculate 'N_VARIANTS' in windowed.pi results #85

Open conniecl opened 7 years ago

conniecl commented 7 years ago

Hi, While using vcftools for window-pi analysis, I found in the output, the number of N_VAEIANTS is different with my input VCF files. And here is my command: vcftools_0.1.13/bin/vcftools --vcf chr9.hmp3.vcf --window-pi 20000 --window-pi-step 2000 --out ./chr9.hmp3.20k And part of the output:

CHROM   BIN_START       BIN_END N_VARIANTS      PI
9       1       20000   446     0.00182799
9       2001    22000   495     0.00194225
9       4001    24000   548     0.00243987
9       6001    26000   568     0.00258492

in BIN 1-20000, the input contain 474 snp actually, instead 446 in the output file. I also check if it just use bi-allele, or filer some snp by missing rate, but can not find a rule about it. How the vcftools calculate the N_VARIANTS? Sincerely hope can get some advise, and thanks in advance.

jeff-sc-chu commented 3 years ago

I'm having the same issue. After looking at the code, here is what I learned:

unsigned int N_site_mismatches = 0;
for (vector<int>::iterator ac = allele_counts.begin(); ac != allele_counts.end(); ++ac)
{
    N_site_mismatches += (*ac * (N_non_missing_chr - *ac));
}
if (N_site_mismatches == 0)
    continue;   // Site is actually fixed.

If a site has a fixed allele, it is disregarded. A site is defined by allele count (AC) and non missing samples (not ./.). A quick way to check is to see if you have sites where AC = 2 * number of samples. (assuming none of the samples have missing data).