Closed: ceneg closed this issue 7 years ago
That's strange. Any chance you could provide a small test case to reproduce and debug the problem?
I extracted a subset of the SNP-containing VCF file with tabix and then manually transferred the file header from the original VCF file (editing the scaffold information appropriately). I hope this will be good enough.
I tested it with bcftools stats and I get the same results as with the original file (everything seems fine until the PSC part of the file). I am attaching the vchk result as well.
Thank you for the data.
The current logic of the program is to include only diploid genotypes in the n*Hom and n*Het counters, but the data is all haploid. Also, all the genotypes are reference, so the transition and transversion counts are zero as well.
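To illustrate the behaviour described above, here is a minimal Python sketch of a counter that only accepts diploid genotypes. It is not the bcftools source code; the function names and the labels "RefHom", "NonRefHom" and "Het" are made up for this example. It only shows why purely haploid GT values leave every Hom/Het counter at zero.

```python
import re

def classify_diploid(gt):
    """Return 'RefHom', 'NonRefHom' or 'Het' for a called diploid GT, else None."""
    alleles = re.split(r"[/|]", gt)
    if len(alleles) != 2 or "." in alleles:
        return None  # haploid, polyploid or missing: skipped in this sketch
    first, second = alleles
    if first != second:
        return "Het"
    return "RefHom" if first == "0" else "NonRefHom"

def count_diploid(gts):
    """Tally diploid genotype classes; non-diploid calls are ignored."""
    counts = {"RefHom": 0, "NonRefHom": 0, "Het": 0}
    for gt in gts:
        kind = classify_diploid(gt)
        if kind is not None:
            counts[kind] += 1
    return counts

# Hypothetical genotype strings, one per site, for a single sample.
haploid_counts = count_diploid(["0", "0", "1", "."])
diploid_counts = count_diploid(["0/0", "0|1", "1/1", "./."])
print(haploid_counts)  # every counter stays at zero
print(diploid_counts)
```

With haploid input every call is skipped, which matches the all-zero per-sample rows reported in this issue.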
Should haploid genotypes be counted in the n*Hom bins? I am not sure.
Note to ourselves/developers: If this should be modified/extended, consider also https://github.com/samtools/bcftools/issues/316
Thank you for the quick response!!
The data is haploid, yes, because the organism (fungus) is haploid as well.
What I want to achieve here is simply to get the most basic statistics, e.g. the total number of SNPs per sample (re-sequenced genome) in the VCF file. I thought "bcftools stats" would be a good tool to use for this. Does this not make sense biologically in some way I cannot see?
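For reference, the per-sample count asked for here can be sketched in plain Python, independent of bcftools. This is a rough sketch under several assumptions: the VCF is uncompressed text, every record is a SNP, and GT is present in FORMAT. The sample names, the records and the function name count_alt_sites_per_sample are invented for illustration.

```python
import re

# Tiny made-up haploid VCF with two samples (tab-separated columns).
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2
scaffold_1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t1\t0
scaffold_1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0\t.
scaffold_1\t300\t.\tG\tA,T\t50\tPASS\t.\tGT\t2\t1
"""

def count_alt_sites_per_sample(vcf_text):
    """Count, per sample, the sites where the genotype carries a non-reference allele."""
    samples = []
    counts = {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            counts = {sample: 0 for sample in samples}
            continue
        gt_index = fields[8].split(":").index("GT")
        for sample, call in zip(samples, fields[9:]):
            alleles = re.split(r"[/|]", call.split(":")[gt_index])
            # Any allele other than missing (".") or reference ("0") counts once per site.
            if any(allele not in (".", "0") for allele in alleles):
                counts[sample] += 1
    return counts

per_sample = count_alt_sites_per_sample(VCF_TEXT)
print(per_sample)
```

Because the check works allele by allele, the same loop handles haploid and diploid calls; it does not distinguish SNPs from indels.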
It does make biological sense and it should be supported.
EDIT: I just remembered the reason for not including haploid genotypes in the Hom bins: we want to know the het vs. hom rate in diploid genomes. So if per-sample haploid genotypes are to be counted, new columns should be created for that purpose.
That makes sense. I am aware that working with haploid organisms is a special case (but it's not really uncommon, either).
Thank you for your support! In the meantime, as a workaround, I'll try splitting the VCF into per-sample VCFs (bcftools view -Oz -s "SampleID" file.vcf.gz > sampleID.vcf.gz) and then running bcftools stats on each of them. Edit: the workaround does not work, because one still gets the SNP count for the whole cohort even when analysing the per-sample VCFs.
I just pushed a commit which adds the haploid counts to the output. Please let me know in case of problems. Thanks for reporting the issue.
I tested the commit on my dataset and it seems to work as expected.
I get two new columns in the vchk file: nHapRef and nHapAlt. Can you just confirm that I understand this correctly? Is nHapRef the count of reference variant positions covered by the sample and nHapAlt the count of positions with an alternate allele (compared to the reference)?
Thanks again!
Yes, there "Ref" stands for the reference allele (GT=0) and "Alt" for the alternate allele (GT=1 or 2, ...).
Perfect, thanks!
I have a problem outputting statistics from a cohort GATK VCF file (prepared following best practices, joint calling, haploid organism).
The command I am using is:
bcftools stats -s- SNPs.vcf.gz > output.vchk
Everything is processed as expected until the per-sample counts part of the output. This contains the expected values for the sample ID and the depth, but all other values are 0 for all samples. Such as:
The same happens with the file containing INDELs.
bcftools query -l file.vcf
correctly lists all the samples in the VCF file.
I tried the same commands with the single-sample VCF file generated with GATK and the problem is the same. Am I missing something obvious?