Unable to get per-sample counts with "bcftools stats"

samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html

http://samtools.github.io/bcftools/

Other

652 stars 240 forks source link

Unable to get per-sample counts with "bcftools stats" #671

Closed ceneg closed 7 years ago

ceneg commented 7 years ago

I have a problem outputing statistics from a cohort GATK VCF file (prepared following best practices, joint calling, haploid organism).

The command I am using is: bcftools stats -s- SNPs.vcf.gz > output.vchk

Everything is processed as expected until the per-sample counts part of the output. This contains the expected values for the sample ID and the depth, but all other values are 0 for all samples. Such as:

# PSC, Per-sample counts
# PSC   [2]id   [3]sample   [4]nRefHom  [5]nNonRefHom   [6]nHets    [7]nTransitions [8]nTransversions   [9]nIndels  [10]average depth   [11]nSingletons
PSC 0   1   0   0   0   0   0   0   41.6    0

The same happens with the file containing INDELs

bcftools query -l file.vcf correctly lists all the samples in the vcf file

I tried the same commands with the single-sample vcf file generated with GATK and the problem is the same. Am I missing something obvious?

pd3 commented 7 years ago

That's strange. Any chance you could provide a small test case to reproduce and debug the problem?

ceneg commented 7 years ago

I extracted a subset of the SNPs containing VCF file with tabix and then manually transferred the file header from the original VCF file (editing the scaffold information appropriately). I hope this will be good enough?

I tested it with bcftools stats and I get the same results as with the original file (everything seems fine until the PSC part of the file). I am attaching the vchk result as well.

SNPsubset.zip

pd3 commented 7 years ago

Thank you for the data.

The current logic of the program is to include only diploid genotypes in the n*Hom and n*Het counters, but the data is all haploid. Also they are all reference, so transitions and transversions are zero as well.

Should haploid genotypes be counted in the n*Hom bins? I am not sure.

Note to ourselves/developers: If this should be modified/extended, consider also https://github.com/samtools/bcftools/issues/316

ceneg commented 7 years ago

Thank you for the quick response!!

The data is haploid, yes, because the organism (fungus) is haploid as well.

What I want to achieve here is simply get the simplest statistics, e.g. the total number of SNPs per each sample (re-sequenced genome) in the VCF file. I thought "bcftools stats" would be a good tool to use for this. Does this not make sense biologically in some way I cannot see?

pd3 commented 7 years ago

It does make a biological sense and it should be supported.

EDIT: Just remembered what was the reason for not including haploid genotypes in the Hom bins. We want to know what is the het vs hom rate in diploid genomes. So if per-sample haploid genotypes should be counted, new columns should be created for that purpose.

ceneg commented 7 years ago

That makes sense. I am aware that working with haploid organisms is a special case (but it's not really uncommon, either).

Thank you for your support! In the mean time I'll try as a workaround splitting the VCF into per-sample VCFs (bcftools view -Oz -s "SampleID" file.vcf.gz > sampleID.vcf.gz) and then running the bcftools stats on each of them. Edit: the workaround does not work, because one still gets the SNP count for the whole cohort even when analysing the per-sample VCFs.

pd3 commented 7 years ago

I just pushed a commit which adds the haploid counts to the output. Please let me know in case of problems. Thanks for reporting the issue.

ceneg commented 7 years ago

I tested the commit on my dataset and it seems to work as expected.

I get two new columns in the vchk file: nHapRef and nHapAlt. Can you just confirm if I understand this correctly? Is nHapRef the count of reference variant positions covered by the sample and nHapAlt the count of positions with an alternate alelle (compared to the reference)?

Thanks again!

pd3 commented 7 years ago

Yes, there "Ref" stands for the reference allele (GT=0) and "Alt" for the alternate allele (GT=1 or 2, ...).

ceneg commented 7 years ago

Perfect, thanks!