samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
634 stars 241 forks source link

Inconsistent number of SNPs from bcftools stats and plot-vcfstats #2167

Closed YifanWei1115 closed 2 months ago

YifanWei1115 commented 2 months ago

Hi there, I've encountered some results that confuse me, since I did not find any interpretation of the output of stats. I ran the command bcftools stats HG001_ZJ1.sorted.markdup.BQSR.vcf -s - > HG001_ZJ1.sorted.markdup.BQSR.stats the output file .stats shows as follows (I just listed part of them):

# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  1
SN  0   number of records:  9085058
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 6552678
SN  0   number of MNPs: 0
SN  0   number of indels:   2536999
SN  0   number of others:   1443
SN  0   number of multiallelic sites:   175734
SN  0   number of multiallelic SNP sites:   1589

And I ran the command plot-vcfstats HG001_ZJ1.sorted.markdup.BQSR.stats -p visual_stats

The number of SNPs for this sample in the output file (summary.pdf) was 3421056, which is different from one I get from the .stats file (number of SNPs: 6552678).

The source data used for plotting and the plot are as follows:

# [1]Sample ID  [2]ts/tv    [3]het/hom  [4]nSNPs    [5]nIndels  [6]Average depth    [7]nSingletons  [8]Sample name
0   2.037046    1.170186    3421056 910060  12.000000   4330276 HG001

image

Same situation with the number of indels (910060 from the plot v.s. 2536999 from .stats file):

image

I was wondering if you could tell me the difference between the two numbers of SNPs/InDels and why they were different.

By the way, I would appreciate it if you could add some simple interpretation of outputs for plot-stats in the documentation or elsewhere.

Thanks!

pd3 commented 2 months ago

Mmm, the summary stats gives the number of SNV and indel rows, but the per-sample numbers are actual genotypes. For example, a sample can have a homozygous reference genotype 0/0 in a SNV record.

If this does not resolve the problem, please attach the full stats file.