sanger-pathogens / snp-sites

Finds SNP sites from a multi-FASTA alignment file
http://sanger-pathogens.github.io/snp-sites/

Why does snp-sites report a different number of variants than snippy mapping? #113

Open Zjianglin opened 1 month ago

Zjianglin commented 1 month ago

Hi,

I have bacterial NGS reads as well as assemblies, and I used two methods to call variants:

  1. mapping-based: I ran snippy on the clean reads against the reference genome to call variants, then combined the per-sample calls from multiple samples with bcftools merge.
  2. assembly-based: I assembled each sample with shovill, annotated the assemblies with prokka, built the pan-genome with Panaroo, and generated a recombination-free core-gene alignment with ClonalFrameML. I then extracted the core-genome variants with snp-sites: snp-sites -v -o ours_core_variations.vcf ../PGout_panaroo/core_gene_alignment_filtered.aln
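
To make clear what I mean by a variable site in the assembly-based route, here is a minimal Python sketch of the idea. It is not snp-sites' actual implementation: the helper names `read_fasta` and `variable_columns` are hypothetical, and the rule that a column counts only when at least two distinct A/C/G/T bases appear (gaps and N ignored) is an assumption made for illustration.

```python
from typing import Dict, List


def read_fasta(text: str) -> Dict[str, str]:
    """Parse multi-FASTA text into {sequence name: sequence}."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in parts.items()}


def variable_columns(alignment: Dict[str, str]) -> List[int]:
    """Return 1-based alignment columns holding >= 2 distinct A/C/G/T bases.

    Assumption for this sketch: gaps and ambiguous characters are ignored
    when deciding whether a column is variable.
    """
    seqs = list(alignment.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences differ in length; not an alignment")
    sites = []
    for i in range(length):
        bases = {s[i] for s in seqs} & set("ACGT")
        if len(bases) >= 2:
            sites.append(i + 1)
    return sites


# Toy alignment (made up for illustration, not my real data).
toy = """>s1
ACGTAC
>s2
ACGTAT
>s3
A-GNAC
"""
sites = variable_columns(read_fasta(toy))
print(sites)
```

Note that the positions in this sketch are columns of the alignment, so they would not be directly comparable to reference-genome coordinates.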

Here are the statistics for the two VCFs:

# for the merged VCF from per-sample snippy calls
1. $ bcftools stats ../Ours_vcf_merged.vcf.gz | grep SN
# SN, Summary numbers:
#   number of SNPs      .. number of rows with a SNP
#   number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
#   counter. For example, a row with a SNP and an indel increments both the SNP and
# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  196
SN  0   number of records:  2292
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 2053
SN  0   number of MNPs: 47
SN  0   number of indels:   180
SN  0   number of others:   13
SN  0   number of multiallelic sites:   13
SN  0   number of multiallelic SNP sites:   1

# for the snp-sites results
2. $ bcftools stats ours_core_variations.vcf | grep SN
# SN, Summary numbers:
#   number of SNPs      .. number of rows with a SNP
#   number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
#   counter. For example, a row with a SNP and an indel increments both the SNP and
# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  196
SN  0   number of records:  5907
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 5907
SN  0   number of MNPs: 0
SN  0   number of indels:   0
SN  0   number of others:   0
SN  0   number of multiallelic sites:   3408
SN  0   number of multiallelic SNP sites:   34

There is a huge difference in the total number of variants (2292 vs. 5907) as well as in the number of SNPs (2053 vs. 5907). Did I do anything wrong? Could you please help me figure out what causes this?
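
For reference, the two summaries can be lined up with a small Python sketch like the one below. The function name `parse_sn` and the variable names are hypothetical; the numbers are copied from the `bcftools stats` output above, and the parsing assumes the `SN <id> <key>: <value>` row layout shown there.

```python
def parse_sn(text):
    """Collect 'SN' data rows from `bcftools stats` output into {key: int}."""
    values = {}
    for line in text.splitlines():
        if not line.startswith("SN"):
            continue  # skips blank lines and '# SN ...' comment lines
        fields = line.split(None, 2)  # 'SN', id, 'key:  value'
        if len(fields) < 3:
            continue
        key, _, value = fields[2].rpartition(":")
        values[key.strip()] = int(value)
    return values


# Values copied from the two outputs in this post.
snippy_sn = parse_sn("""
SN  0   number of samples:  196
SN  0   number of records:  2292
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 2053
SN  0   number of MNPs: 47
SN  0   number of indels:   180
SN  0   number of others:   13
SN  0   number of multiallelic sites:   13
SN  0   number of multiallelic SNP sites:   1
""")

snp_sites_sn = parse_sn("""
SN  0   number of samples:  196
SN  0   number of records:  5907
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 5907
SN  0   number of MNPs: 0
SN  0   number of indels:   0
SN  0   number of others:   0
SN  0   number of multiallelic sites:   3408
SN  0   number of multiallelic SNP sites:   34
""")

# Print snippy value, snp-sites value, and the signed difference per key.
for key in snippy_sn:
    a, b = snippy_sn[key], snp_sites_sn[key]
    print(f"{key:<34}{a:>6}{b:>6}{b - a:>+7}")

# Multiallelic sites that bcftools does not count as pure multiallelic SNP sites.
non_snp_multiallelic = (
    snp_sites_sn["number of multiallelic sites"]
    - snp_sites_sn["number of multiallelic SNP sites"]
)
print("snp-sites multiallelic sites not counted as SNP-only:", non_snp_multiallelic)
```

In the snp-sites VCF, 3408 - 34 = 3374 multiallelic sites are not counted as multiallelic SNP sites, so the ALT column of those records may be worth a closer look.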