I have some bacterial NGS reads as well as assemblies. I used two methods to call variants:
mapping-based: I ran snippy on the clean reads against a reference genome to call variants for each sample, then used bcftools merge to combine the variants from multiple samples.
assembly-based: I assembled each sample using shovill, annotated the assemblies using prokka, got pan-genome results using Panaroo, and generated a recombination-free core-gene alignment using ClonalFrameML. Then I extracted the core-genome variants with snp-sites:
$ snp-sites -v -o ours_core_variations.vcf ../PGout_panaroo/core_gene_alignment_filtered.aln
Here are the statistics for the two VCFs:
# Merged VCF from the per-sample snippy calls
1. $ bcftools stats ../Ours_vcf_merged.vcf.gz | grep SN
# SN, Summary numbers:
# number of SNPs .. number of rows with a SNP
# number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
# counter. For example, a row with a SNP and an indel increments both the SNP and
# SN [2]id [3]key [4]value
SN 0 number of samples: 196
SN 0 number of records: 2292
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 2053
SN 0 number of MNPs: 47
SN 0 number of indels: 180
SN 0 number of others: 13
SN 0 number of multiallelic sites: 13
SN 0 number of multiallelic SNP sites: 1
# VCF from snp-sites
2. $ bcftools stats ours_core_variations.vcf | grep SN
# SN, Summary numbers:
# number of SNPs .. number of rows with a SNP
# number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
# counter. For example, a row with a SNP and an indel increments both the SNP and
# SN [2]id [3]key [4]value
SN 0 number of samples: 196
SN 0 number of records: 5907
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 5907
SN 0 number of MNPs: 0
SN 0 number of indels: 0
SN 0 number of others: 0
SN 0 number of multiallelic sites: 3408
SN 0 number of multiallelic SNP sites: 34
There is a huge difference between the two call sets, both in the total number of variant records (2292 vs. 5907) and in the number of SNPs (2053 vs. 5907). Did I do something wrong? Could you please help me figure out where the difference comes from?
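To make the comparison easier to read, here is a minimal Python sketch that lines up the two summaries side by side. It only parses the SN rows pasted above; the helper names `parse_sn` and `compare` are hypothetical, and it does not call bcftools itself. It assumes each SN row has the layout shown in the output above (`SN`, an id, a key ending in a colon, then an integer), separated by tabs or spaces.

```python
import re

# Matches rows such as "SN 0 number of SNPs: 2053" (tab- or space-separated).
SN_ROW = re.compile(r"^SN\s+\d+\s+(.+?):\s+(\d+)\s*$")

def parse_sn(text):
    """Return {key: count} for every SN row; comment lines are ignored."""
    counts = {}
    for line in text.splitlines():
        match = SN_ROW.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts

def compare(first, second):
    """Return second minus first for every key present in both summaries."""
    return {key: second[key] - first[key] for key in first if key in second}

# Counts copied from the two `bcftools stats ... | grep SN` outputs above.
SNIPPY_SN = """\
SN 0 number of samples: 196
SN 0 number of records: 2292
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 2053
SN 0 number of MNPs: 47
SN 0 number of indels: 180
SN 0 number of others: 13
SN 0 number of multiallelic sites: 13
SN 0 number of multiallelic SNP sites: 1
"""

SNP_SITES_SN = """\
SN 0 number of samples: 196
SN 0 number of records: 5907
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 5907
SN 0 number of MNPs: 0
SN 0 number of indels: 0
SN 0 number of others: 0
SN 0 number of multiallelic sites: 3408
SN 0 number of multiallelic SNP sites: 34
"""

snippy = parse_sn(SNIPPY_SN)
snp_sites = parse_sn(SNP_SITES_SN)
diff = compare(snippy, snp_sites)

# Print one row per key: snippy count, snp-sites count, difference.
for key in snippy:
    print(f"{key:<40}{snippy[key]:>8}{snp_sites[key]:>8}{diff[key]:>8}")
```

The same two functions should work on real files by passing the text of `bcftools stats` output to `parse_sn` instead of the pasted strings.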