sanger-pathogens / snp-sites

Finds SNP sites from a multi-FASTA alignment file
http://sanger-pathogens.github.io/snp-sites/
Other
232 stars 50 forks source link

snp-sites do not see do deletion events #94

Open bioinfornatics opened 3 years ago

bioinfornatics commented 3 years ago

Dear,

In order to validate snp-sitesin our lab we have compared the results with msa2vcf.jar

Thus we have took two close sequences from ncbi (ref NC_045512.2 and MT007544.1)

msa2vcf

$ java -jar dist/msa2vcf.jar --consensus 'NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome' --output ../sars_cov2.2.vcf ../sars_cov2.aln
[INFO][MsaToVcf]format : Fasta
$ cat ../sars_cov2.2.vcf
##fileformat=VCFv4.2
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##contig=<ID=chrUn,length=29903>
##msa2vcf.meta=compilation:20200728120720 githash:af51aa3 htsjdk:2.22.0 date:20200728122012 cmd:--consensus NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome --output ../sars_cov2.2.vcf ../sars_cov2.aln
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  MT007544.1 Severe acute respiratory syndrome coronavirus 2 isolate Australia/VIC01/2020, complete genome        NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
chrUn   19065   .       T       C       .       .       DP=2    GT:DP   1/1:1   0/0:1
chrUn   22303   .       T       G       .       .       DP=2    GT:DP   1/1:1   0/0:1
chrUn   26144   .       G       T       .       .       DP=2    GT:DP   1/1:1   0/0:1
chrUn   29749   .       ACGATCGAGTG     A       .       .       DP=2    GT:DP   1/1:1   0/0:1

snp-sites

$ snp-sites -c -v -o sars_cov2.vcf  sars_cov2.aln
$ cat sars_cov2.vcf
##fileformat=VCFv4.1
##contig=<ID=1,length=29903>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NC_045512.2     MT007544.1
1       19065   .       T       C       .       .       .       GT      0       1
1       22303   .       T       G       .       .       .       GT      0       1
1       26144   .       G       T       .       .       .       GT      0       1

Problem

snp-sites do not report the deletion

stitam commented 1 year ago

Hi,

I may have come across the same problem, I think snp-sites does not recognise the "-" character, for example:

seq1 ATGAAAATTGGATT seq2 ATGAAA-TTGGATT

snp-sites -v -o test.vcf test.aln

Here there is a deletion in position 7 in seq2, but snp-sites will not record it. Note, if I replace "-" with any other character e.g. "*", snp-sites will find the site. I am using the docker from staphb (https://hub.docker.com/r/staphb/snp-sites) which contains snp-sites v2.5.1.

I am wondering though, is this behaviour intended? Thanks.