@jessicaway @alimanfoo Can you guys help me with this?
Hi @podpearson, would you be able to help here?
Dear @Rohit-Satyam
I think you might be seeing different numbers of SNPs/indels from what we report in the README because bcftools might treat spanning deletion alleles (`*`) as SNPs rather than indels (see https://github.com/samtools/bcftools/issues/736). You could perhaps try first creating a "SNPs only" VCF using the suggestion in the above link (`-e'ALT="*" || type!="snp"'`) and then run bcftools stats on this new file to see if the numbers of SNPs match what we report in the README.
I don't tend to monitor this github repo closely, so for future queries, it might be better to email data@malariagen.net which I should see straight away.
HTH, Richard
Yes, you are right. I did as you suggested, running `bcftools stats -e'ALT="*" || type!="snp"' filter.vcf.gz`, and the number of SNPs is now 945,649.
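For readers wondering why the exclusion changes the count, here is a minimal sketch of the effect on a toy, headerless VCF body. The four records are hypothetical and are not taken from the PV4 data. The `awk` classification is a deliberately naive approximation for illustration, not the logic bcftools itself uses. It counts records carrying a spanning deletion allele (`*` in the ALT column) separately from records that are single-base substitutions with no `*` allele.

```shell
# Toy VCF body (hypothetical records, not from the PV4 dataset).
toy_vcf=$(mktemp)
printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\n'   >  "$toy_vcf"   # plain SNP
printf 'chr1\t200\t.\tC\t*\t.\tPASS\t.\n'   >> "$toy_vcf"   # spanning deletion only
printf 'chr1\t300\t.\tT\tC,*\t.\tPASS\t.\n' >> "$toy_vcf"   # SNP allele plus spanning deletion
printf 'chr1\t400\t.\tGA\tG\t.\tPASS\t.\n'  >> "$toy_vcf"   # ordinary deletion (indel)

# Records with at least one "*" allele in the ALT column (column 5).
star_records=$(awk -F'\t' '
  !/^#/ {
    n = split($5, alt, ",")
    for (i = 1; i <= n; i++)
      if (alt[i] == "*") { c++; break }
  }
  END { print c + 0 }
' "$toy_vcf")

# Records where REF and every ALT allele are one base long and no ALT is "*".
strict_snps=$(awk -F'\t' '
  !/^#/ {
    ok = (length($4) == 1)
    n = split($5, alt, ",")
    for (i = 1; i <= n; i++)
      if (alt[i] == "*" || length(alt[i]) != 1) ok = 0
    if (ok) c++
  }
  END { print c + 0 }
' "$toy_vcf")

echo "records with a * allele: $star_records"
echo "strict SNP records:      $strict_snps"
rm -f "$toy_vcf"
```

If a tool counts the two `*` records as SNPs, the SNP total for this toy file would be higher than the strict count, which is the kind of gap the `-e` expression above is meant to remove.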
Dear MalariaGen Team
Apologies if I am raising this question in the wrong repository. I tried to download the P. vivax variant data (PV4 Dataset) from https://www.malariagen.net/resource/30. I am interested in using this data for the Base Quality Score Recalibration (BQSR) step of the GATK germline variant calling pipeline. I merged the per-chromosome variant files and then filtered the merged variant file for high-confidence variants (SNPs and indels) as recommended in the README file. `bcftools stats` shows the correct total number of variants and samples. However, the number of SNPs reported by `bcftools stats` differs from the number given in the README. Can you help me understand why? I am also sharing the code used below.
Assembly used: PlasmoDB-50_PvivaxP01_Genome.fasta from PlasmoDB
gatk4-4.2.6.1-1
bcftools: Version: 1.10.2 (using htslib 1.10.2-3)
filter.bcftoolstats.txt