malariagen / pipelines

Pipelines for processing malaria parasite and mosquito genome sequence data.
MIT License
14 stars 13 forks source link

Number of SNPs and Indels doesn't match with the numbers reported in README files #109

Closed Rohit-Satyam closed 1 year ago

Rohit-Satyam commented 1 year ago

Dear MalariaGen Team

Apologies, if I am raising this question in wrong repository. I tried to download P. vivax variant data (PV4 Dataset) from https://www.malariagen.net/resource/30. I am actually interested in using this data for Base Quality Score Recalibration (BQSR) step of GATK germline variant calling pipeline. I tried to merge the Per-chromosome variant files and then I filtered the merged variant file for high confidence variants (SNPs and Indels) as recommended in the README file. The bcftools stats shows correct number to total variants and samples. However, the Number of SNPs reported in bcftools stats report are different than what has been reported in README. Can you help me understand why is it so. I am also sharing the code used below

###############Bcftools #################### SN [2]id [3]key [4]value SN 0 number of samples: 1895 SN 0 number of records: 1303984 SN 0 number of no-ALTs: 0 SN 0 number of SNPs: 1117804 SN 0 number of MNPs: 0 SN 0 number of indels: 254148 SN 0 number of others: 0 SN 0 number of multiallelic sites: 331817 SN 0 number of multiallelic SNP sites: 33748

##############################README The VCF files contains details of 4,571,056 discovered variant genome positions. These variants were discovered amongst all samples from the release. 3,083,454 of these variant positions are SNPs, with the remainder being either short insertion/deletions (indels), or a combination of SNPs and indels. It is important to note that many of these variants are considered low quality. Only the variants for which the FILTER column is set to PASS should be considered of reasonable quality. There are 1,303,984 such PASS variants of which 945,649 are SNPs and 358,335 indels.

wget ftp://ngs.sanger.ac.uk/production/malaria/Resource/30/Pv4_vcf/*
ls -1 *.vcf.gz > vcf.list
#merging vcfs
picard GatherVcfs I=vcf.list O=merged.vcf.gz
gatk IndexFeatureFile -I merged.vcf.gz 

# using only high confidence variants for BQSR
gatk SelectVariants -R ../results/00_indexes/bwaidx/PlasmoDB-50_PvivaxP01_Genome.fasta -V merged.vcf.gz --exclude-filtered true -O filter.vcf.gz
## Complete output is attached below
bcftools stats  filter.vcf.gz > filter.bcftoolstats.txt 

Assembly used: PlasmoDB-50_PvivaxP01_Genome.fasta from PlasmoDB gatk4-4.2.6.1-1 bcftools: Version: 1.10.2 (using htslib 1.10.2-3) filter.bcftoolstats.txt

Rohit-Satyam commented 1 year ago

@jessicaway @alimanfoo Can you guys help me with this?

alimanfoo commented 1 year ago

Hi @podpearson, would you be able to help here?

podpearson commented 1 year ago

Dear @Rohit-Satyam

I think you might be seeing different numbers of SNPs/indels from what we report in the README because bcftools might treat spanning deletion alleles () as SNPs rather than indels (see https://github.com/samtools/bcftools/issues/736). You could perhaps try first creating a "SNPs only" VCF using the suggestion in the above link (`-e'ALT="" || type!="snp"' `) and then run bcftools stats on this new file to see if numbers of SNPs match what we report in the README.

I don't tend to monitor this github repo closely, so for future queries, it might be better to email data@malariagen.net which I should see straight away.

HTH, Richard

Rohit-Satyam commented 1 year ago

Yes, you are right. I did as you suggested using bcftools stats -e'ALT="*" || type!="snp"' filter.vcf.gz and the number of SNPs now are 945,649