Closed sallycylau closed 3 years ago
That note was about how difficult it is to parallelize the calling code itself. You can safely split the calling into regions (e.g. chr1:1-1000000, chr1:1000001-2000000, etc.), run them separately, and merge the results afterwards; mpileup considers all reads overlapping the given region. This is how the program is routinely used.
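As a rough illustration of the splitting described above, here is a minimal Python sketch that turns contig lengths into non-overlapping 1-based regions and builds one `bcftools mpileup | bcftools call` command per region. The file names `ref.fa` and `bams.txt`, the window size, the contig lengths, and the helper names are all assumptions made for the example; the commands are only printed, not executed. Short contigs such as individual RAD loci simply end up as one region each.

```python
from typing import Dict, Iterator


def region_windows(contig_lengths: Dict[str, int], window: int = 1_000_000) -> Iterator[str]:
    """Yield 1-based, inclusive, non-overlapping regions such as 'chr1:1-1000000'."""
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, window):
            end = min(start + window - 1, length)
            yield f"{contig}:{start}-{end}"


def calling_command(region: str, ref: str = "ref.fa", bam_list: str = "bams.txt") -> str:
    """Build (but do not run) a per-region calling pipeline.

    ref and bam_list are hypothetical file names; adjust to your data.
    """
    out = region.replace(":", "_") + ".vcf.gz"
    return (
        f"bcftools mpileup -f {ref} -b {bam_list} -r {region} -Ou"
        f" | bcftools call -mv -Oz -o {out}"
    )


# Hypothetical contig lengths: one chromosome and one short RAD locus.
regions = list(region_windows({"chr1": 2_500_000, "RADloc_001": 300}))
for region in regions:
    print(calling_command(region))
```

Each printed command can be submitted as its own job. Once all jobs finish, the per-region files can be combined in order, for example with `bcftools concat`.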
Hi there
I am currently calling SNPs in ~400 samples using bcftools mpileup | bcftools call, but it is running quite slowly. I have seen that others have tried to speed up the process by running mpileup/call on separate regions across multiple jobs.
However, I am working with target capture sequencing of RAD loci, meaning instead of mapping back my reads to a genome, I mapped my reads back to the consensus sequences of known RAD loci (but need further downstream filtering to detect and avoid linkage between RAD loci).
e.g. the CHROM column of my output VCF will contain: RADloc_001, RADloc_002, RADloc_003
I came across this old samtools thread (https://github.com/samtools/samtools/issues/480), and one of the comments says while dividing the mpileup/call jobs into regions is ok, "it's not trivial because neighbouring reads have an effect".
Because some of the RAD loci are potentially linked, if I run multiple mpileup jobs using groups of RAD loci as "regions" and then combine the VCF files into one, could this approach affect my calls and linkage disequilibrium statistics?
Thanks a lot for your help. Sorry if this is an obviously stupid question!
Cheers, Sally