samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

consequence of running mpileup on separate "regions"? #1406

Closed sallycylau closed 3 years ago

sallycylau commented 3 years ago

Hi there

I am currently calling SNPs in ~400 samples using bcftools mpileup | bcftools call, but it is running quite slowly. I have seen that others have tried to speed up the process by running mpileup/call on separate regions across multiple jobs.

However, I am working with target capture sequencing of RAD loci, meaning that instead of mapping my reads to a genome, I mapped them to the consensus sequences of known RAD loci (further downstream filtering is still needed to detect and avoid linkage between RAD loci).

e.g. the CHROM column of my output VCF will contain RADloc_001, RADloc_002, RADloc_003.

I came across this old samtools thread (https://github.com/samtools/samtools/issues/480), and one of the comments says while dividing the mpileup/call jobs into regions is ok, "it's not trivial because neighbouring reads have an effect".

Because some of the RAD loci are potentially linked, if I run multiple mpileup jobs using groups of RAD loci as the "regions" and then combine the VCF files into one, could this approach affect my calls and linkage disequilibrium statistics?

Thanks a lot for your help. Sorry if this is an obviously stupid question!

Cheers Sally

pd3 commented 3 years ago

That note was about how difficult it is to parallelize the calling code itself. You can safely split the calling into regions (e.g. chr1:1-1000000, chr1:1000001-2000000, etc.), run them separately, and merge the results afterwards; mpileup considers all reads overlapping the given region. This is how the program is routinely used.
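
For the RAD-loci case in the question, the split-and-merge approach could be sketched roughly as below. This is only an illustration, not an official recipe: ref.fa, bams.txt, the part_*.vcf.gz and merged.vcf.gz names, and both helper functions are hypothetical, and the script only builds command strings rather than running bcftools. It assumes every locus lands in exactly one group, that groups follow the order of the loci in the reference, and that all jobs use the same BAM list, so the per-group VCFs share the same samples and can be concatenated in order.

```python
import shlex


def chunk(names, n_chunks):
    """Split contig names into at most n_chunks contiguous, non-overlapping groups."""
    size = max(1, -(-len(names) // n_chunks))  # ceiling division, at least 1
    return [names[i:i + size] for i in range(0, len(names), size)]


def build_commands(loci, n_jobs, ref="ref.fa", bam_list="bams.txt"):
    """Return (per-group pipeline commands, final concat command) as strings.

    ref and bam_list are placeholder file names (assumptions, not from the thread).
    """
    jobs, outputs = [], []
    for k, group in enumerate(chunk(loci, n_jobs)):
        out = f"part_{k:03d}.vcf.gz"          # hypothetical output name
        regions = ",".join(group)             # -r takes a comma-separated list
        jobs.append(
            f"bcftools mpileup -f {shlex.quote(ref)} -b {shlex.quote(bam_list)} "
            f"-r {shlex.quote(regions)} -Ou | bcftools call -mv -Oz -o {out}"
        )
        outputs.append(out)
    # Concatenate the per-group files in the same order the groups were made.
    concat = "bcftools concat -Oz -o merged.vcf.gz " + " ".join(outputs)
    return jobs, concat


if __name__ == "__main__":
    loci = [f"RADloc_{i:03d}" for i in range(1, 8)]
    jobs, concat = build_commands(loci, n_jobs=3)
    for cmd in jobs:
        print(cmd)   # each line could be submitted as a separate job
    print(concat)    # run once all jobs have finished
```

With several hundred loci per group, the comma-separated -r list gets long; writing each group to a regions file and passing it with -R is probably more practical. Since each locus is its own reference sequence here, grouping whole loci never cuts through a locus, so no region boundary falls inside a stretch of reads.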