rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

Read Length Normalization #65

Closed Deeeeen closed 2 years ago

Deeeeen commented 2 years ago

Hi Robbie,

When I run STITCH and it is inputting reads from BAM files, I see it print out how many reads were removed for each sample, and the number differs between samples. I am wondering whether STITCH does any normalization on the number of reads when inputting reads from BAM files. If yes, could you explain a little bit about how STITCH does the normalization, or point me to where I can find more information about this?

Best, Den

rwdavies commented 2 years ago

Hi Den,

Reads are only removed for downsampling reasons, to minimize the risk of numerical under / overflow. This is controlled by the parameter downsampleToCov. Because the method doesn't work in log space, if the coverage gets too high at one site, and if the reads are long enough, the emission probabilities can get small enough to underflow a double (think < 1 x 10^{-300}). This hopefully shouldn't remove many reads, and in any case, the reads it removes shouldn't be that informative for a low-coverage imputation method. If you use grids (a non-default option), the same downsampling is applied there as well.

STITCH otherwise does no downsampling or normalization of reads.
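As a quick illustration of the underflow risk described above (a sketch, not STITCH code; the per-read emission probability, coverage depth, and cap used here are all illustrative assumptions, not STITCH's actual values):

```python
# Sketch: multiplying many per-read emission probabilities outside log
# space can underflow a double at a very deep site.
p = 1e-4        # hypothetical per-read emission probability (assumption)
depth = 200     # hypothetical coverage at one site (assumption)

product = 1.0
underflow_at = None
for n in range(1, depth + 1):
    product *= p
    if product == 0.0:   # double-precision underflow
        underflow_at = n
        break
print("naive product underflows after", underflow_at, "reads")

# Capping coverage (in the spirit of downsampleToCov) keeps the
# product representable as a double:
cap = 20        # hypothetical cap (assumption; not the STITCH default)
capped_product = p ** cap
print("capped product:", capped_product)  # ~1e-80, comfortably nonzero
```

With these numbers the naive product hits exact zero well before 200 reads, while the capped product stays a perfectly ordinary double, which is why discarding reads beyond a coverage cap sidesteps the problem without working in log space.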

Thanks Robbie

Deeeeen commented 2 years ago

Thanks Robbie! This is super clear!