luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
302 stars 38 forks source link

Overlapping Paired-end Reads and their Contribution to Allele Depth #195

Closed tinyheero closed 3 years ago

tinyheero commented 3 years ago

Hi there,

When working with an Illumina paired-end sequencing dataset that has small insert sizes, this leads the paired-end reads overlapping in the middle part of the insert. How does Octopus use this overlapping information in allelic depth calculations? Does it treat the paired-end reads as a single fragment and thus only count them as contributing once to the allelic depth? Or does it treat each read independent and thus double count them?

Kind regards,

dancooke commented 3 years ago

Hi, thanks for your interest in Octopus. Overlapping read pairs are handled by masking (setting the base-quality to zero) overlapped bases, where the base with the lowest quality is masked. This can be disabled with the --disable-overlap-masking option. This read transformation prevents errors being double-counted in the haplotype-likelihood calculation, but does not affect empirical depth metrics (i.e. overlapped bases are double-counted for DP).

dancooke commented 3 years ago

The specific case of allele depth - AD is a bit more involved. To calculate AD, Octopus assigns reads to haplotypes by computing likelihoods of all called haplotypes in the phase block the record is contained by. A read only counts towards AD if the haplotype(s) the read is assigned to include the same allele at the site. If the base-qualities of a read overlapping a variant site are 0, then those bases are uninformative and will not contribute to assigning the read to a haplotype. If this is the only site that disambiguates haplotypes in the phase block then the read will be ambiguous and will not contribute to AD (i.e. will not be double-counted). On the other hand, if there are other variant sites that can result in the read being assigned to a haplotype then the read will contribute to AD. The ADP annotation reports the total number of 'assigned' reads.

Having said that, AD is an annotation computed during variant filtering, and by default read filtering and transformations are disabled for filtering (see --use-preprocessed-reads-for-filtering).

tinyheero commented 3 years ago

Thank you @dancooke! That's very helpful.