samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
656 stars 240 forks source link

[mpileup2] How does mpileup choose the reads for later calling, and is the -d flag behaving strangely? #2234

Open goeckeritz opened 1 month ago

goeckeritz commented 1 month ago

Hi bcftools team,

I have a conceptual question about mpileup and I think the -d flag is either being weird or I'm misunderstanding how it works. I am using bcftools v. 1.18, and the .bam file in question contains a single sample.

I have reads piling up at some problematic genomic regions (we're talking 10-20X the average depth), and I don't really want to trust the variants in those areas. For example, there will be 95% of reads at a site supporting the REF allele, and 5% supporting an ALT allele, so even if all or the majority of reads were included, I'd expect the site to be called homozygous REF. I should note that almost all of the reads at this site will be Primary, too, with a MAPQ > 30.

However, even when the -d flag is set to 250, I'll have only 30 or so reads out of ~600 be chosen for future variant calling and contributing to DP, so this variant site doesn't get filtered out with my max depth filter later on in my pipeline. Moreover, those 30 reads will include almost ALL of those reads from the ALT allele; this results in this site being called heterozygous REF/ALT.

So my first question is: How does mpileup choose which reads to include for later calling -- i.e., how does it choose how many and which reads contribute to DP? It definitely doesn't appear to be random, since in the example above it grabbed almost all of the ALT supporting reads, despite them being 5% of the high quality mappings at that site.

My second question is: Why is the DP at this site not 250, if that's my maximum number of reads to give to mpileup? When I set -d to 1000 or 2000, mpileup grabs 166 reads for this site (DP=166). It's not until I set -d to 10000 does it grab most of the reads at that site for future variant calling. It looks like this issue was touched on in #1694 -- but in the opposite direction and I'm not sure what was meant by "Hence the filtering isn't as strict as it may sound, an is perhaps closer to an average maximum depth threshold rather than something to take as a literal limit." Also, I'm not necessarily trying to filter on the -d flag either. I just want more of the data represented in genotype likelihood assessments, especially when they're relatively high quality -- and therefore I'll know that this site is a problematic region with lots of reads piling up so I CAN filter it out later.

Since -d or --max-depth is defined as 'At a position, read maximally INT reads per input file', this behavior doesn't seem intuitive to me.. But it sounds like there is more to the -d flag that I'm not comprehending, and that it isn't necessarily a bug, right?

Thanks in advance for any time you might have for answering these questions. If it's easier to just point to some reading material, I'm happy to do some more legwork to make your life easier!

Kindly, Charity

pd3 commented 1 month ago

There are two read subsetting steps.

First, the mpileup -d option controls the maximum number of reads in the pileup. Unfortunately, this is not random, the buffer fills up to the maximum depth and it starts to throw away spurious reads. This is to be addressed in the new mpileup2 code https://github.com/samtools/bcftools/pull/1842. As to your second question, he comment https://github.com/samtools/bcftools/issues/1694#issuecomment-1087269835 answers it rather well: as more reads come at a later position, at some point it reaches the maximum of 250 reads. However, because many reads don't start at the same position, the coverage will be lower to the left of the maximum.

Then there is subsetting of reads at the genotype likelihood calculation step. This is invisible to the user and cannot be controlled by any option: the pre-computed auxiliary tables use maximum 255 reads. This is sufficient for germline variant calling in practice.

I'll mark this issue as enhancement, the future mpileup2 code will address it.