How does mpileup adjust-MQ actually work?

samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html

http://samtools.github.io/bcftools/

Other

662 stars 240 forks source link

How does mpileup adjust-MQ actually work? #2217

Closed gmanthey closed 1 month ago

gmanthey commented 3 months ago

Hi there, first of all, thanks for making bcftools! It's really amazing to have these kind of tools available.

I find the documentation around the mpileup option --adjust-MQ a bit lacking though and the option seems to sometimes give unexpected results.

Just from the documentation, I would expect the adjusted mapping quality to follow the given formula: $adjust = \sqrt{\frac{option - quality}{option}}\cdot option$, which would result in the following relationship between mapping quality and adjusted mapping quality (assuming that mapping quality values larger than the value given are left as is): this however seems counterintuitive, as it would boost really bad aligning reads and reduce the value for better aligning ones (or is this what is intended?)
Looking at the code it seems like bcftools calculates the mapping quality used for the adjustment independently of the reported mapping quality from the aligner (see https://github.com/samtools/htslib/blob/develop/realn.c#L39 which is what is called in mpileup). I have not looked at this specifically in illumina data, but I have an example in 10x data where all reads at a position have a reported mapping quality of 60, but some of them seem to get adjusted to below 20 (using -C 50 -q 20), judging from the resulting DP from mpileup. It seems like there is no documentation on how this quality is calculated (and I have not managed to fully understand it from the code) and for which type of data this is applicable.

I'm happy to contribute to the documentation to make this option a bit clearer once I actually understand how it works.

robmaz commented 2 months ago

The problem that mismatch-based MQ readjustment tries to solve is that bwa tends to spit out MQ 60 if it finds any unique alignment, even if the read has to be completely disfigured to achieve that. So, if you have a crap read full of errors or a contaminant read that does not match the reference at all, bwa may truncate it to a fraction of its size and put it somewhere with half of the bases still not matching, and yet proudly declare MQ 60. This is obviously problematic as it produces false high-quality polymorphisms.

The documentation of the feature is basically wrong, though.

gmanthey commented 2 months ago

From what I can tell from the bwa code (https://github.com/lh3/bwa/blob/master/bwamem.c#L982) (and bwa-mem2 as well for that matter), they seem to be using mismatches as well for their mapq calculation. They do seem to be ignoring truncations though. This also fits with some testing I did now, where the mapping quality does reduce if the read has mismatches even if there only is a single mapping location. Here is an example sam output of a uniquely mapping read that has a non 60 MQ score (interestingly, bwa-mem2 fails to match a read with this number of mismatches):

@SQ SN:ref  LN:450
@HD VN:1.5  SO:unsorted GO:query
@PG ID:bwa  PN:bwa  VN:0.7.18-r1243-dirty   CL:bwa mem -a reference.fasta read.fastq -o alignment2.sam
read    0   ref 76  41  75M *   0   0   AGTCACGTAGCTACGATCGATCGGTCGATCGAGCGACTGATCGGTCGGTCGTGGGTACGAGCCATGCGAGCTAGC ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NM:i:7  MD:Z:23A8T10A3A12T1G4T7 AS:i:40 XS:i:0

I also found in the meantime that the -C feature in my case (with values I copied from someone else, shows that this is usually not a good idea :D) biases against reads from the alternate allele in regions with high SNP density, causing false REF calls where HET calls should be correct.

I will try to figure out how the value given for -C actually relates to the resulting mapping quality given mismatches in the read such that one can make a better informed decision on what value to choose here and then make a PR for the documentation.

gmanthey commented 2 months ago

After looking at the code, I've come up with the following of how the adjusted mapping quality is calculated. For a read with:

$n$: number of aligned bases (not clipped/INDELS) + number of unambiguously aligned bases (qual > 13, reference not N) (it feels like it may be unintentional that both are counted here?)
$n_{m}$: number of mismatching bases
$q$: sum of quality values for mismatching bases
$q_{clip}$: sum of quality values for clipped bases (13 for hard clipped bases)

first a mismatch score (lets call it $s$) is calculated:

$$ s = q + \frac{q{clip}}{5} - 4.343 \log(\frac{n^{n{m}}}{n_{m}!}) $$

(I'm not quite sure where this magic number comes from and what the log afterwards means.. My best guess is that it somehow corrects for the length of the read, i.e. a long read may have more errors than a short one. However, this clearly does not scale linear with read length, whereas you would expect the probability of getting a mismatch by evolutionary processes to scale linearly with read length. Also the dependency on number of mismatches in this term is weird, as a higher number of mismatches would increase the subtracted value here? And in fact, one can construct cases where reads with a higher number of mismatches get a higher adjusted MQ value compared to reads with no mismatches.)

Then this is used together with the threshold supplied in -C $thresh$ to calculate a final MQ cap:

If $s > thresh$, the read is discarded as having to many mismatches
If $s < 0$, $s$ is set to 0, which causes the read to get the maximum MQ of $thresh$
Then, the MQ cap is calculated as:

$$ \left \lceil \sqrt{ \frac{thresh - s}{thresh} } \cdot thresh \right \rceil $$

This essentially translates to the following relationships:

higher quality mismatches and clipped bases cause a lower MQ
clipped bases count as 1/5th of a mismatch

So we can now come up with a formula to calculate the minimum value for -C to still allow a certain number of high quality mismatches $n_m$ (this can be calculated together with the read length $n$ and the snp density one wants to allow for $d$ with $n \cdot d$). $Q$ is the minimum mapping quality used for filtering (supplied in -q)

$$ \left \lceil - \frac{4.343 \log(\frac{n^{n_m}}{n_m!}) - 33 n_m}{2} + \sqrt{\frac{(4.343 \log(\frac{n^{n_m}}{n_m!}) - 33 n_m)^2}{4} + Q^2} \right \rceil $$

So for example, if I want to allow for up to 3 SNPs in a read with a read length of 150 and a -q of 20, I need to choose at least 42 for -C.

gmanthey commented 2 months ago

Digging through the commit history it seems like this feature is 3 years older than bwa mem (which I had equated with bwa, as probably many people would as it seems like the more widely used aligner nowadays). So it seems like this was made for the (much simpler) MQ calculation in bwa (which was probably what @robmaz meant in his post, I just didn't understand :D).

I have so far not seen any case where a read aligned by bwa mem (or bwa for that matter), was clearly misaligned though and would need this kind of filtering. If anyone has examples, I would love to see them!

goranrakocevic commented 2 months ago

It's been a long time since I worked on a an aligner and MQ, but it does happen; irrc at least one example were some heavily clipped reads coming from a non-human contaminant. You will find those at the very edge of the minimum alignment score to be output (-T, 30 by default). I'm sure there would be other cases, but not that many.

Perhaps this is already clear (apologies if so): BWA-MEM does use mismatches, but in a manner relative to the number of mismatches at second best hit (the bigger the difference in the number of misses (more precisely, the SW score), the bigger the MQ).

However, there is often a somewhat indirect penalty wrt the absolute number of mismatches, as they start acting as something that may very loosely be though of as "degrees of freedom"; if my top hit has 10 mismatches, that means I get to mutate 10 positions in searching for an equally good hit (that would make the read multi-mapped), or 11 positions in search of hit that is almost as good etc.

All this said, I think the behaviour of BWA-MEM is much more a feature than a bug:

Not penalising soft clips mean adapters and low quality read ends are sort of OK without trimming; also means the aligner can be reasonably accurate near the breakpoints bigger variants (SVs and such)
It also means that in many cases the aligner places the reads correctly around and over medium-sized indels (10-30 bp or so), which can be handled by tools like GATK HaplotypeCaller (i.e. positions is correct, but CIGAR is a throwaway).

gmanthey commented 2 months ago

It's been a long time since I worked on a an aligner and MQ, but it does happen; irrc at least one example were some heavily clipped reads coming from a non-human contaminant. You will find those at the very edge of the minimum alignment score to be output (-T, 30 by default). I'm sure there would be other cases, but not that many.

I'll see if I can filter for this and see how they actually look in my own data then, thanks for the pointer!

BWA-MEM does use mismatches, but in a manner relative to the number of mismatches at second best hit (the bigger the difference in the number of misses (more precisely, the SW score), the bigger the MQ).

While this is true, it seems from the code that also the absolute mismatches go into the calculation of the MQ directly. In https://github.com/lh3/bwa/blob/master/bwamem.c#L989, the identity is calculated from the SW score from just the best hit and then goes either as identity^4 or as identity^2 into the MQ (dependent on what value for -Q is given, an option that does not appear to be documented in the man page... ). Maybe it's also time to properly document the MQ calculation in bwa mem..

All this said, I think the behaviour of BWA-MEM is much more a feature than a bug:

I agree, it seems like this works well for most of the cases. I also think the issue is much more around lacking documentation of things and then possibly wrong choices being made when using the available options.

jkbonfield commented 1 month ago

I did look at this several years back and document it somewhere, but I've no idea where or what for. It's basically a poorly documented heuristic which I suspect may have been first invented for maq, prior to bwa. As others have indicated, this feels like the wrong place for such a heuristic. If it's demonstrated to improve mapping quality, then the aligner should incorporate it. If it's not improving the accuracy of MAPQ then it should just be dropped. Thankfully it's turned off by default.

I'll have a stab at documented it fully, albeit in the full gory horrific details. I cannot however make any justification for why it is like it is. I'm tempted to also advise against its use. We can't really disable such things as changing anything to do with mpileup is liable to break pipelines, but I'm uneasy about recommending such usage given the nature of the data (both read length and quality profiles) will have changed considerably since the values in those heuristics were first derived.

gmanthey commented 1 month ago

Thanks for chiming in! Feel free to use any of the formulas I put together here for the documentation, I'm at this point very certain that they reflect what is happening in mpileup (after also having done some testing). Any thoughts on this bit?

And in fact, one can construct cases where reads with a higher number of mismatches get a higher adjusted MQ value compared to reads with no mismatches.

This seems to essentially happen when there are some clipped bases. Then having some not super high quality mismatches is better than not having mismatches.

Also the documentation of samtools needs to be updated, there the recommendation is to use -C 50 for bwa (see http://www.htslib.org/doc/samtools-mpileup.html and especially http://www.htslib.org/doc/samtools.html#SYNOPSIS). I'm also surprised to see that the move of mpileup to bcftools is not mentioned in the samtools documentation.

jkbonfield commented 1 month ago

Samtools mpileup cannot produce VCF any more, so it doesn't serve the same purpose as bcftools' implementation.

The philosophy there was to keep the primary variant calling components within the one package for ease of maintainer work. It also helps for software packaging, as variant calling in bcftools is mpileup | call so having both version controlled together makes much more sense. However the text mode of mpileup from samtools is still used by other tools.

In essence it means we have two different use cases sharing the same sub-command name, with some shared code in htslib, such as this adjust-MQ code.

jkbonfield commented 1 month ago

An oddity of this code is that indels don't count as mismatches, so they're essentially free when it comes to adjusting the mapping quality. Almost certainly this is a characteristic of the technology being used at the time - Illumina or maybe even Solexa. Similarly the hard-coded qual >= 13 (and capped at 33) was based on qualities coming from that instrument. It makes this option woefully inappropriate for many newer machine technologies.

As for some of the other maths, note that 4.343 is ~10 log(10). Ie it's the phred scale which is also used in computing mapping quality. It makes a little more sense then. The factorial component is probably some combinatorial equation, but I'm not immediately spotting it.

The code originated in https://github.com/samtools/samtools/commit/51f014165fb63b2b9b86db80ea4681f9351453a1 with a minimal commit message. Documentation on the option appeared a couple weeks later in https://github.com/samtools/samtools/commit/dda27af89b70c52e075f6531d56c0bbccbc7246d#diff-23bd30520e73a8aabf1fc37fdd4b7515ca5bb82dd08c6ff96aaf98d439f0cd46R273-R278 but also minimally and essentially unchanged since then.

Right from the very start it's had curious code oddities. Like what is clip_l for? It's computed and then never used. It also checks qual >= 13 in two nested if statements so the second is a nop. It feels like it was an experiment which got accepted as it worked on that data set, and is now cast in stone.

It predates MAQ's first commit according to the git logs, but bwa has a ChangeLog file showing commits prior to git. It looks like it was a local subversion project before that and the git one wasn't converted in a way to maintain logs. So bwa goes back to May 2008. About 1 month after the last commit to MAQ.

All of this is just code archaeology though and doesn't really help to document this "feature". I can't use images as this has to be in the text man pages too, but something simple can replace that. I'm working on it.