lh3 / bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
GNU General Public License v3.0
1.55k stars 556 forks source link

Minimum score threshold (`-T` option) is not applied to paired reads #187

Open gmagoon opened 6 years ago

gmagoon commented 6 years ago

Based on testing and browsing the bwamem.c and bwamem_pair.c code, it appears to me that the minimum score threshold (set by the -T option, with a default of 30) is not being applied to paired read alignments in most cases. As far as I can tell, for paired reads, it is only being used for ALT hits, as implemented in the following line: https://github.com/lh3/bwa/blob/fde1fd7ce3f52e74907169b927922a12a9feb000/bwamem_pair.c#L347

Based on the way this threshold is being applied for unpaired reads and ALT hits, I might expect to see something preceding the following line, in order to filter the typical paired read situation: https://github.com/lh3/bwa/blob/fde1fd7ce3f52e74907169b927922a12a9feb000/bwamem_pair.c#L340 [It is not immediately obvious to me whether it would be better to apply the threshold to the score for each read individually or to the combined score for the pair...but my gut inclination would be to apply to combined score.]

Maybe the current behavior is intentional for some reason?

gmagoon commented 6 years ago

PS...I've just spotted closed issue #133 and it sounds like the behavior is by design. So I guess my question becomes: Is there a recommended approach for filtering by score for paired read alignments?

odoublewen commented 6 years ago

@gmagoon Thanks for documenting this. This confused me too. In addition to #133, I found this query in the bwa mail list. So we're not the only ones!

@lh3 If it's by design, it would be helpful to document it. But it does seem like something that could be fixed. My gut would be to apply to threshold to each alignment separately (FWIW).

donutbrew commented 5 years ago

I just discovered this issue after a certain dataset was really giving surprising results. Off target amplification products give short (20-25 base) alignments to a given reference sequence really screwing up a bunch of calculations. Implementing (at least optionally) the Minimum score threshold for PE reads would completely solve this. @lh3 is there some reason it's not enabled?

tseemann commented 5 years ago

@gmagoon the alignment for a pair of reads should probably also include the mate-distance into the score; if it is too far away then it should be penalized as well. But I agree this behaviour was unexpected and should be documented.

tseemann commented 5 years ago
              -T INT    Don't  output  alignment  with  score  lower than INT.
                        This option affects output and occasionally  SAM  flag
                        2. [30]
samtools flags 2
0x2     2       PROPER_PAIR