Open gmagoon opened 6 years ago
PS...I've just spotted closed issue #133 and it sounds like the behavior is by design. So I guess my question becomes: Is there a recommended approach for filtering by score for paired read alignments?
@gmagoon Thanks for documenting this. This confused me too. In addition to #133, I found this query in the bwa mail list. So we're not the only ones!
@lh3 If it's by design, it would be helpful to document it. But it does seem like something that could be fixed. My gut would be to apply to threshold to each alignment separately (FWIW).
I just discovered this issue after a certain dataset was really giving surprising results. Off target amplification products give short (20-25 base) alignments to a given reference sequence really screwing up a bunch of calculations. Implementing (at least optionally) the Minimum score threshold for PE reads would completely solve this. @lh3 is there some reason it's not enabled?
@gmagoon the alignment for a pair of reads should probably also include the mate-distance into the score; if it is too far away then it should be penalized as well. But I agree this behaviour was unexpected and should be documented.
-T INT Don't output alignment with score lower than INT.
This option affects output and occasionally SAM flag
2. [30]
samtools flags 2
0x2 2 PROPER_PAIR
Based on testing and browsing the
bwamem.c
andbwamem_pair.c
code, it appears to me that the minimum score threshold (set by the-T
option, with a default of 30) is not being applied to paired read alignments in most cases. As far as I can tell, for paired reads, it is only being used for ALT hits, as implemented in the following line: https://github.com/lh3/bwa/blob/fde1fd7ce3f52e74907169b927922a12a9feb000/bwamem_pair.c#L347Based on the way this threshold is being applied for unpaired reads and ALT hits, I might expect to see something preceding the following line, in order to filter the typical paired read situation: https://github.com/lh3/bwa/blob/fde1fd7ce3f52e74907169b927922a12a9feb000/bwamem_pair.c#L340 [It is not immediately obvious to me whether it would be better to apply the threshold to the score for each read individually or to the combined score for the pair...but my gut inclination would be to apply to combined score.]
Maybe the current behavior is intentional for some reason?