This commit fixes two bugs related to reverse complement.
Logging reverse complement sequence in PacBio mode.
Not calling Viterbi on reverse complement sequence.
Details
1. Logging reverse complement sequence in PacBio mode.
All sequence in log file is designed to be on "forward-strand". That is, all reads aligned to the reverse-strand should be reverse-complemented (converted to forward-strand) before logging.
However, the logged sequence is reverse-complemented when it's mapped to the reverse-strand. (adVNTR code link)
adVNTR reverse-complement the sequence when it's mapped to reverse-strand.
But, it's already reverse-complemented by default (pysam)
-The confusion was introduced by the pysam library. The seq is automatically reverse-complemented. See pysam documentation (.seq is deprecated, but same as query_sequence).
Why does it matter?
pairwise_aln_generator.py uses the logged sequence to generate an alignment file.
The logged sequence should be always in "forward_strand" to match to HMM.
However, we are logging reverse_complemented sequences if the sequence is mapped to the reverse_strand.
That caused very poor alignment on the pairwise_alignment tool making reads filtered out in many cases.
2. Not calling Viterbi on reverse complement sequence.
If a read is mapped, we already know which strand the read is mapped.
Therefore, we don't need to call Viterbi on both the sequence and the reverse complement of it because we know which sequence mapped to the reference (forward strand).
For unmapped reads, we try both and save the "forward sequence" that has the higher score.
It's expected that this fix reduces the running time significantly as Viterbi takes most of the running time.
Description
This commit fixes two bugs related to reverse complement.
Details
1. Logging reverse complement sequence in PacBio mode.
Why does it matter?
pairwise_aln_generator.py
uses the logged sequence to generate an alignment file.2. Not calling Viterbi on reverse complement sequence.
If a read is mapped, we already know which strand the read is mapped. Therefore, we don't need to call Viterbi on both the sequence and the reverse complement of it because we know which sequence mapped to the reference (forward strand). For unmapped reads, we try both and save the "forward sequence" that has the higher score.
It's expected that this fix reduces the running time significantly as Viterbi takes most of the running time.