oicr-gsi / bamqc-legacy

Perl scripts for generating quality control stats from BAM files
https://oicr-gsi.github.io/bamqc/bamqc.html
Other
0 stars 0 forks source link

procCigar reversing cigar string #10

Open slazicoicr opened 5 years ago

slazicoicr commented 5 years ago

The parsed CIGAR string is reversed if it is on the - strand: https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L1900

I assume the $strand variable is derived from the read reverse strand (16) flag.

I am not sure how this feeds into downstream calculations, but I think that reversing it is unnecessary and could cause issues:

This thread states that reads from the reverse strands are reverse-complemented by aligners to ensure their orientation matches the reference. This means that:

the CIGAR reports the operations performed on the reference FORWARD strand also if the read comes from the reverse strand the sequence reported in the SEQ field is the one on the FORWARD strand, even if the original read comes from the reverse strand

So by reversing the CIGAR string:

By never reversing the CIGAR string, the CIGAR string, read sequence, and reference sequence would always remain in sync.

iainrb commented 5 years ago

It looks like the reason for reversing @cigarOp is to calculate various statistics by cycle. If done in a clear and well structured way, that might be OK, but this code falls a long way short. Something to refactor when time permits.

slazicoicr commented 5 years ago

I think the reversing does introduce false data at

https://github.com/oicr-gsi/bamqc/blob/c6ea90b3005570b5e37cef0105d70235a5f5fb9c/lib/GSI/bamqc.pm#L656-L667

@Op has the reversed cigar string, but the $start1 position and the incrementing $start1 + $posOffset are not changed to reflect a reversed fragment. I think it is correct to use the non-reversed cigar string here.

I don't think this error is propagated, as sub runningBaseCoverage does not use the incorrect positions for calculations and a quick search did not find the the values of {runningBaseCoverage} to be used anywhere else.