tecangenomics / nudup

NuDup -- Marks/removes duplicate molecules based on the molecular tagging technology used in Tecan products.
http://www.tecangenomics.com
GNU Lesser General Public License v3.0

shorten qname in output bam file in --rmdup-only mode #22

Open frankyan opened 5 years ago

frankyan commented 5 years ago

Hi, I found that in "--rmdup-only" mode, the qname (read name) on each line of the output bam file is shortened by (len_dupindex + 1) characters. For example:

# original STAR mapping bam:
MN00727:7:000H2HJHJ:1:12110:12180:16802 16      chr10   63493   255     71M
# the dedup output:
MN00727:7:000H2HJHJ:1:12110:12  16      chr10   63493   255     71M

I checked the source code and found the bug. It is on lines 307 and 308, which call _get_sam_str on sam_row twice, so UMIStripFunc is applied to the same row two times.

sam_str = self._get_sam_str(sam_row)
self._torm.stdin.write(self._get_sam_str(sam_row))

I have opened a pull request (#23) to fix this problem. I used single-end sequencing data; I didn't test paired-end data.
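To illustrate why the second call matters, here is a minimal sketch of the double-strip effect. It is not the NuDup source: get_sam_str is a hypothetical stand-in for _get_sam_str, the in-place trimming of the qname is an assumption about how UMIStripFunc behaves, LEN_DUPINDEX = 8 is inferred from the 9 characters lost in the example above, and the ":ACGTACGT" suffix is made up for the demonstration.

```python
LEN_DUPINDEX = 8  # assumed UMI length (the example above loses 8 + 1 characters)


def get_sam_str(sam_row):
    """Hypothetical stand-in for _get_sam_str.

    Assumption: the strip function trims the last (LEN_DUPINDEX + 1)
    characters of the qname in place, then the row is joined into a SAM line.
    """
    sam_row[0] = sam_row[0][:-(LEN_DUPINDEX + 1)]
    return "\t".join(sam_row) + "\n"


def write_buggy(sam_row, out):
    # Mirrors the two quoted lines: the row is stripped once here ...
    sam_str = get_sam_str(sam_row)
    # ... and a second time here, so the qname loses real characters.
    out.append(get_sam_str(sam_row))


def write_fixed(sam_row, out):
    # Strip once and reuse the resulting string.
    sam_str = get_sam_str(sam_row)
    out.append(sam_str)


def demo_row():
    # Made-up input: the read name from the example plus a hypothetical UMI suffix.
    return ["MN00727:7:000H2HJHJ:1:12110:12180:16802:ACGTACGT",
            "16", "chr10", "63493", "255", "71M"]


buggy_out, fixed_out = [], []
write_buggy(demo_row(), buggy_out)
write_fixed(demo_row(), fixed_out)
print(buggy_out[0].split("\t")[0])
print(fixed_out[0].split("\t")[0])
```

Under these assumptions, the buggy path prints the truncated read name and the fixed path prints the full one; the only change is writing the already-computed sam_str instead of calling the function again.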

mlovci commented 5 years ago

Thank you for submitting your PR and reporting this issue. We'll review it ASAP!

karaulanov commented 3 years ago

Is there a reason why the suggested bug fix (which, by the way, also works for paired-end data) hasn't been merged yet and the issue is still open?