twolinin / longphase

GNU General Public License v3.0

Duplicate lines found in haplotag result #20

Closed btrainee closed 4 months ago

btrainee commented 2 years ago

Hi team,

Thanks for longphase. I have two questions. The first is as stated in the title: why are there many duplicate lines in the haplotag.out.bam result file and the haplotag.log file? Here are some duplicate lines from the haplotag.log file as an example; the same problem occurs in the haplotag.out.bam file.

ERR3861382.469833 ptg000197l 17571 -nan . 0 0 0
ERR3861382.469833 ptg000197l 17571 -nan . 0 0 0
ERR3861382.349163 ptg000197l 105638 1 1 31 31 0 105653,0 106571,0 106706,0 107034,0 107087,0 107109,0 107364,0 107498,0 107504,0 107577,0 107580,0 107608,0 107750,0 107992,0 108000,0 108111,0 108167,0 108191,0 108258,0 108308,0 108418,0 108444,0 108447,0 108467,0 108485,0 108492,0 108498,0 108508,0 108525,0 108542,0 108552,0
ERR3861382.349163 ptg000197l 105638 1 1 31 31 0 105653,0 106571,0 106706,0 107034,0 107087,0 107109,0 107364,0 107498,0 107504,0 107577,0 107580,0 107608,0 107750,0 107992,0 108000,0 108111,0 108167,0 108191,0 108258,0 108308,0 108418,0 108444,0 108447,0 108467,0 108485,0 108492,0 108498,0 108508,0 108525,0 108542,0 108552,0

$ wc -l haplotag_snponly.out
6920552 haplotag_snponly.out
$ sort -u haplotag_snponly.out | wc -l
5792611

My second question: the haplotagged reads (hap1, hap2, and untagged reads) appear to be incomplete. Apart from unmapped reads, many of the raw long reads are not labeled as hap1, hap2, or untagged. Should all untagged reads appear in the haplotag.out.bam result file or the haplotag.log file?

twolinin commented 2 years ago

Hi @btrainee

I checked my BAM file but could not reproduce this problem. Could you please check whether your input BAM file contains duplicate alignments?

$ wc -l test.out
6744470 test.out
$ sort -u test.out | wc -l
6744471

Thanks
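Comparing `wc -l` with `sort -u | wc -l`, as in the commands above, shows how many surplus lines exist but not which records are repeated. A minimal sketch of how the repeated lines themselves could be listed follows. The helper names `list_duplicate_lines` and `count_extra_lines` are hypothetical and not part of longphase; only standard `sort`, `uniq`, and `wc` are assumed.

```shell
# Hypothetical helper (not part of longphase): print every line that occurs
# more than once in the file given as $1, prefixed by its occurrence count.
list_duplicate_lines() {
    # LC_ALL=C gives a plain byte-wise sort order; uniq needs sorted input.
    # -c prefixes each line with its count, -d keeps only repeated lines.
    LC_ALL=C sort "$1" | uniq -cd
}

# Hypothetical helper: number of surplus copies, i.e. total lines minus
# distinct lines (the difference between the two counts in this thread).
count_extra_lines() {
    total=$(wc -l < "$1")
    unique=$(LC_ALL=C sort -u "$1" | wc -l)
    echo $((total - unique))
}
```

For example, `list_duplicate_lines haplotag_snponly.out | head` would show the first few repeated records, assuming the log file name used earlier in this thread.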

btrainee commented 2 years ago

Sorry, @twolinin. I checked my alignment BAM file, and it does indeed contain duplicate lines.
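The check on the input BAM discussed in this thread could be sketched roughly as below. This is an assumption about one reasonable way to do it, not the method either commenter used: the helper name `list_duplicate_alignments` is hypothetical, `alignment.bam` is a placeholder file name, and treating two records as duplicates when their first six SAM fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR) match is a simplification.

```shell
# Hypothetical helper: read SAM records on stdin and print each alignment key
# (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR) that occurs more than once,
# prefixed by its occurrence count.
list_duplicate_alignments() {
    # Drop header lines (they start with '@'), keep the first six
    # tab-separated fields, then count repeated keys.
    grep -v '^@' | cut -f 1-6 | LC_ALL=C sort | uniq -cd
}

# Usage (assumes samtools is installed; alignment.bam is a placeholder):
#   samtools view alignment.bam | list_duplicate_alignments | head
```

Empty output would suggest that no alignment record is repeated under this key; any printed line is a candidate duplicate to inspect before rerunning haplotag.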