gt1 / biobambam2

Tools for early stage alignment file processing

bamcollate2 not reporting all reads #58

Open mirahan opened 6 years ago

mirahan commented 6 years ago

Hi,

I am trying to use bamcollate2 on an RNA-seq BAM file aligned with MapSplice, but after processing with bamcollate2 I am missing reads that were mapped to multiple locations. Can you let me know how I can get all the reads out of bamcollate2? The file I am using is the RNA-seq BAM file found on the GDC legacy website with file id 9b1a94fa-d6e8-49c5-a552-2da0e0ffe893. The output for one read name before and after collating is below. Before collating there are three records (two of them are lines for a single read, which is a fusion alignment); after bamcollate2 only two records remain in the BAM file.

[mirahan@hanlab-dell1 pre-mrna]$ samtools view bam/TCGA-CG-5720-01A.ver2.bam | grep HS2_251:8:1101:1049:197409
HS2_251:8:1101:1049:197409/2 115 chr2 133038633 255 21M54S = 230045563 -97006910 GTTCAACTGCTGTTCACATGGTCGCCCGTCCCTTCGGAACGGCGCTCGCCCATCTCTCAGGACCGACTGACCCAT @B@FFFEFDDBHHGGHHFGIIIIGA?CBF9EHIGH>GHHH?G8BGHIIIIJBBEEF=ACC@BB@@B@BB@BB# XF:Z:ATAC, ZF:Z:FUS_133038633_230045616(--) RG:Z:EXTERNAL_9ed6bfd4-c1c1-44ae-88d8-ff01224de6de_20141215_1108581 IH:i:2 YH:Z:1.1 HI:i:1 YI:i:1 NM:i:3 XS:A:+
HS2_251:8:1101:1049:197409/2 499 chr2 230045563 255 21S54M = 133038633 97006910 GTTCAACTGCTGTTCACATGGTCGCCCGTCCCTTCGGAACGGCGCTCGCCCATCTCTCAGGACCGACTGACCCAT @B@FFFEFDDBHHGGHHFGIIIIGA?CBF9EHIGH>GHHH?G8BGHIIIIJBBEEF=ACC@BB@@B@BB@BB# XF:Z:ATAC, ZF:Z:FUS_133038633_230045616(--) RG:Z:EXTERNAL_9ed6bfd4-c1c1-44ae-88d8-ff01224de6de_20141215_1108581 IH:i:2 YH:Z:1.2 HI:i:2 YI:i:1 NM:i:3 XS:A:+
HS2_251:8:1101:1049:197409/1 69 0 0 0 0 CNGGGGATCTGAACCCGACTCCCTTTCGATCGGCCGAGGGCAACGGAGGCCATCGCCCGTCCCTTCGGAACGGCG @#1=ADBDFHFHFGHIIIBHIIG9CGFF;?DABBDEGIEGHEHEEB?C/;?<C?7(38<7?@BCCCBBBBBBBBB RG:Z:EXTERNAL_9ed6bfd4-c1c1-44ae-88d8-ff01224de6de_20141215_1108581 IH:i:0 HI:i:0

[mirahan@hanlab-dell1 pre-mrna]$ samtools view TCGA-CG-5620-01A.collated.bam | grep HS2_251:8:1101:1049:197409
HS2_251:8:1101:1049:197409/1 69 0 0 0 0 CNGGGGATCTGAACCCGACTCCCTTTCGATCGGCCGAGGGCAACGGAGGCCATCGCCCGTCCCTTCGGAACGGCG @#1=ADBDFHFHFGHIIIBHIIG9CGFF;?DABBDEGIEGHEHEEB?C/;?<C?7(38<7?@BCCCBBBBBBBBB RG:Z:EXTERNAL_9ed6bfd4-c1c1-44ae-88d8-ff01224de6de_20141215_1108581 IH:i:0 HI:i:0
HS2_251:8:1101:1049:197409/2 115 chr2 133038633 255 21M54S = 230045563 -97006910 GTTCAACTGCTGTTCACATGGTCGCCCGTCCCTTCGGAACGGCGCTCGCCCATCTCTCAGGACCGACTGACCCAT @B@FFFEFDDBHHGGHHFGIIIIGA?CBF9EHIGH>GHHH?G8BGHIIIIJBBEEF=ACC@BB@@B@BB@BB# XF:Z:ATAC, ZF:Z:FUS_133038633_230045616(--) RG:Z:EXTERNAL_9ed6bfd4-c1c1-44ae-88d8-ff01224de6de_20141215_1108581 IH:i:2 YH:Z:1.1 HI:i:1 YI:i:1 NM:i:3 XS:A:+
[mirahan@hanlab-dell1 pre-mrna]$

gt1 commented 6 years ago

Hi,

I cannot access the file, but I would assume the rest of the alignments for the read in question are marked as secondary or supplementary. bamcollate2 filters these out by default.
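For anyone who wants to check this against the records posted above, here is a minimal sketch that decodes the FLAG column (the second field of each SAM line). The bit values 0x100 (secondary) and 0x800 (supplementary) come from the SAM specification; the helper names are made up for this example and are not part of biobambam2 or samtools.

```python
# SAM FLAG bits as defined in the SAM specification.
SECONDARY = 0x100      # "secondary alignment"
SUPPLEMENTARY = 0x800  # "supplementary alignment"


def is_secondary(flag: int) -> bool:
    """Return True if the secondary-alignment bit is set (hypothetical helper)."""
    return bool(flag & SECONDARY)


def is_supplementary(flag: int) -> bool:
    """Return True if the supplementary-alignment bit is set (hypothetical helper)."""
    return bool(flag & SUPPLEMENTARY)


# FLAG values taken from the three records in the original post.
for flag in (115, 499, 69):
    print(flag, is_secondary(flag), is_supplementary(flag))
# 115 False False
# 499 True False
# 69 False False
```

The record with FLAG 499 has the secondary bit set, and it is the one missing from the collated output, which would be consistent with the default filtering described above.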

gt1 commented 6 years ago

bamcollate2 currently does not handle secondary alignments well. I would suggest you try the latest bamsort with sort order queryname_HI instead. This should collate alignment pairs, including secondary alignments.