gt1 / biobambam2

Tools for early stage alignment file processing
Other
93 stars 17 forks source link

indexing fails on sparse BAMs #10

Closed keiranmraine closed 8 years ago

keiranmraine commented 8 years ago

Hi,

We have a BAM file that only has a very small number of reads which only hit a handful of the reference sequences. It appears that this triggers a check failure:

$ bamindex -v
This is biobambam2 version 2.0.25.
biobambam2 is distributed under version 3 of the GNU General Public License.

$ bamindex tmpfile=tmpbamindex < 4911821_sorted_rmdup.bam > 4911821_sorted_rmdup.bam.bai
[V] 1 554204 3047
[V] 2 334746 1804
[V] 6 1 0
[V] Y 0 1
[V] 893803
BamIndexGenerator::checkConsisteny(): inconsistent binCIS.peek()=23 != linCIS.peek()=-1
bin index and linear index are inconsistent.

/software/CGP/pancan/bin/../lib/libmaus2.so.2(libmaus2::util::StackTrace::StackTrace()+0x4c)[0x7f572ed17a8c:??:0]
/software/CGP/pancan/bin/bamindex(libmaus2::exception::LibMausException::LibMausException()+0x20)[0x40d410:??:0]
/software/CGP/pancan/bin/bamindex()[0x421a35:??:0]
/software/CGP/pancan/bin/bamindex()[0x40b90e:??:0]
/software/CGP/pancan/bin/bamindex()[0x407e68:??:0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f572bf1876d:??:0]
/software/CGP/pancan/bin/bamindex()[0x4080f2:??:0]

The behaviour was initially noticed running bammarkduplicates2 with index=1.

Samtools indexes the file fine with the following stats:

$ samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.18 (r982:295)
...
$ samtools index 4911821_sorted_rmdup.bam
$ samtools idxstats 4911821_sorted_rmdup.bam
1   249250621   554204  3047
2   243199373   334746  1804
3   198022430   0   0
4   191154276   0   0
5   180915260   0   0
6   171115067   1   0
7   159138663   0   0
8   146364022   0   0
9   141213431   0   0
10  135534747   0   0
11  135006516   0   0
12  133851895   0   0
13  115169878   0   0
14  107349540   0   0
15  102531392   0   0
16  90354753    0   0
17  81195210    0   0
18  78077248    0   0
19  59128983    0   0
20  63025520    0   0
21  48129895    0   0
22  51304566    0   0
X   155270560   0   0
Y   59373566    0   1
MT  16569   0   0
*   0   0   0
gt1 commented 8 years ago

Hi,

it looks like the input file contains reads which

This is a bit unusual as it probably means there was at some point a mate of a read which mapped to a reference sequence but was then removed at a later stage, leaving only the unmapped mate behind.

I have updated libmaus2/biobambam2 to print a warning when encountering this but not to quit indexing. Can you please try again with biobambam2 version 2.0.30?

Best, German

keiranmraine commented 8 years ago

Hi German,

This is now working. I suspect the cause is that our legacy pipeline uses rmdup=1.

$ ./bammarkduplicates2 rmdup=1 index=1 tmpfile=thingy I=4911821_sorted.bam O=fixed.bam M=fixed.met indexfilename=fixed.bam.bai level=1
[V] output compression level 1
[D] excntpairs=1 fincntpairs=441971 strcntpairs=145
[D] excntfrags=2 fincntfrags=889084 strcntfrags=0
[V] fragment and pair data computed in time 5.65365 (05:65362499)
[V] 893938 lines, 893938 als, 889086 mapped frags, 442117 mapped pairs, 190788 frags/s MemUsage(size=71.8516,rss=24.418,peak=504.395)
[V] Checking pairs...done, rate 11932
[V] Checking single fragments...done, rate 343.997
[V] number of alignments marked as duplicates: 135 time 5.68297 (05:68296299)
[V] Filtered 893937(0,1) total for marking time 07:66722199 MemUsage(size=71.8516,rss=24.6719,peak=504.395)
[W] BamIndexGenerator::checkConsisteny(): warning, bin chunks for refid 23 without corresponding linear chunks
[W] BamIndexGenerator::flush: warning, bin index and linear index look inconsistent
[V] MemUsage(size=71.8516,rss=24.6719,peak=504.395) 14.3264 (14:32641100)
$ echo $?
0
gt1 commented 8 years ago

Hi Keiran,

thank you for confirming this is fixed now.