bamtofastq gets confused over near-identical read names

mjafin commented 8 years ago

Hi there, many thanks for all the work on biobambam2, much appreciated. I've come across what is probably a (version?) name-sorting problem in bamtofastq.

The eventual error manifests itself in bwa as

[mem_sam_pe] paired reads have different names: "0302t6w", "0302t06w"

Looking at the original BAM file, there are reads with similar names in this order:

0302t6wM        99      chr1    16254844
...
0302t6wM        147     chr1    16254937
...
0302t06w        65      chr3    45533685
...
0302t6ww        99      chr7    74710731
...
0302t6ww        147     chr7    74710799
...
0302t6wO        99      chr12   50005111
...
0302t6wO        147     chr12   50005241
...
0302t06w        129     chr13   37453392
...
0302t6w         97      chr13   43262597
...
0302t6w         145     chrX    33099514
...

After bamtofastq, the output files contain only these reads:

@0302t6wM/1
@0302t6ww/1
@0302t6wO/1
@0302t6w/1
@0302t6wM/2
@0302t6ww/2
@0302t6wO/2
@0302t06w/2

In other words, @0302t6w/2 and @0302t06w/1 are missing.
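
A minimal sketch of how the missing mates can be confirmed from the FASTQ header lines alone. The helper names (`read_names`, `unpaired`) are hypothetical and not part of biobambam2; the header lists are the ones shown above, and the `/1` and `/2` suffixes are assumed to mark the mates.

```python
def read_names(headers, suffix):
    """Return the set of read names with the leading '@' and mate suffix removed."""
    names = set()
    for header in headers:
        name = header.strip()
        if name.startswith("@"):
            name = name[1:]
        if name.endswith(suffix):
            name = name[: -len(suffix)]
        names.add(name)
    return names


def unpaired(headers1, headers2):
    """Return (names only in file 1, names only in file 2), each sorted."""
    names1 = read_names(headers1, "/1")
    names2 = read_names(headers2, "/2")
    return sorted(names1 - names2), sorted(names2 - names1)


# Header lines copied from the two output files listed above.
first = ["@0302t6wM/1", "@0302t6ww/1", "@0302t6wO/1", "@0302t6w/1"]
second = ["@0302t6wM/2", "@0302t6ww/2", "@0302t6wO/2", "@0302t06w/2"]

only_first, only_second = unpaired(first, second)
print("mate /2 missing for:", only_first)
print("mate /1 missing for:", only_second)
```

Any name reported here has lost its mate, which is the condition bwa complains about.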

The command is roughly like this:

bamtofastq filename=my.sorted.bam T=temp.sorted-1.fq-sort F=>(bgzip -c /dev/stdin > my.sorted-1.fq.gz) F2=>(bgzip -c /dev/stdin > my.sorted-2.fq.gz) S=/dev/null O=/dev/null O2=/dev/null collate=1 colsbs=2097152

Could this have something to do with internal version-name sorting and the near-identical read names (identical save for the zero between t and 6)?

mjafin commented 8 years ago

@chapmanb

gt1 commented 8 years ago

Yes, names differing only in the number of leading zeros in their embedded numbers were treated as equal during collation. Could you please try version 2.0.39? I have changed the name comparison scheme.
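
To illustrate the effect described here, the sketch below shows how a comparison that ignores leading zeros in digit runs conflates the two read names from the report, while an exact string comparison keeps them apart. This is not biobambam2's code: `zero_insensitive_key` is a hypothetical helper, written only to reproduce the behaviour under that assumption.

```python
import re


def zero_insensitive_key(name):
    """Split a name into digit and non-digit runs and drop leading zeros
    from each digit run, so "06" and "6" produce the same key element."""
    parts = re.findall(r"\d+|\D+", name)
    return tuple((p.lstrip("0") or "0") if p.isdigit() else p for p in parts)


a = "0302t6w"
b = "0302t06w"

# Numeric-aware comparison: the two names collapse to the same key.
print(zero_insensitive_key(a) == zero_insensitive_key(b))

# Exact comparison: the names are distinct, as they should be for collation.
print(a == b)
```

With a key like this, mates of two different templates can be matched to each other, which is consistent with the mixed-up pairs seen in the output files.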

mjafin commented 8 years ago

@gt1 thanks for the super prompt reply, I'll give that a go

mjafin commented 8 years ago

Working great, thanks @gt1

gt1 / biobambam2 #18