Open jkbonfield opened 4 years ago
FWIW, this same thing happens when representing whole genome alignments (e.g. as produced by minimap2) in bam; if the query sequence is > 2^28 then the clipping operations that indicate how far along the query sequence the alignment started may be wrong, causing some problematic behavior when trying to interpret the alignments in a dotplot context, where both query and ref positions for each aligned block are used (ends up looking like a translocation of the latter part of the query chromosome back to the beginning). Granted, chromosome sequences aren't exactly "reads" but it is handy to be able to use bam to represent whole-genome alignments for use with other tools. So +1 to addressing this!
https://github.com/samtools/htslib/pull/976 was an attempt to fix this for CRAM, but I forget why we rejected it as an idea.
(I think trying to represent whole genome alignments in BAM is perhaps a bit of an abuse, but my PR was to fix something more mundane and realistic - TopHat alignments on non-human genomes.)
Maybe I'll resurrect the idea, and put in logic for the SAM parser too.
Thanks for the quick reply and agreed (per #976) that the error handling has been much improved- I forgot that when I was really getting confused by the behavior I was using an older version of samtools that didn't complain about the overflow issue and resulted in the "translocation-like" behavior I described (and this was basically in the context of doing round-tripping from sam -> bam -> sam).
Not sure if it's an abuse or not, but it's certainly handy to be able to read in the whole genome alignments into a bam-aware tool like IGV; so thanks for considering it (again)!
TODO: Fix the 28-bit limit on CIGAR length size in the internal memory structures.
While we have no immediate plans to support reads several Gb long, we do support references this long now.
However exome or tRNA style alignments (eg tophat) can produce long reference alignments by the addition of huge N ref-skip ops. Such alignments would be rare to be over 1Gb in genuine cases, but there is always room for badly mapped data and errors, plus there is potential for this to happen with circular genomes.
An example:
Samtools currently states:
We can try chopping it up, but even 1000000000N1000000000N1000000000N1000000000N1000000000N1000000000N1000000000N doesn't work due to the 28-bit size limit in BAM's cigar handling. The internals of htslib shouldn't be so wedded to the BAM on disk format and ought to be using more than 28-bits. The SAM parser at least can swallow this if we go below 28-bits. eg:
It can be written to BAM this way. Region queries pull this out OK. Also
samtools mpileup -r LONG:7000000000-8000000000 ~/tmp/long3.bam
works (after a suitably long pregnant pause).There are issues with the CRAM decoder though:
I think this is actually the same bug in a different guise. CRAM is combining multiple neighbouring op codes together, but in doing so is overflowing the internal 28-bit limit.