Issues with long ref and ref-skips

jkbonfield commented 4 years ago

TODO: Fix the 28-bit limit on CIGAR length size in the internal memory structures.

While we have no immediate plans to support reads several Gb long, we do support references this long now.

However exome or tRNA style alignments (eg tophat) can produce long reference alignments by the addition of huge N ref-skip ops. Such alignments would be rare to be over 1Gb in genuine cases, but there is always room for badly mapped data and errors, plus there is potential for this to happen with circular genomes.

An example:

@SQ SN:LONG LN:8000000000
SRR065390.14978392  16  LONG    2   1   27M7000000000N73M   *   0   0   CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA    #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Samtools currently states:

$ samtools view ~/tmp/long.sam
[E::sam_parse1] numeric value out of allowed range
[W::sam_read1] Parse error at line 2
[main_samview] truncated file.

We can try chopping it up, but even 1000000000N1000000000N1000000000N1000000000N1000000000N1000000000N1000000000N doesn't work due to the 28-bit size limit in BAM's cigar handling. The internals of htslib shouldn't be so wedded to the BAM on disk format and ought to be using more than 28-bits. The SAM parser at least can swallow this if we go below 28-bits. eg:

@SQ SN:LONG LN:8000000000
SRR065390.14978392  16  LONG    2   1   27M200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N73M    *   0   0   CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA    #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

It can be written to BAM this way. Region queries pull this out OK. Also samtools mpileup -r LONG:7000000000-8000000000 ~/tmp/long3.bam works (after a suitably long pregnant pause).

There are issues with the CRAM decoder though:

$ samtools view ~/tmp/long3.cram
SRR065390.14978392  16  LONG    2   1   27M20678144N73M *   0   0   CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA    #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

I think this is actually the same bug in a different guise. CRAM is combining multiple neighbouring op codes together, but in doing so is overflowing the internal 28-bit limit.

adf-ncgr commented 3 years ago

FWIW, this same thing happens when representing whole genome alignments (e.g. as produced by minimap2) in bam; if the query sequence is > 2^28 then the clipping operations that indicate how far along the query sequence the alignment started may be wrong, causing some problematic behavior when trying to interpret the alignments in a dotplot context, where both query and ref positions for each aligned block are used (ends up looking like a translocation of the latter part of the query chromosome back to the beginning). Granted, chromosome sequences aren't exactly "reads" but it is handy to be able to use bam to represent whole-genome alignments for use with other tools. So +1 to addressing this!

jkbonfield commented 3 years ago

https://github.com/samtools/htslib/pull/976 was an attempt to fix this for CRAM, but I forget why we rejected it as an idea.

(I think trying to represent whole genome alignments in BAM is perhaps a bit of an abuse, but my PR was to fix something more mundane and realistic - TopHat alignments on non-human genomes.)

Maybe I'll resurrect the idea, and put in logic for the SAM parser too.

adf-ncgr commented 3 years ago

Thanks for the quick reply and agreed (per #976) that the error handling has been much improved- I forgot that when I was really getting confused by the behavior I was using an older version of samtools that didn't complain about the overflow issue and resulted in the "translocation-like" behavior I described (and this was basically in the context of doing round-tripping from sam -> bam -> sam).

Not sure if it's an abuse or not, but it's certainly handy to be able to read in the whole genome alignments into a bam-aware tool like IGV; so thanks for considering it (again)!

samtools / htslib

Issues with long ref and ref-skips #971