SAM/BAM/CRAM long chromosomes

jkbonfield commented 2 years ago

The issue of dealing with chromosomes longer than 2Gbp came up again recently in samtools. Htslib/Samtools can cope with 64-bit chromosome coordinates for SAM, and also the (very drafty) CRAM 4.0, but not in standard CRAM nor BAM, both of which have signed 32-bit integers. Using SAM is a workaround, but it's not supported universally (eg htsjdk rejects such data).

To facilitate more tool chains and coping with limited formats such as BAM, one technique is to artificially break long chromosomes into smaller ones. Eg chr1a, chr1b, chr1c. However there's no standard meta-data here that can be used to indicate these form the same chromosome.

Things that we'd need to know:

Which @SQ lines belong together; a name of a previous SQ field or a name of a common external assembly contig ID.
What the start coordinate is on the original assembly chromosome, or a relative offset to a previous SQ line. This could also be implicit (a running total, so they butt up end to end in the order defined in SQ header), but we may wish to have overlap so we don't need to truncate alignments. In this case we'd need an offset into the previous chromosome.
Optionally what total length is the original assembly chromosome.

An example could be:

@SQ SN:1a   LN:2000000000
@SQ SN:1b   LN:2000000000   PN:1a   PO:1999999850
@SQ SN:1c   LN:123456789    PN:1c   PO:1999999850

This specifies previous-name (PN) linking to an earlier SN, and PO as an offset into that previous name so we know how the two link. PN isn't strictly needed here, so maybe we can omit it. If we do have it then it raises questions, such as what happens if we have multiple PN to the same SN at the same coordinate? That's starting to feel like alternative alleleles, which is supported by AH, but that is strictly limited to alternate locus and banned from being used on the primary assembly contigs. The scheme above would also encode basic graphs, unless we add rules to forbid it. That's potentially useful, and also maybe a hindrance to tools that aren't aware of the headers.

The alternative to a daisy chain, with all the pros and cons that go with it, is always anchoring to a primary assembly with an absolute coordinate. Eg:

@SQ SN:chr1a    LN:2000000000   PN:chr1 PO:1
@SQ SN:chr1b    LN:2000000000   PN:chr1 PO:1999999850
@SQ SN:chr1c    LN:123456789    PN:chr1 PO:3999999700

That's primary-name (chr1) and primary offset. The offset fields though are now 64-bit and code reading the file would need to cope with this. This could still permit overlapping data representing alternate locii, but we shouldn't be using it that way unless it also has an associated AH tag. We could also add a PL tag for the total primary length, so we know how much there is at either side of this fragment without scanning through all other header records. My gut instincts tell me this second representation is the better one, but it depends on whether we wish to permit graph construction in the SQ headers. Also perhaps A for assembly instead of P in the tag naming too.

What are people's thoughts on tackling this issue? It's clearly becoming a problem, especially with projects like Darwin Tree of Life, so we need to provide an appropriate way of managing the issue before people invent their own mechanisms.

Tagging @mcshane for input too.

muffato commented 2 years ago

Hi @jkbonfield . I much prefer the second solution: it looks more intuitive to me, but also provides the original sequence name as PN. Once a format is agreed, could htslib do the conversion itself ? i.e.:

its API would be using 64-bits
reading/writing SAM would be as is
writing BAM would do an implicit split every ~512 MB, and reading would merge the chunks back
writing CRAMv3 would do an implicit split every ~2 GB, and reading would merge the chunks back

jkbonfield commented 2 years ago

The implicit vs explicit nature of this is probably outside of hts-specs itself, but it's an interesting point and agreed it's good if we can produce a nomenclature that makes that both possible and easy. I'll have to think some more on it, but it does sound doable, in much the same way we already transparently handle "alternative names".

JuiTse commented 1 year ago

Dear colleagues and @jkbonfield , Could these tags been used recently? Neither PN nor PO is the original tags in the SAM format. If they have not been added, what is the solution for this issue now?

Now I have successfully separated the chromosome of SAM file into two part (chr1_1: 1-1260000000, chr1_2: 1260000001-2364278061, which the original chr1 is 2364278061bp) with header {at}SQ SN:chr1 LN:1260000000 and {at}SQ SN:chr1 LN:1104278061 (Total length minus 1260000000).

For the chr1_1, everything works perfect, the SAM file can be converted to BAM with successful subsequent sort, duplicate removal, and SNPs calling.

But for chr1_2, seems like the value per se is out of range despite I have already sliced it to half, the following error still present when converting SAM to BAM by samtools view -b mySAM > myBAM, [E::bam_write1] Positional data is too large for BAM format samtools view: writing to standard output failed

It seems like just one step for me to succeed the whole things, I would really appreciate if anyone could give me some suggestions. Thank you.

jkbonfield commented 1 year ago

On the first question, this wasn't carried forward and it's currently stalled. If there is sufficient community interest we can bring it up again and work on it. I personally think it's a good fill-in solution until something more concrete comes along (which realistically will be years I expect).

As for the second part, it's not just the length of chromosome that is the issue, but also the coordinates. It's not sufficient to simply rename the chromosomes into chr1_1 and chr1_2, you'll need to shift down all the coordinates in chr1_2 too. For ease of human processing, this is probably best done splitting at a nice round number like 1 billion so it's easy to mentally work out the original coordinates in subsequent chromosome portions.

JuiTse commented 1 year ago

Dear @jkbonfield , really thank for the quick reply! I will further try it. For the shift, do you mean minus 1 billion for all coordinate, such as change the positions from 1.26 billion to 0.26 billion, and also others in all chr1_2?

jkbonfield commented 1 year ago

Given you had 1.26 billion for the first chr, I mean minus 1.26 billion for the 2nd (remembering they'll start at bp 1 and not 0 obviously). The 1 billion comment is really just a hint at an easier (for humans) place to split at rather than just in half. However given the size of your contig I can understand why you did it there as you'd rather have 2 than 3. No "correct answer" on where to split.

JuiTse commented 1 year ago

Dear @jkbonfield , It successfully work, really thank for your kind help and suggestions.

Just a last confirmation, If I shift the coordinate in the second part of the chromosome in SAM file (like all minus 1.26 billion in chr1_2 in the forth column in SAM file), then for the variant calling by bcftools mpileup, the chromosome 1 in reference fasta provided in -f option should also be minus 1.26 billion for the match of the shifted coordinate system in SAM from chr1_2, right?

jkbonfield commented 1 year ago

Yes you'd need to keep everything consistent.

Really the "correct" way of doing that is to fragment your assembly into small enough chunks (ie <2GiB) and realign. That way the positions work, but also things like mate pair coordinates, augs tags such as SA, etc. That does of course assume you're using a paired-end library, or that you're doing something which cares about those tags.

samtools / hts-specs

SAM/BAM/CRAM long chromosomes #655