Open jkbonfield opened 2 years ago
Hi @jkbonfield . I much prefer the second solution: it looks more intuitive to me, but also provides the original sequence name as PN
.
Once a format is agreed, could htslib do the conversion itself ? i.e.:
The implicit vs explicit nature of this is probably outside of hts-specs itself, but it's an interesting point and agreed it's good if we can produce a nomenclature that makes that both possible and easy. I'll have to think some more on it, but it does sound doable, in much the same way we already transparently handle "alternative names".
Dear colleagues and @jkbonfield , Could these tags been used recently? Neither PN nor PO is the original tags in the SAM format. If they have not been added, what is the solution for this issue now?
Now I have successfully separated the chromosome of SAM file into two part (chr1_1: 1-1260000000, chr1_2: 1260000001-2364278061, which the original chr1 is 2364278061bp) with header {at}SQ SN:chr1 LN:1260000000 and {at}SQ SN:chr1 LN:1104278061 (Total length minus 1260000000).
For the chr1_1, everything works perfect, the SAM file can be converted to BAM with successful subsequent sort, duplicate removal, and SNPs calling.
But for chr1_2, seems like the value per se is out of range despite I have already sliced it to half, the following error still present when converting SAM to BAM by samtools view -b mySAM > myBAM, [E::bam_write1] Positional data is too large for BAM format samtools view: writing to standard output failed
It seems like just one step for me to succeed the whole things, I would really appreciate if anyone could give me some suggestions. Thank you.
On the first question, this wasn't carried forward and it's currently stalled. If there is sufficient community interest we can bring it up again and work on it. I personally think it's a good fill-in solution until something more concrete comes along (which realistically will be years I expect).
As for the second part, it's not just the length of chromosome that is the issue, but also the coordinates. It's not sufficient to simply rename the chromosomes into chr1_1 and chr1_2, you'll need to shift down all the coordinates in chr1_2 too. For ease of human processing, this is probably best done splitting at a nice round number like 1 billion so it's easy to mentally work out the original coordinates in subsequent chromosome portions.
Dear @jkbonfield , really thank for the quick reply! I will further try it. For the shift, do you mean minus 1 billion for all coordinate, such as change the positions from 1.26 billion to 0.26 billion, and also others in all chr1_2?
Given you had 1.26 billion for the first chr, I mean minus 1.26 billion for the 2nd (remembering they'll start at bp 1 and not 0 obviously). The 1 billion comment is really just a hint at an easier (for humans) place to split at rather than just in half. However given the size of your contig I can understand why you did it there as you'd rather have 2 than 3. No "correct answer" on where to split.
Dear @jkbonfield , It successfully work, really thank for your kind help and suggestions.
Just a last confirmation, If I shift the coordinate in the second part of the chromosome in SAM file (like all minus 1.26 billion in chr1_2 in the forth column in SAM file), then for the variant calling by bcftools mpileup, the chromosome 1 in reference fasta provided in -f option should also be minus 1.26 billion for the match of the shifted coordinate system in SAM from chr1_2, right?
Yes you'd need to keep everything consistent.
Really the "correct" way of doing that is to fragment your assembly into small enough chunks (ie <2GiB) and realign. That way the positions work, but also things like mate pair coordinates, augs tags such as SA, etc. That does of course assume you're using a paired-end library, or that you're doing something which cares about those tags.
The issue of dealing with chromosomes longer than 2Gbp came up again recently in samtools. Htslib/Samtools can cope with 64-bit chromosome coordinates for SAM, and also the (very drafty) CRAM 4.0, but not in standard CRAM nor BAM, both of which have signed 32-bit integers. Using SAM is a workaround, but it's not supported universally (eg htsjdk rejects such data).
To facilitate more tool chains and coping with limited formats such as BAM, one technique is to artificially break long chromosomes into smaller ones. Eg chr1a, chr1b, chr1c. However there's no standard meta-data here that can be used to indicate these form the same chromosome.
Things that we'd need to know:
@SQ
lines belong together; a name of a previous SQ field or a name of a common external assembly contig ID.An example could be:
This specifies previous-name (
PN
) linking to an earlierSN
, andPO
as an offset into that previous name so we know how the two link.PN
isn't strictly needed here, so maybe we can omit it. If we do have it then it raises questions, such as what happens if we have multiplePN
to the sameSN
at the same coordinate? That's starting to feel like alternative alleleles, which is supported byAH
, but that is strictly limited to alternate locus and banned from being used on the primary assembly contigs. The scheme above would also encode basic graphs, unless we add rules to forbid it. That's potentially useful, and also maybe a hindrance to tools that aren't aware of the headers.The alternative to a daisy chain, with all the pros and cons that go with it, is always anchoring to a primary assembly with an absolute coordinate. Eg:
That's primary-name (chr1) and primary offset. The offset fields though are now 64-bit and code reading the file would need to cope with this. This could still permit overlapping data representing alternate locii, but we shouldn't be using it that way unless it also has an associated
AH
tag. We could also add aPL
tag for the total primary length, so we know how much there is at either side of this fragment without scanning through all other header records. My gut instincts tell me this second representation is the better one, but it depends on whether we wish to permit graph construction in the SQ headers. Also perhapsA
for assembly instead ofP
in the tag naming too.What are people's thoughts on tackling this issue? It's clearly becoming a problem, especially with projects like Darwin Tree of Life, so we need to provide an appropriate way of managing the issue before people invent their own mechanisms.
Tagging @mcshane for input too.