samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
632 stars 174 forks source link

Issues with definition/use of "read" and "segment" #73

Open gerton opened 9 years ago

gerton commented 9 years ago

Dear all,

We have been going through the SAM specification with a fine comb, and found a few issues.  

The main issue is that there seems to be a confusion about ‘read’ and ‘segment’ at various places, and most importantly, the constituent parts of a chimeric alignment are not currently chained together, so that to find all records of a chimeric alignment you’ll have to read the entire BAM file.

Another, smaller issue is that flag bit 0x1 appears to be superfluous given the state of 0x40 and 0x80; and in the proposed spec, you cannot work out whether a template has multiple reads (rather than segments) contributing to it, other than by recovering all records.

Just to be clear, we think that the intended concepts of template, read and segment are:

and that a SAM record refers to an alignment of a segment.

Here is a list of proposed edits to the spec:

Page 2, section 1.2:

Edit: Linear alignment: an alignment of a segment to a single reference sequence ….

Page 2, section 1.2:

Edit: Chimeric alignment: an alignment of a read that cannot be represented as a single linear alignment. […] All the SAM records in a chimeric alignment have the same QNAME ~~and the same values for 0x40 and ox80 tags (see Section 1.4)~~ The 0x40 and 0x80 flag definitions refer to segments rather than reads, so different segments should be able to have the appropriate flags.

Page 2, section 1.2:

Edit: Multiple mapping: replace “read” by “segment” throughout.

Page 4, section 1.4:

Col 7, RNEXT […]  Ref. name of the mate / next segment Having RNEXT/PNEXT point to the next segment rather than the next read, allows all SAM records related to a single read to be efficiently recovered from a large BAM file. In addition, the concept of a "mate" is never defined, and it refers to a read not a segment.

Col 8, PNEXT […]  Position of the mate / next segment

Page 4, section 1.4. As it stands, bit 0x1 signifies whether a template has multiple segments.  If it does not, the first segment is also the last, and therefore bits 0x40 and 0x80 are both set; and if they’re both set, there can only be one segment.  Therefore bit 0x1 is redundant.  If its meaning is changed into “template having multiple reads in sequencing”, it allows users to determine whether a template has multiple reads; this information can be inferred from the whole set of chimeric reads, but these first need to be obtained from the BAM file.  In addition, the proposed meaning is identical to the current meaning of this bit.

Page 4, section 1.4. For each read / contig in a SAM file, it is required that one and only one […]. The concept of a "contig" is not defined.

Page 5, section 1.4. “If 0x40 and 0x80 are both set, the read is part of a linear template, but is neither the first nor the last read”.

This does not seem to make sense — this redefines the meaning of 0x40 == 0x80 == 1 to be the same as 0x40 == 0x80 == 0, for no reason that is apparent to us.  It also clashes with the proposed redefinition of the meaning of the 0x1 bit.

Page 5, section 1.4, “…This may happen for a chimeric alignment or if the index is lost in data processing”.  

Page 5, section 1.4, “If 0x1 is unset, no assumption can be made about 0x2, 0x8, 0x20, 0x40 and 0x80”.  If the proposed redefinition of the meaning of 0x1 is accepted, this line should be removed altogether, as single reads can still consist of multiple segments.

Page 5, section 1.4. “RNEXT: Reference sequence name of the primary alignment of the NEXT segment in the template.  For the last segment, the next segment is the first segment in the template. […]

Page 5, “PNEXT: Postition of the primary alignment of the NEXT segment in the template. […]

Gerton and Daniel.

jkbonfield commented 9 years ago

Thanks for the comments. I haven't time to study them in detail yet but I think they cover some of the same issues I was struggling with when implementing CRAM.

On Wed, Mar 11, 2015 at 10:02:58AM -0700, gerton wrote:

Another, smaller issue is that flag bit 0x1 appears to be superfluous given the state of 0x40 and 0x80; and in the proposed spec, you cannot work out whether a template has multiple reads (rather than segments) contributing to it, other than by recovering all records.

I'm not so convinced on your 0x1 point. Having both 0x40 and 0x80 set makes sense as an indicator that this read is internal and is neither the first nor the last in the template. I can see how having both unset could be used instead, although this is perhaps currently being used to indicate unknown status? However I'd be wary to change that now.

As for knowing how many segments and or reads there are, see https://github.com/samtools/hts-specs/pull/53 and for their impact on supplementary reads: https://sourceforge.net/p/samtools/mailman/message/33235303/

I produced a PDF at some point with a lot of diagrams and discussion points, but went on to other things when there was very little community interest.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

dancooke commented 9 years ago

I'm not so convinced on your 0x1 point. Having both 0x40 and 0x80 set makes sense as an indicator that this read is internal and is neither the first nor the last in the template.

But why are we even talking of reads here at all? The 0x40 and 0x80 bits both refer to segments, not reads. It seems to me that the two terms are being used interchangeably, whereas in general a segment might not be a read.

gerton commented 9 years ago

On 11 Mar 2015, at 17:31, James Bonfield notifications@github.com wrote:

Thanks for the comments. I haven't time to study them in detail yet but I think they cover some of the same issues I was struggling with when implementing CRAM.

On Wed, Mar 11, 2015 at 10:02:58AM -0700, gerton wrote:

Another, smaller issue is that flag bit 0x1 appears to be superfluous given the state of 0x40 and 0x80; and in the proposed spec, you cannot work out whether a template has multiple reads (rather than segments) contributing to it, other than by recovering all records.

I'm not so convinced on your 0x1 point. Having both 0x40 and 0x80 set makes sense as an indicator that this read is internal and is neither the first nor the last in the template. I can see how having both unset could be used instead, although this is perhaps currently being used to indicate unknown status? However I'd be wary to change that now.

The bits 0x40 and 0x80 are defined as marking “first” and “last” segment of a template, in line with the current use as first/last read. Following this definition, setting both then marks a read as BOTH first and last, hence single-segment (and thus single read). I’m not sure why the current proposal is that having both set marks them as internal instead — this should logically be the interpretation of having both bits unset.

I’m aware that both unset is currently interpreted as missing information. With the proposed definition, you’d have an incomplete chain (one lacking a marked start and end). This could be explicitly allowed to encode the same thing — in this way, the “missing information” status would be encoded by the entire chain, rather than by flag bits in a single SAM record.

This is all still background. However if these interpretations are followed, it does free up bit 0x1 as its status can be inferred from 0x40/0x80. The proposal to move it back to the old definition, template has multiple READS in sequencing, has the virtue of being more conservative, as it follows current usage better. It would be a function of the experiment, rather than being a feature of in the end arbitrary decisions of the aligner.

Best Gerton

As for knowing how many segments and or reads there are, see https://github.com/samtools/hts-specs/pull/53 and for their impact on supplementary reads: https://sourceforge.net/p/samtools/mailman/message/33235303/

I produced a PDF at some point with a lot of diagrams and discussion points, but went on to other things when there was very little community interest.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. — Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/73#issuecomment-78313063.

jkbonfield commented 9 years ago

Quite so, I agree with the read vs segment view. Indeed I previously wrote "However 0x40 and 0x80 flags elsewhere are defined to be the first and last segment, not the first and last read. This is at odds with chimeric reads having multiple segments with 0x40 and 0x80 set". For the life of me I cannot find where I posted this, just my original document! I was sure it ended up on samtools-devel somewhere. I put a copy of my RFC on an ftp site instead. See (when it deigns to update itself): ftp://ftp.sanger.ac.uk/pub/users/jkb/RFC-pnext.pdf

However in some places reads and not segements is how SAM is defined and it appears quite deliberate (and IMO wrong).

For example PNEXT/RNEXT refer to reads, not segments. To me it is incorrect, but in actual implementation within bwa Heng very explicitly treats supplementary alignments as invalid for the target of a PNEXT/RNEXT chain, even if they are a segment from a primary read alignment. This seems like a concious effort to treat the first/last segment of a read as a placeholder for the read itself.

I don't quite get the logic though as it simply makes the fields useless when dealing with supplementary data; there is no longer a way to chain through the records and find them all, or even to know how many there are. (There are SAM auxiliary fields which partially cover this, but they aren't mandatory and don't appear to be in use by aligners.)

IMO PNEXT/RNEXT should operate on logical units and ditch the whole primary bit completely. For example, even ignoring supplementary alignments and multiple segments per read, if we have a read-pair aligning to chr1 and to a duplicate region in chr2, then we have two logical pairings; one primary and one secondary. The specification however requires us to make the secondary pairing unpaired, linking back to their primary rather than the very logical other-end secondary.

jkbonfield commented 9 years ago

Ah I finally found the thread on samtools-devel, along with the whopping zero replies. :-)

https://sourceforge.net/p/samtools/mailman/message/33003353/

How well does this tie in with your suggestions?

jkbonfield commented 9 years ago

On Thu, Mar 12, 2015 at 02:42:37AM -0700, gerton wrote:

The bits 0x40 and 0x80 are defined as marking "first" and "last" segment of a template, in line with the current use as first/last read. Following this definition, setting both then marks a read as BOTH first and last, hence single-segment (and thus single read). I'm not sure why the current proposal is that having both set marks them as internal instead -- this should logically be the interpretation of having both bits unset.

I agree that it would have made sense to allow both 0x40 and 0x80 to be defined when indicating a single ended alignment and having neither set to indicate that it is neither the first, nor last, but somewhere inbetween. It's perfectly logical, but it's simply not how it has been used in the past and to change it now would cause massive chaos. It frees up bit 0x1, but repurposing bit 0x1 would be even chaos. Put simply, I wouldn't have started from here, but here we are so we have to live with it.

The main issue I see is that the chain information is basically broken and not useful. SAM grew out of assuming only 1 or 2 alignments per template, despite the fact that we routinely had > 2 in our historical data. An attempt was made to reverse this assumption by changing the mate ref/pos fields to a "next" ref/pos field, but it wasn't fully complete. This is what we need to fix fundamentally.

This is all still background. However if these interpretations are followed, it does free up bit 0x1 as its status can be inferred from 0x40/0x80. The proposal to move it back to the old definition, template has multiple READS in sequencing, has the virtue of being more conservative, as it follows current usage better. It would be a function of the experiment, rather than being a feature of in the end arbitrary decisions of the aligner.

I think practically speaking 0x1 already does mean multiple READS, despite what the specification states. I just tested bwa-mem on a split segment alignment of a single ended run and surely enough it doesn't set 0x1 as it is a single read. Given it was almost certainly Heng who changed that part of the spec from read to segment, my instinct tells me that this is a search/replace mistake rather than an intentional change in meaning. Are there any other aligners out there producing split reads that changed to using segments for this flag?

I'd certainly prefer it back as read again.

James

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

gerton commented 9 years ago

On 12 Mar 2015, at 09:55, James Bonfield notifications@github.com wrote:

Quite so, I agree with the read vs segment view. Indeed I previously wrote "However 0x40 and 0x80 flags elsewhere are defined to be the first and last segment, not the first and last read. This is at odds with chimeric reads having multiple segments with 0x40 and 0x80 set". For the life of me I cannot find where I posted this, just my original document! I was sure it ended up on samtools-devel somewhere. I put a copy of my RFC on an ftp site instead. See (when it deigns to update itself): ftp://ftp.sanger.ac.uk/pub/users/jkb/RFC-pnext.pdf

However in some places reads and not segements is how SAM is defined and it appears quite deliberate (and IMO wrong)

For example PNEXT/RNEXT refer to reads, not segments. To me it is incorrect, but in actual implementation within bwa Heng very explicitly treats supplementary alignments as invalid for the target of a PNEXT/RNEXT chain, even if they are a segment from a primary read alignment. This seems like a concious effort to treat the first/last segment of a read as a placeholder for the read itself.

I don't quite get the logic though as it simply makes the fields useless when dealing with supplementary data; there is no longer a way to chain through the records and find them all, or even to know how many there are. (There are SAM auxiliary fields which partially cover this, but they aren't mandatory and don't appear to be in use by aligners.)

That is precisely our thinking — it’s essentially impossible to obtain the supplementary records for a chimeric alignment, short of linearly searching through the BAM. Suppose we align a single long Oxford Nanopore read which overlaps a complicated structural variant which requires several supplementary records to represent the chimeric alignment; upon interrogating the locus where the SV occurs, I will need to follow the chain of alignments to see what the mapper has made of it.

I’m guessing that the original idea was to have a primary SAM record that contains all information to reconstruct the sequence data — this record would list the full sequence, and soft-clip the parts not in the alignment. Supplementary records could hard-clip the bits they don’t refer to. In this way, you can reconstruct all reads completely by stepping through the linked list. However, you cannot recover the alignment computed by the mapper.

IMO PNEXT/RNEXT should operate on logical units and ditch the whole primary bit completely. For example, even ignoring supplementary alignments and multiple segments per read, if we have a read-pair aligning to chr1 and to a duplicate region in chr2, then we have two logical pairings; one primary and one secondary. The specification however requires us to make the secondary pairing unpaired, linking back to their primary rather than the very logical other-end secondary.

Yes, I agree. In my opinion, the function of the PNEXT/RNEXT is to establish a circular linked list, linking all segments of a chimeric alignment together in their canonical order. (When reads align linearly, they link reads together in their canonical order, as they do now).

A secondary alignment establishes a completely independent set of alignments of segments; in order to be able to extract this information from the BAM efficiently these should also be linked together, in the same way. This would imply they should NOT be linking back to the other, primary, alignment.

In practice I think this would work; most algorithms would ignore any secondary alignment (so where they point to is immaterial); if algorithms do want to extract secondary alignments, they would do so on the basis of seeing the first record of a secondary alignment, and would then walk down the linked list and pick up the other associated records.

Best wishes Gerton

— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/73#issuecomment-78451026.

dancooke commented 9 years ago

How well does this tie in with your suggestions?

It seems to me this is addressing more an issue with chaining secondary alignments rather than segment chaining. We hadn't actually considered the problems you raise with secondary chaining, but the idea of multiple 'template alignments' rather than secondary segments does make sense to me.

I'm a little confused by your conclusion of "works as-is" for point E; does this example not suffer the same segment vs read issue discussed here?

For example PNEXT/RNEXT refer to reads, not segments. To me it is incorrect, but in actual implementation within bwa Heng very explicitly treats supplementary alignments as invalid for the target of a PNEXT/RNEXT chain, even if they are a segment from a primary read alignment. This seems like a concious effort to treat the first/last segment of a read as a placeholder for the read itself.

I don't quite get the logic though as it simply makes the fields useless when dealing with supplementary data; there is no longer a way to chain through the records and find them all, or even to know how many there are. (There are SAM auxiliary fields which partially cover this, but they aren't mandatory and don't appear to be in use by aligners.)

Agreed, this doesn't make sense to me (and is our main concern), I'm not sure it even makes sense to generally enforce a specific ordering of mapped segments within a read. Of course, if we chain segments we still need to be able to distinguish reads, but this could easily be determined by enforcing the 'supplementary alignment' bit refer to the (arbitrary) head segment of the read (perhaps it would make more sense to be named something like 'tail segment'; 'supplementary' is a little confusing to me). The position of the head segment of the next read could be an optional field if chaining through read segments is an issue.

lh3 commented 9 years ago

Early sam spec recommended to chain segments with RNEXT/PNEXT. Practically I found it is awkward to implement and to use. The current recommendation is to put all supplementary alignments in the SA tag. You can find where each segment on a read is mapped from a single SAM line. Bwa-mem has been doing this for a couple of years. The downside is this wastes space, which is quite significant for pacbio subeads.

jkbonfield commented 9 years ago

On Thu, Mar 12, 2015 at 06:45:27AM -0700, Heng Li wrote:

Early sam spec recommended to chain segments with RNEXT/PNEXT. Practically I found it is awkward to implement and to use. The current recommendation is to put all supplementary alignments in the SA tag. You can find where each segment on a read is mapped from a single SAM line. Bwa-mem has been doing this for a couple of years. The downside is this wastes space, which is quite significant for pacbio subeads.

That's a good point. Practically speaking using an SA tag is easier as every record points to every other record allowing easy traversal. However maybe this is something we'd only need on the primary fragment with all supplementary fragments pointing back to the primary, such that it's just one hop away to get everything without having an N^2 growth for an N-fragment template.

Also I think it is orthoganol to the issue of how PNEXT/RNEXT work. There is no reason why these cannot form multiple distinct chains (for primary and secondary alignments) and still keep SA as an alternative mechanism.

James

James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

jkbonfield commented 9 years ago

I think some of the misunderstanding here comes from the terms. We have a glossary, but some of it has changed over time and I think I may partially remember earlier definitions which consues things.

We need to be very clear here to make sure we're interpreting it correctly. I think what is intended is the following.

Template: a biological entity - the fragment of DNA that you are sequencing. In a traditional Sanger sequencing sense, this was a bit of sheared DNA cut out of a gel and inserted into m13mp18, puc19 or a similar sequencing vector. Sometimes also referred to as "insert". For Illumina the template will be the entire DNA fragment that has been bridge-amplified.

Read: A sequence from a sequencing instrument. It may produce 1 or more. Illumina paired end libraries produce 2 reads from the template (which may possibly overlap).

A 454 mate pair library worked by circularising the template with a biotin adapter, randomly sheering that, selecting/enriching for the biotin, adding adapters (is this now a new template? Probably) and then producing one single read that sequenced both ends plus the internal biotin adapter. (You can do this physicalling by rolling a sheet of paper up so the bottom edge touches the top edge and then drawing an arrow crossing the boundary; it neatly demonstrates both ends should align in the same orientation.)

Complete genomics (IIRC) had the notion of strobed reads, where the base-detection would be pulsed on and off giving a single read covering several disjoint stretches of bases on the template

Segment: A subsection of a read. In the 454 mate-pair case we have one read and two segments. For complete genomics it could be many segments. Previously this used to be referred to as fragment.

Now where it gets woolly is supplementary alignments, eg due to rearrangements. Each of those supplementary alignments can contain a subsection of the full read. Is it therefore a segment? If so we have two distinct meanings for the word segment, which is confusing.

The fundamental question which is not answered by the specification is what it is that we have per line of a SAM file. Are the sequences there segments, or something else? I guess they are alignments, but the definition of the various alignment descriptions use the word read and not segment, as does the section on multiple mappings.

For sure the specification is not internally consistent. It's been a bugbear of mine that RNEXT/PNEXT refer to reads while FLAG 0x8 refers to a segment, meaning that we cannot reliably determine the orientation of the thing referred to by RNEXT/PNEXT.

James

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

dancooke commented 9 years ago

That's a good point. Practically speaking using an SA tag is easier as every record points to every other record allowing easy traversal. However maybe this is something we'd only need on the primary fragment with all supplementary fragments pointing back to the primary, such that it's just one hop away to get everything without having an N^2 growth for an N-fragment template.

I don't immediately see the benefit of this approach over the approach I suggested above. Using this approach the SA tag is essential to avoid scanning the entire file for supplementary segments, whereas using segment chaining you don't need any optional fields to avoid doing this.

While this approach allows you to revisit supplementary segments, you could emulate this in code easily enough if needed, and presumably if you wanted this kind of behaviour you'd need to record which segments had been visited anyway.

Other than that it seems illogical to me to designate one segment as 'representative', and blurs the lines somewhat with multiple mappings.

dancooke commented 9 years ago

I think some of the misunderstanding here comes from the terms.

Almost certainly. Here's my take:

Template: a biological entity - the fragment of DNA that you are sequencing.

Agreed, although if being pedantic I might extend to something like "A DNA/RNA sequence of which one or more parts, that have known spatial properties in the sample being sequenced, will be represented in this file".

Read: A sequence from a sequencing instrument. It may produce 1 or more. Illumina paired end libraries produce 2 reads from the template (which may possibly overlap).

I agree.

Segment: A subsection of a read. In the 454 mate-pair case we have one read and two segments. For complete genomics it could be many segments.

As you say, we need to be careful here. My understanding is that while a segment is indeed a subsequence of a read (that may be the entire read), it is a consequence of the mapper rather than the sequencer. While it's the case that 454 reads contain two distinct physically separated subsequences, if the mapper incorrectly (or correctly if a SV has occurred) maps the entire read to one locus, I would say that read has a single segment.

If these are the correct interpretations, then the definitions we gave above are all sound (I think). In particular it is clear that each SAM record expresses what the mapper considers a spatially distinct subsequence of a read, while chaining expresses our prior knowledge of the spatiality of segments within reads, and reads within a template.

plijnzaad commented 8 years ago

Dear all,

I ended up at this discussion after getting confused with the phrase "A read may consist of multiple segments.". I found (this post by Istvan Albert](https://www.biostars.org/p/120461/) quite illuminating, and it seems consistent with the proposals above. However, in the proposed wording there is one thing that is still not clear to me:

In the 454 mate-pair case we have one read and two segments.

Most users seem to consider a read 'one SAM line', also if it concerns paired-end data. It would be nice to make it more explicit if a read is 'one SAM line', or if one read is 'all the segments belonging to one template'.

EDIT: I see that samtools stat output (and many places elsewhere) has the same confusion: 'read length' is in fact the length of contiguously sequenced bits of the template. (BTW samtools stat also renames segment to fragment, in the section 'summary numbers' it says 1st fragments)

Cheers, Philip

jrobinso commented 8 years ago

I see an interesting discussion here with no conclusions, any updates on these fields? I'm trying to update IGV to properly support chimeric alignments. I realize this is a spec discussion, and I need to support bams that are being produced by tools, and interpretations might differ.

d-cameron commented 8 years ago

From what I understand, this is still an outstanding issues with different tools still making different assumptions. The pathological edges cases in which some alignments cannot be unambiguously represented (see https://sourceforge.net/p/samtools/mailman/message/33054794/ ) have not been resolved.