igvteam / igv

Integrative Genomics Viewer. Fast, efficient, scalable visualization tool for genomics data and annotations
https://igv.org
MIT License
647 stars 386 forks source link

Improve treatment of supplementary alignments #298

Closed jrobinso closed 5 years ago

jrobinso commented 8 years ago

This issue is for general discussion of treatment supplementary alignments (bam tag 0x800 & SA tag). See also #277 for discussion of connecting supplementary alignments. @amwenger @pb-jchin @cwhelan @SHuang-Broad

jrobinso commented 8 years ago

From @amwenger wrt popup text:

"Perhaps we could list each SA alignment on its own line listing the read span, reference span, strand, and MAPQ (and perhaps even NM). One format would be “read 3,157-10,123 to chr6:43,143,415-43,149,942 (-) @ MAPQ 60 NM 763” which would mean that bases 3,157-10,123 of the read align to the specified reference location at the listed MAPQ and NM."

jrobinso commented 8 years ago

From @cwhelan

Being able to link the primary and supplementary alignments would be great. Since they could overlap with one another this might even work better if this worked similarly to a reads->group by->read name feature.

The other thing with long reads/contigs is that it'd be nice to be able to visualize how the different supplementary alignments relate to each other. For example, two different supplementary alignments might overlap with one another on the contig. Or, there might be a region of the contig that's not aligned in between two supplementary alignments. These can all mean different things in terms of structural variation.

Some examples:

1)

Read/Contig: |-------*------*@----@-----|

Region between the stars aligns to chr1:1000-2000. Region between the at signs aligns to chr1:3000-4000.

This indicates that there's a deletion of chr1:2000-3000.

2)

Read/Contig: |-------*------*---@----@-----|

Region between the stars aligns to chr1:1000-2000. Region between the at signs aligns to chr1:3000-4000. Region in the middle between the * and the @ is not covered by a primary/supplementary alignment.

This indicates that there's a deletion of chr1:2000-3000, and that at the site of the deletion that middle sequence between the * and the @ has been inserted.

2)

Read/Contig: |-------*---@---*-@-----|

Region between the stars aligns to chr1:1000-2000. Region between the at signs aligns to chr1:3000-4000. The alignments overlap in read/contig coordinates.

This indicates that there's a deletion somewhere around chr1:2000-3000, but the reference sequence flanking each end of the deletion is homologous (such that the sequence on the contig between the @ and the * aligns well to both regions). We know there's a 1kb deletion but the start and end point can be shifted within the bounds of the homologous sequence.

Anyway, some sort of view that could list, lay out visually, or allow you to navigate between consecutive non-secondary alignments in read/contig coordinate order would be helpful.

jrobinso commented 8 years ago

More from @cwhelan

I think it does make sense to link them somehow. My only caveat is that if possible it might be best not to give too 'linear' an impression of them, since the supplementary alignments won't necessarily be linear along the read. For example if a read has four alignments:

|----|----|-----|----| A B C D

And regions A and D are on the same reference chromosome, but in reference coordinates D comes before A, and B and C are on different chromosomes, a simple line linking A and D might be misleading. Does that make sense? This sort of thing could happen in cancer genomes, and often contigs will have mobile element insertions that map to other mobile elements in the reference, so these type of alignments come up with some regularity even in germline data.

amwenger commented 8 years ago

@cwhelan makes good points about the complexity of displaying SVs in reference aligned read pileups. For the more complex situations, I think the best value that IGV can provide is flagging the region so that it is clear that some SV event is present there. The visualization may be better done in a more specialized tool that ideally would be integrated with IGV (e.g. click on a read to view in SV tool).

jrobinso commented 8 years ago

I agree with @amwenger, more complex situations should probably be rendered in a separate linked visualization. We can only push the standard alignment view so far without it getting weird for the normal case. Overlapping alignments are particularly hard in the standard view.

jrobinso commented 8 years ago

@amwenger -- re Supplementary tag info, how does one determine the following? I don't see information on the read sequence

"mean that bases 3,157-10,123 of the read align to the specified reference location"

amwenger commented 8 years ago

@jrobinso - That information can be obtained from the CIGAR string component of the SA entry. Clipping gives the read start position and the sum of match and insertion events give the number of read bases.

As an example (PacBio data so long CIGAR):

SA = chr6,43173849,+,4420S12M1I13M3D15S,52,74

CIGAR:
    4420S    read start position is 4421 (strand is "+")
      12M    12 read bases
       1I    +1 = 13 read bases
      13M    +13 = 26 read bases
       3D
      15S

So that would say read 4,421-4,447 for this example
jrobinso commented 8 years ago

So is the CIGAR string of the SA entry relative to the read sequence of the current alignment? I thought that was the actual CIGAR of the corresponding alignment. If the read sequence is copied in all connected alignments the 2 would be the same, but I didn't think this was the case. The spec is ambiguous on it.

On 8/12/16 3:57 PM, amwenger wrote:

@jrobinso https://github.com/jrobinso - That information can be obtained from the CIGAR string component of the SA entry. Clipping gives the read start position and the sum of match and insertion events give the number of read bases.

As an example (PacBio data so long CIGAR):

|SA = chr6,43173849,+,4420S12M1I13M3D15S,52,74 CIGAR: 4420S read start position is 4421 (strand is "+") 12M 12 read bases 1I +1 = 13 read bases 13M +13 = 26 read bases 3D 15S So that would say read 4,421-4,447 for this example |

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/igvteam/igv/issues/298#issuecomment-239579437, or mute the thread https://github.com/notifications/unsubscribe-auth/AA49HH1KzXjT-Ypbp06kdhbMzV1uX7TYks5qfPpkgaJpZM4JdMOG.

amwenger commented 8 years ago

I do not know for certain, but I think that an SA entry uses exactly what you say - the CIGAR copied from the other alignment. My point about strand is that different read alignments may be on different strands and thus some CIGARs will be relative to the "native" read orientation and some to the reverse complement. When reporting read coordinates, I think it makes most sense to use the native read orientation.

jrobinso commented 8 years ago

I'm looking at some PacBio alignments from the Broad and they don't all follow this convention. Some are hard clipped, some are soft clipped. The important point would be are all cigars relative to the same read sequence (the "original" sequence)? Nothing in the spec says this has to be so, and I know for one technology in use here it is not. Rather, the read sequence is just split up among the various pieces with no clipping information (i.e. no hard clip on the cigar). So maybe if clip (hard or soft, or both) are specified we can ASSUME they are in reference to the original read sequence, if missing we can't know.

amwenger commented 8 years ago

Thanks for the explanation.

Another alternative to assuming a convention is to leave out the read coordinate portion of my proposed format; that is, just say "chr6:43,143,415-43,149,942 (-) @ MAPQ 60 NM 763". The CIGAR will still be needed to get the REF end coordinate, but that does not depend on any strand convention.

amwenger commented 8 years ago

@jrobinso I have been utilizing this new "Link Supplementary Alignments" feature; it is very useful. I have a few suggestions to make it even better:

  1. Label the line that connects alignments with the size of the gap between the two alignments just as we do for deletions.
  2. Rather than coloring the full alignments when there is strand disagreement, just color the connecting line and only when the two adjacent alignments being connected are on different strands. The coloring is a bit too common.
  3. For paired end alignments (i.e. most Illumina data), this feature is connecting the paired reads. Instead, it should only connect primary and supplementary alignments for a single end read. Perhaps require the 0x40 flag (isFirstOfPair attribute) to match in order to connect alignments from a paired read.
jrobinso commented 8 years ago

@amwenger thanks for the feedback, very helpful

RE 1) -- we could do that. However, currently the connecting line is a little tenuous, I don't have any code to order the alignments correctly or provide any interpretation for the connecting line -- i.e. the alignments might not be "connected" at all, they just happen to be adjacent, so I'm a bit hesitant to emphasize the line or equate it to a deletion line which we do know how to interpret (i.e. it is explicitly marked with a D operator).

RE 2) -- why would the coloring be common (why is it common to have mixed strands)? If you recall I asked about this after viewing the 1KG pacbio file you gave me for testing, I think the source was NCBI. We decided this was some issue with the file or process that produced it. The pacbio file I'm using from the Broad does not have this issue, its very clean in this regard. In any event if its common its not useful, rather than just coloring the connecting line I think there should be an option to turn it off. We could generalize "color by pair orientation" to include this, and you can switch it off as you do any other color option. The line is not always visible, e.g. with inversions the alignments are butted together, and I think that is too subtle.

RE 3) -- that should not happen, this is how the linking "readname" is constructed to link, and it is working correctly with my Illumina files. Are you sure you are using a recent version?

                bc = a.getReadName();
                if(a.isPaired()) {
                    bc += a.isFirstOfPair() ? "/1" : "/2";
                }
amwenger commented 8 years ago

Thank you for the responses, Jim. I think you are correct on all fronts.

  1. Excellent point about the line denoting adjacency in reference space and not necessarily deletion.
  2. You're correct that the prolific coloring in the file that I sent you was a problem with NCBI representation of PacBio reads. The less common but still present coloring in my current BAM is likely from my attempt to use very sensitive alignment parameters, which I think creates local false positive alignments. Anyway, leave it the way that you have it for now. I will comment again if I have stronger evidence that coloring is too common to be useful.
  3. I updated from your branch and the issue is resolved as you suggest.
jrobinso commented 8 years ago

RE 2) I still think it should be an option, perhaps on by default when you select linking. The option would appear with the other "color by" entries.

sjackman commented 7 years ago

Hi, Jim. Would you consider adding a View supplementary alignment region in split screen option similar to View mate region in split screen? Or alternatively, a View / Split screen menu option? http://software.broadinstitute.org/software/igv/AlignmentData#splitscreen

SHuang-Broad commented 7 years ago

I support @sjackman's proposal because for large-sized SV, the supplementary alignments might be located quite far away and it would be convenient to have such a feature.

sjackman commented 7 years ago

Or on a different chromosome, or in my case with de novo assemblies, on a different scaffold.

jrobinso commented 7 years ago

Yes, good idea, so use the SA tag to bring up a view of each region, and maybe highlight the chosen read? I haven't verified exhaustively but I assume every alignment from a split read has the SA tag, including the "primary" one.

On Wed, May 31, 2017 at 12:25 PM, Shaun Jackman notifications@github.com wrote:

Or on a different chromosome, or in my case with de novo assemblies, on a different scaffold.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/igvteam/igv/issues/298#issuecomment-305291226, or mute the thread https://github.com/notifications/unsubscribe-auth/AA49HDZZJFzbJjJWN6-yrBcVxgh0UJSkks5r_b6lgaJpZM4JdMOG .

sjackman commented 7 years ago

Great! Yes, correct.

SA:Z:(rname ,pos ,strand ,CIGAR ,mapQ ,NM ;)+ Other canonical alignments in a chimeric alignment, for- matted as a semicolon-delimited list. Each element in the list represents a part of the chimeric align- ment. Conventionally, at a supplementary line, the first element points to the primary line.

https://github.com/samtools/hts-specs/blob/master/SAMtags.pdf

SHuang-Broad commented 7 years ago

@jrobinso @sjackman , just to clarify (both of you mostly likely already know this): the SA tags for all the chimeric alignments of a sequence are not the same. An alignment is not going to put itself into its SA tag.

SHuang-Broad commented 7 years ago

In the mean time, I've been looking into complex SV's. And one of the signature of complex SV is that a long read (might be an locally assembled sequence) might produce a large collection of alignments (I've seen up to 10), but some of them have low mapping quality and short alignment length. It might worth thinking about how such small "noise" could be filtered.

EDIT

One example is here

sjackman commented 7 years ago

Not just for chimeric alignments, but I've for a long time wished for a feature to filter alignments displayed by IGV by mapping quality (MAPQ), alignment score (AS:i or AS:f for Long Ranger), and number of mismatches (NM:i). For example, AS ≥ 100 & NM < 5. Filtering by percent identity or percent query coverage could be useful too, but trickier since these don't usually have SAM tags.

jrobinso commented 5 years ago

@sjackman @amwenger @SHuang-Broad This is a long thread with many good suggestions. However its 2 years old. I'm trying to clear up open IGV issues. If any of the specific suggestions here are still relevant, I suspect they are, could you indicate it, or better open an issue for that suggestion and reference this issue.

oneillkza commented 4 years ago

@jprobinso, just noting that Shaun's suggestion of a "split screen by supplementary alignment" is also one I've been asking for, and would definitely be very useful! https://github.com/igvteam/igv/issues/298#issuecomment-305267693

jrobinso commented 4 years ago

@oneillkza I just opened an explicit issue for this. All the issues tagged "3rd gen" are up next for IGV development. As you may or may not know I also do the igv.js and juicebox.js development, so I rotate, IGV is up again next week and hope to release these by the end of the month.

oneillkza commented 4 years ago

Thanks @jprobinso, that's great news!

PS: I do still have it on my to-do list to get you a small example Nanopore bam file. I'm also a little snowed under right now!

jrobinso commented 4 years ago

@oneillkza no problem, were you the one that reported the "3rd gen preferences don't do anything"? there were 2 issues there, 1 is fixed (in snapshot build), and there is now an explicit menu to just set it.

oneillkza commented 4 years ago

Yep, that was me! Thanks -- I'll check out the snapshot. I also linked you a Nanopore bam for testing.