Closed jrobinso closed 5 years ago
From @amwenger wrt popup text:
"Perhaps we could list each SA alignment on its own line listing the read span, reference span, strand, and MAPQ (and perhaps even NM). One format would be “read 3,157-10,123 to chr6:43,143,415-43,149,942 (-) @ MAPQ 60 NM 763” which would mean that bases 3,157-10,123 of the read align to the specified reference location at the listed MAPQ and NM."
From @cwhelan
Being able to link the primary and supplementary alignments would be great. Since they could overlap with one another this might even work better if this worked similarly to a reads->group by->read name feature.
The other thing with long reads/contigs is that it'd be nice to be able to visualize how the different supplementary alignments relate to each other. For example, two different supplementary alignments might overlap with one another on the contig. Or, there might be a region of the contig that's not aligned in between two supplementary alignments. These can all mean different things in terms of structural variation.
Some examples:
1)
Read/Contig:
|-------*------*@----@-----|
Region between the stars aligns to chr1:1000-2000. Region between the at signs aligns to chr1:3000-4000.
This indicates that there's a deletion of chr1:2000-3000.
2)
Read/Contig:
|-------*------*---@----@-----|
Region between the stars aligns to chr1:1000-2000. Region between the at signs aligns to chr1:3000-4000. Region in the middle between the * and the @ is not covered by a primary/supplementary alignment.
This indicates that there's a deletion of chr1:2000-3000, and that at the site of the deletion that middle sequence between the * and the @ has been inserted.
2)
Read/Contig:
|-------*---@---*-@-----|
Region between the stars aligns to chr1:1000-2000. Region between the at signs aligns to chr1:3000-4000. The alignments overlap in read/contig coordinates.
This indicates that there's a deletion somewhere around chr1:2000-3000, but the reference sequence flanking each end of the deletion is homologous (such that the sequence on the contig between the @ and the * aligns well to both regions). We know there's a 1kb deletion but the start and end point can be shifted within the bounds of the homologous sequence.
Anyway, some sort of view that could list, lay out visually, or allow you to navigate between consecutive non-secondary alignments in read/contig coordinate order would be helpful.
More from @cwhelan
I think it does make sense to link them somehow. My only caveat is that if possible it might be best not to give too 'linear' an impression of them, since the supplementary alignments won't necessarily be linear along the read. For example if a read has four alignments:
|----|----|-----|----| A B C D
And regions A and D are on the same reference chromosome, but in reference coordinates D comes before A, and B and C are on different chromosomes, a simple line linking A and D might be misleading. Does that make sense? This sort of thing could happen in cancer genomes, and often contigs will have mobile element insertions that map to other mobile elements in the reference, so these type of alignments come up with some regularity even in germline data.
@cwhelan makes good points about the complexity of displaying SVs in reference aligned read pileups. For the more complex situations, I think the best value that IGV can provide is flagging the region so that it is clear that some SV event is present there. The visualization may be better done in a more specialized tool that ideally would be integrated with IGV (e.g. click on a read to view in SV tool).
I agree with @amwenger, more complex situations should probably be rendered in a separate linked visualization. We can only push the standard alignment view so far without it getting weird for the normal case. Overlapping alignments are particularly hard in the standard view.
@amwenger -- re Supplementary tag info, how does one determine the following? I don't see information on the read sequence
"mean that bases 3,157-10,123 of the read align to the specified reference location"
@jrobinso - That information can be obtained from the CIGAR string component of the SA entry. Clipping gives the read start position and the sum of match and insertion events give the number of read bases.
As an example (PacBio data so long CIGAR):
SA = chr6,43173849,+,4420S12M1I13M3D15S,52,74
CIGAR:
4420S read start position is 4421 (strand is "+")
12M 12 read bases
1I +1 = 13 read bases
13M +13 = 26 read bases
3D
15S
So that would say read 4,421-4,447 for this example
So is the CIGAR string of the SA entry relative to the read sequence of the current alignment? I thought that was the actual CIGAR of the corresponding alignment. If the read sequence is copied in all connected alignments the 2 would be the same, but I didn't think this was the case. The spec is ambiguous on it.
On 8/12/16 3:57 PM, amwenger wrote:
@jrobinso https://github.com/jrobinso - That information can be obtained from the CIGAR string component of the SA entry. Clipping gives the read start position and the sum of match and insertion events give the number of read bases.
As an example (PacBio data so long CIGAR):
|SA = chr6,43173849,+,4420S12M1I13M3D15S,52,74 CIGAR: 4420S read start position is 4421 (strand is "+") 12M 12 read bases 1I +1 = 13 read bases 13M +13 = 26 read bases 3D 15S So that would say read 4,421-4,447 for this example |
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/igvteam/igv/issues/298#issuecomment-239579437, or mute the thread https://github.com/notifications/unsubscribe-auth/AA49HH1KzXjT-Ypbp06kdhbMzV1uX7TYks5qfPpkgaJpZM4JdMOG.
I do not know for certain, but I think that an SA entry uses exactly what you say - the CIGAR copied from the other alignment. My point about strand is that different read alignments may be on different strands and thus some CIGARs will be relative to the "native" read orientation and some to the reverse complement. When reporting read coordinates, I think it makes most sense to use the native read orientation.
I'm looking at some PacBio alignments from the Broad and they don't all follow this convention. Some are hard clipped, some are soft clipped. The important point would be are all cigars relative to the same read sequence (the "original" sequence)? Nothing in the spec says this has to be so, and I know for one technology in use here it is not. Rather, the read sequence is just split up among the various pieces with no clipping information (i.e. no hard clip on the cigar). So maybe if clip (hard or soft, or both) are specified we can ASSUME they are in reference to the original read sequence, if missing we can't know.
Thanks for the explanation.
Another alternative to assuming a convention is to leave out the read coordinate portion of my proposed format; that is, just say "chr6:43,143,415-43,149,942 (-) @ MAPQ 60 NM 763". The CIGAR will still be needed to get the REF end coordinate, but that does not depend on any strand convention.
@jrobinso I have been utilizing this new "Link Supplementary Alignments" feature; it is very useful. I have a few suggestions to make it even better:
0x40
flag (isFirstOfPair
attribute) to match in order to connect alignments from a paired read.@amwenger thanks for the feedback, very helpful
RE 1) -- we could do that. However, currently the connecting line is a little tenuous, I don't have any code to order the alignments correctly or provide any interpretation for the connecting line -- i.e. the alignments might not be "connected" at all, they just happen to be adjacent, so I'm a bit hesitant to emphasize the line or equate it to a deletion line which we do know how to interpret (i.e. it is explicitly marked with a D operator).
RE 2) -- why would the coloring be common (why is it common to have mixed strands)? If you recall I asked about this after viewing the 1KG pacbio file you gave me for testing, I think the source was NCBI. We decided this was some issue with the file or process that produced it. The pacbio file I'm using from the Broad does not have this issue, its very clean in this regard. In any event if its common its not useful, rather than just coloring the connecting line I think there should be an option to turn it off. We could generalize "color by pair orientation" to include this, and you can switch it off as you do any other color option. The line is not always visible, e.g. with inversions the alignments are butted together, and I think that is too subtle.
RE 3) -- that should not happen, this is how the linking "readname" is constructed to link, and it is working correctly with my Illumina files. Are you sure you are using a recent version?
bc = a.getReadName();
if(a.isPaired()) {
bc += a.isFirstOfPair() ? "/1" : "/2";
}
Thank you for the responses, Jim. I think you are correct on all fronts.
RE 2) I still think it should be an option, perhaps on by default when you select linking. The option would appear with the other "color by" entries.
Hi, Jim. Would you consider adding a View supplementary alignment region in split screen option similar to View mate region in split screen? Or alternatively, a View / Split screen menu option? http://software.broadinstitute.org/software/igv/AlignmentData#splitscreen
I support @sjackman's proposal because for large-sized SV, the supplementary alignments might be located quite far away and it would be convenient to have such a feature.
Or on a different chromosome, or in my case with de novo assemblies, on a different scaffold.
Yes, good idea, so use the SA tag to bring up a view of each region, and maybe highlight the chosen read? I haven't verified exhaustively but I assume every alignment from a split read has the SA tag, including the "primary" one.
On Wed, May 31, 2017 at 12:25 PM, Shaun Jackman notifications@github.com wrote:
Or on a different chromosome, or in my case with de novo assemblies, on a different scaffold.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/igvteam/igv/issues/298#issuecomment-305291226, or mute the thread https://github.com/notifications/unsubscribe-auth/AA49HDZZJFzbJjJWN6-yrBcVxgh0UJSkks5r_b6lgaJpZM4JdMOG .
Great! Yes, correct.
SA:Z:(rname ,pos ,strand ,CIGAR ,mapQ ,NM ;)+ Other canonical alignments in a chimeric alignment, for- matted as a semicolon-delimited list. Each element in the list represents a part of the chimeric align- ment. Conventionally, at a supplementary line, the first element points to the primary line.
https://github.com/samtools/hts-specs/blob/master/SAMtags.pdf
@jrobinso @sjackman , just to clarify (both of you mostly likely already know this): the SA tags for all the chimeric alignments of a sequence are not the same. An alignment is not going to put itself into its SA tag.
In the mean time, I've been looking into complex SV's. And one of the signature of complex SV is that a long read (might be an locally assembled sequence) might produce a large collection of alignments (I've seen up to 10), but some of them have low mapping quality and short alignment length. It might worth thinking about how such small "noise" could be filtered.
EDIT
One example is here
Not just for chimeric alignments, but I've for a long time wished for a feature to filter alignments displayed by IGV by mapping quality (MAPQ), alignment score (AS:i
or AS:f
for Long Ranger), and number of mismatches (NM:i
). For example, AS ≥ 100 & NM < 5. Filtering by percent identity or percent query coverage could be useful too, but trickier since these don't usually have SAM tags.
@sjackman @amwenger @SHuang-Broad This is a long thread with many good suggestions. However its 2 years old. I'm trying to clear up open IGV issues. If any of the specific suggestions here are still relevant, I suspect they are, could you indicate it, or better open an issue for that suggestion and reference this issue.
@jprobinso, just noting that Shaun's suggestion of a "split screen by supplementary alignment" is also one I've been asking for, and would definitely be very useful! https://github.com/igvteam/igv/issues/298#issuecomment-305267693
@oneillkza I just opened an explicit issue for this. All the issues tagged "3rd gen" are up next for IGV development. As you may or may not know I also do the igv.js and juicebox.js development, so I rotate, IGV is up again next week and hope to release these by the end of the month.
Thanks @jprobinso, that's great news!
PS: I do still have it on my to-do list to get you a small example Nanopore bam file. I'm also a little snowed under right now!
@oneillkza no problem, were you the one that reported the "3rd gen preferences don't do anything"? there were 2 issues there, 1 is fixed (in snapshot build), and there is now an explicit menu to just set it.
Yep, that was me! Thanks -- I'll check out the snapshot. I also linked you a Nanopore bam for testing.
This issue is for general discussion of treatment supplementary alignments (bam tag 0x800 & SA tag). See also #277 for discussion of connecting supplementary alignments. @amwenger @pb-jchin @cwhelan @SHuang-Broad