Open oneillkza opened 5 years ago
I also would find this to be a useful feature. To help with development, a PacBio dataset with reads aligned to GRCh38/hg38 showing different types of structural variants is available at:
The structural variants are at:
The current IGV flags the structural variant breakpoints by coloring clipped ends as magenta. Having a more meaningful color coding based on the type of evidence would be great.
Let me know if you need any help with implementing this feature. A few suggestions to start:
Similar information to what is inferred from paired-end distances and orientations can be inferred from the relative orientation and distance (in reference and query coordinates) between alignments of adjacent segments of a long read (i.e. primary and supplementary alignments of adjacent portions of a read).
I prepared diagrams of how primary+supplementary alignments look at different types of structural variants. The signatures apply for both short and long reads, though are more common (and easier to interpret) with long reads. The draft diagrams are at https://bit.ly/2RmHnIZ.
Parse primary and supplementary alignments for the read from the alignment and its SA tag
Sort alignments by query (read) start coordinate
If this end is the "left" end of the read in query coordinates:
A := previous alignment in query coordinates (if any)
B := This
Else:
A := This
B := next alignment in query coordinates (if any)
# Clipped sequence does not align
If A or B does not exist:
Return default color (currently Magenta)
# Inter-chromosomal translocation
If A.chrom is different than B.chrom:
Return color used for the other chromosome
# Calculate distance between alignments in query and reference coordinates
QueryDist := B.QueryStart - A.QueryEnd
RefPosA := A.RefEnd if (A is on + strand) else A.RefStart
RefPosB := B.RefStart if (B is on + strand) else B.RefEnd
RefDist := (RefPosB - RefPosA) if (A is on + strand) else (RefPosA - RefPosB)
# Calculate overlap of alignments in reference coordinates
RefOverlap := max(0, min(A.RefEnd,B.RefEnd) - max(A.RefStart,B.RefStart))
# Alignments are very far apart on the same chromosome (intra-chromosomal translocation)
If RefDist > 1Mb:
Return color to indicate intra-chromosomal translocation
# Threshold for a "large" coordinate jump that is interesting to annotate.
T = 250bp
If A and B are on the same strand:
# Duplication: B aligns upstream of A
If RefDist < T:
Return the duplication color (currently Green)
# Deletion: B aligns downstream of A
ElIf RefDist > T:
Return the deletion color (currently Red)
# Insertion: B aligns next to A but has a jump in query coordinates
ElIf QueryDist > T:
Return the insertion color (currently Blue)
# Should not happen since alignment would just continue if there were no jump in query or reference coordinates
Else:
Return default color
Else: # A and B are on different strands
# INVDUP
If RefOverlap > T:
Return the inverted duplication color (does not exist in current color scheme but would be useful)
# INV
Else:
If A is on the + strand:
Return light blue, the color used for "LL pair orientation" in current scheme
Else:
Return dark blue, the color used for "RR pair orientation" in the current scheme
@oneillkza @amwenger I'm starting on this now, apologies for the delay. Implementation will likely be in steps. One thought I had, which I will try immediately, is automatically group alignments that have supplemental alignments when they are linked, and place that group at the top.
@amwenger Before interpreting supplementary alignments I need to order them by the read sequence (currently they are just grouped together by common read name). My thought is to do that based on soft clipping in the cigar string. All All but the first should have a left soft clip, and the size of the clip should increase as you progress down the read. Do you see any potential issues with this, or have another suggestion?
Thanks, @jrobinso.
One thought I had, which I will try immediately, is automatically group alignments that have supplemental alignments when they are linked, and place that group at the top.
I like that idea.
Before interpreting supplementary alignments I need to order them by the read sequence (currently they are just grouped together by common read name). My thought is to do that based on soft clipping in the cigar string. All but the first should have a left soft clip, and the size of the clip should increase as you progress down the read. Do you see any potential issues with this, or have another suggestion?
I agree that using clipping information in the CIGAR is the right approach to determining order of the aligned segments for a read. For details:
SA
tag, not from other alignments with the same read name in the current window. For large structural variants - and definitely for inter-chromosomal rearrangements - only some alignments for a read will be in a given window.+
strand, the left clipping provides the query/read start coordinate. For alignments to the -
strand, right clipping does. The opposite applies for query/read end coordinate: query end is provided by right clipping for +
strand and left clipping for -
strand.As an example with one read that shows an inversion:
$ paste <(bedtools bamtobed -i GRCh38.SA-example.HARDCLIP.bam | awk '{ print $1 ":" $2+1 "-" $3 "\t" $6; }') <(samtools view GRCh38.SA-example.HARDCLIP.bam | awk '{ print "CIGAR:"substr($6,0,20) "..." substr($6,length($6)-20); }') <(samtools view GRCh38.SA-example.HARDCLIP.bam | tr '\t' '\n' | grep '^SA:') | awk '{ print $1 "\t" $2; print " " $3; print " " $4; }'
chr1:197785807-197787658 +
CIGAR:10=1X19=1X375=1X20=1...30=1X112=1X211=11509H
SA:Z:chr1,197788855,+,3047S10314M2D,60,65;chr1,197787659,-,10312S1196M1853S,60,5;
chr1:197787659-197788854 -
CIGAR:10312H283=1X476=1X12...1X65=1X20=1X227=1853H
SA:Z:chr1,197788855,+,3047S10314M2D,60,65;chr1,197785807,+,1852M11509S,60,10;
chr1:197788855-197799170 +
CIGAR:3047S169=1X68=1X44=1...19=1I136=1X186=1X135=
SA:Z:chr1,197785807,+,1852M11509S,60,10;chr1,197787659,-,10312S1196M1853S,60,5;
The read has three alignments, two to the +
strand and one to -
. For each alignment, the other two are listed in the SA tag. The alignments are done with minimap2
using hard clipping. By convention, minimap2
writes a short version of the CIGAR string for supplementary alignments in the SA tag, providing only the clipping (indicated as soft clipping), the number of query bases aligned, and the number of reference bases spanned. Taking the first alignment and parsing it and its SA records:
chr1:197785807-197787658 +
CIGAR:10=1X19=1X375=1X20=1...30=1X112=1X211=11509H
SA:Z:chr1,197788855,+,3047S10314M2D,60,65;chr1,197787659,-,10312S1196M1853S,60,5;
# Read length is 13,361 bp from summing the query-consuming CIGAR operations.
#
# Alignment is to the "+" strand. Left clipping is 0. So, the read start coordinate is 0. Right clipping is 11,509; so, the read end coordinate = 13361-11509=1852.
# The first supplementary alignment listed also is to the "+" strand. Left clipping is 3047. So, the read start coordinate is 3047. Right clipping is 0; so, the read end coordinate is 13361-0=13361.
# The second supplementary alignment listed is to the "-" strand. Right clipping is 1853. So, the read start coordinate is 1853. Left clipping is 10312; so, the read end coordinate is 13361-10312=3049.
#
# So, the alignments are:
# NAME STRAND [QSTART QEND)
# Record + 0 1852
# First_SA + 3047 13361
# Second_SA - 1853 3049
@amwenger Thanks for the input. RE grouping by readname vs SA tag, for visual use the only alignments that matter are the ones in view, so there shouldn't be any difference. For interpretation yes the SA tag is better.
Do you have a pointer to a good spec or description of the SA tag? The SAM spec has very little to say about it.
RE grouping by readname vs SA tag, for visual use the only alignments that matter are the ones in view, so there shouldn't be any difference. For interpretation yes the SA tag is better.
That makes sense in general. Specifically for this feature - coloring alignment ends based on orientation of the adjacent aligned segment - I think it does not matter whether the adjacent alignment is within the view window, which is why I would rely on the SA tag. In my opinion, it is similar to how paired end reads are colored now - it does not matter whether the mate is also within the window.
good spec or description of the SA tag
I recommend the SAM Tags spec. Let me know if you need more information.
SA:Z:(rname,pos,strand,CIGAR,mapQ,NM;)
Other canonical alignments in a chimeric alignment, formatted as a semicolon-delimited list. Each element in the list represents a part of the chimeric alignment. Conventionally, at a supplementary line, the first element points to the primary line. Strand is either ‘+’ or ‘-’, indicating forward/reverse strand, corresponding to FLAG bit 0x10. Pos is a 1-based coordinate.
@amwenger Apologies for such a long delay, but I am picking this up with a goal to finish by the end of the year. Unfortunately the test bams you created, with examples of each SV type, are no longer available at the URLs below. Those would be most helpful. Do you still have them? These are the URLs I am referring to
BAM (97 MB): https://dl.dnanex.us/F/D/xzz9fGXBgx6Bqpbypf6KxqZqy5zg1k1VyKykVKg8/GRCh38.SV-examples.bam BAM index: https://dl.dnanex.us/F/D/K4qkpQyvF2k6YG2K8815jG266q3KB1kbKkbpjk2y/GRCh38.SV-examples.bam.bai
Thanks
Hi @jrobinso. I uploaded to a new location and updated the URLs above:
Thanks for working on the feature. Let me know when I could be useful to test / work on it.
Creating this ticket per the discussion in #632.
For calling SVs from long read sequence, these are mainly detected by reads having supplementary alignments. It would be very useful to be able to see at a glance which reads have supplementary alignments. Colouring reads by their direction where the supplementary alignment is opposite would also potentially be helpful (similar to "color alignments by mate orientation").
The ability to group reads by whether they have a supplementary alignment would also be very helpful. A stretch goal would be to group them by the edge of the read/alignment, so that they line up at the breakpoint, but I don't know how tricky that would be to implement.