oneillkza commented 5 years ago

Creating this ticket per the discussion in #632.

For calling SVs from long read sequence, these are mainly detected by reads having supplementary alignments. It would be very useful to be able to see at a glance which reads have supplementary alignments. Colouring reads by their direction where the supplementary alignment is opposite would also potentially be helpful (similar to "color alignments by mate orientation").

The ability to group reads by whether they have a supplementary alignment would also be very helpful. A stretch goal would be to group them by the edge of the read/alignment, so that they line up at the breakpoint, but I don't know how tricky that would be to implement.

amwenger commented 4 years ago

I also would find this to be a useful feature. To help with development, a PacBio dataset with reads aligned to GRCh38/hg38 showing different types of structural variants is available at:

BAM (97 MB): https://pbsharing.blob.core.windows.net/igv/GRCh38.SV-examples.bam
BAM index: https://pbsharing.blob.core.windows.net/igv/GRCh38.SV-examples.bam.bai

The structural variants are at:

45kb deletion chr1:72300640-72346156
8.5kb insertion chr1:248404195-248404195
10.3kb tandem duplication chr2:122526667-122536995
intra-chromosomal rearragement chr16:21583121-21583121
inter-chromosomal rearrangement (to chr8) chr17:78749731-78749731
1.1kb inversion chr1:197787659-197788854
18kb inverted duplication chr8:90804093-90822180

The current IGV flags the structural variant breakpoints by coloring clipped ends as magenta. Having a more meaningful color coding based on the type of evidence would be great.

Deletion

Insertion

INS

Tandem duplication

DUP

Intra-chromosomal rearrangement

Inter-chromosomal rearrangement

TRA

Inversion

Inverted duplication

INVDUP INVDUP-dotplot

amwenger commented 4 years ago

Let me know if you need any help with implementing this feature. A few suggestions to start:

Consider implementing this by extending the current "Flag clipping" feature into an "Annotate clipping" feature that colors the clipped ends based on how the end aligns relative to the clipped sequence.
I would color the clipped ends instead of entire alignments because: 1) long reads often map in more than two alignments and thus the color might be ambiguous (e.g. deletion on one end, inversion on another); 2) coloring the clipped ends visually indicates the specific location of the SV rather than the general location from coloring an entire read; and 3) it is less visually overwhelming.
Use the same scheme as for coloring paired-end short read alignments by "Insert size and orientation." Using the same colors for the same types of structural variants will show when short reads and long reads agree.

Similar information to what is inferred from paired-end distances and orientations can be inferred from the relative orientation and distance (in reference and query coordinates) between alignments of adjacent segments of a long read (i.e. primary and supplementary alignments of adjacent portions of a read).

Signatures of structural variation in supplementary alignments

I prepared diagrams of how primary+supplementary alignments look at different types of structural variants. The signatures apply for both short and long reads, though are more common (and easier to interpret) with long reads. The draft diagrams are at https://bit.ly/2RmHnIZ.

Deletion

DEL

Insertion

INS

Tandem Duplication

DUP

Inversion

INV

Inverted duplication

INVDUP

Pseudocode implementation of "Color by Supplementary Alignment" logic

Parse primary and supplementary alignments for the read from the alignment and its SA tag
Sort alignments by query (read) start coordinate

If this end is the "left" end of the read in query coordinates:
    A := previous alignment in query coordinates (if any)
    B := This
Else:
    A := This
    B := next alignment in query coordinates (if any)

# Clipped sequence does not align
If A or B does not exist:
    Return default color (currently Magenta)

# Inter-chromosomal translocation
If A.chrom is different than B.chrom:
    Return color used for the other chromosome

# Calculate distance between alignments in query and reference coordinates
QueryDist := B.QueryStart - A.QueryEnd
RefPosA := A.RefEnd if (A is on + strand) else A.RefStart
RefPosB := B.RefStart if (B is on + strand) else B.RefEnd
RefDist := (RefPosB - RefPosA) if (A is on + strand) else (RefPosA - RefPosB)

# Calculate overlap of alignments in reference coordinates
RefOverlap := max(0, min(A.RefEnd,B.RefEnd) - max(A.RefStart,B.RefStart))

# Alignments are very far apart on the same chromosome (intra-chromosomal translocation)
If RefDist > 1Mb:
    Return color to indicate intra-chromosomal translocation

# Threshold for a "large" coordinate jump that is interesting to annotate.
T = 250bp

If A and B are on the same strand:
    # Duplication: B aligns upstream of A
    If RefDist < T:
        Return the duplication color (currently Green)
    # Deletion: B aligns downstream of A
    ElIf RefDist > T:
        Return the deletion color (currently Red)
    # Insertion: B aligns next to A but has a jump in query coordinates
    ElIf QueryDist > T:
        Return the insertion color (currently Blue)
    # Should not happen since alignment would just continue if there were no jump in query or reference coordinates
    Else:
        Return default color
Else: # A and B are on different strands
    # INVDUP
    If RefOverlap > T:
        Return the inverted duplication color (does not exist in current color scheme but would be useful)
    # INV
    Else:
        If A is on the + strand:
            Return light blue, the color used for "LL pair orientation" in current scheme
        Else:
            Return dark blue, the color used for "RR pair orientation" in the current scheme

jrobinso commented 4 years ago

@oneillkza @amwenger I'm starting on this now, apologies for the delay. Implementation will likely be in steps. One thought I had, which I will try immediately, is automatically group alignments that have supplemental alignments when they are linked, and place that group at the top.

jrobinso commented 4 years ago

@amwenger Before interpreting supplementary alignments I need to order them by the read sequence (currently they are just grouped together by common read name). My thought is to do that based on soft clipping in the cigar string. All All but the first should have a left soft clip, and the size of the clip should increase as you progress down the read. Do you see any potential issues with this, or have another suggestion?

amwenger commented 4 years ago

Thanks, @jrobinso.

One thought I had, which I will try immediately, is automatically group alignments that have supplemental alignments when they are linked, and place that group at the top.

I like that idea.

Before interpreting supplementary alignments I need to order them by the read sequence (currently they are just grouped together by common read name). My thought is to do that based on soft clipping in the cigar string. All but the first should have a left soft clip, and the size of the clip should increase as you progress down the read. Do you see any potential issues with this, or have another suggestion?

I agree that using clipping information in the CIGAR is the right approach to determining order of the aligned segments for a read. For details:

I think supplementary alignments should be parsed from the SA tag, not from other alignments with the same read name in the current window. For large structural variants - and definitely for inter-chromosomal rearrangements - only some alignments for a read will be in a given window.
Both hard and soft clipping count towards the query/read start coordinate.
For alignments to the + strand, the left clipping provides the query/read start coordinate. For alignments to the - strand, right clipping does. The opposite applies for query/read end coordinate: query end is provided by right clipping for + strand and left clipping for - strand.

As an example with one read that shows an inversion:

$ paste <(bedtools bamtobed -i GRCh38.SA-example.HARDCLIP.bam | awk '{ print $1 ":" $2+1 "-" $3 "\t" $6; }') <(samtools view GRCh38.SA-example.HARDCLIP.bam | awk '{ print "CIGAR:"substr($6,0,20) "..." substr($6,length($6)-20); }') <(samtools view GRCh38.SA-example.HARDCLIP.bam | tr '\t' '\n' | grep '^SA:') | awk '{ print $1 "\t" $2; print "    " $3; print "    " $4; }'
chr1:197785807-197787658    +
    CIGAR:10=1X19=1X375=1X20=1...30=1X112=1X211=11509H
    SA:Z:chr1,197788855,+,3047S10314M2D,60,65;chr1,197787659,-,10312S1196M1853S,60,5;
chr1:197787659-197788854    -
    CIGAR:10312H283=1X476=1X12...1X65=1X20=1X227=1853H
    SA:Z:chr1,197788855,+,3047S10314M2D,60,65;chr1,197785807,+,1852M11509S,60,10;
chr1:197788855-197799170    +
    CIGAR:3047S169=1X68=1X44=1...19=1I136=1X186=1X135=
    SA:Z:chr1,197785807,+,1852M11509S,60,10;chr1,197787659,-,10312S1196M1853S,60,5;

The read has three alignments, two to the + strand and one to -. For each alignment, the other two are listed in the SA tag. The alignments are done with minimap2 using hard clipping. By convention, minimap2 writes a short version of the CIGAR string for supplementary alignments in the SA tag, providing only the clipping (indicated as soft clipping), the number of query bases aligned, and the number of reference bases spanned. Taking the first alignment and parsing it and its SA records:

chr1:197785807-197787658    +
    CIGAR:10=1X19=1X375=1X20=1...30=1X112=1X211=11509H
    SA:Z:chr1,197788855,+,3047S10314M2D,60,65;chr1,197787659,-,10312S1196M1853S,60,5;

# Read length is 13,361 bp from summing the query-consuming CIGAR operations.
#
# Alignment is to the "+" strand.  Left clipping is 0.  So, the read start coordinate is 0.  Right clipping is 11,509; so, the read end coordinate = 13361-11509=1852.
# The first supplementary alignment listed also is to the "+" strand.  Left clipping is 3047.  So, the read start coordinate is 3047.  Right clipping is 0; so, the read end coordinate is 13361-0=13361.
# The second supplementary alignment listed is to the "-" strand.  Right clipping is 1853.  So, the read start coordinate is 1853.  Left clipping is 10312; so, the read end coordinate is 13361-10312=3049.
#
# So, the alignments are:
# NAME       STRAND  [QSTART  QEND)
# Record     +       0       1852
# First_SA   +       3047    13361
# Second_SA  -       1853    3049

jrobinso commented 4 years ago

@amwenger Thanks for the input. RE grouping by readname vs SA tag, for visual use the only alignments that matter are the ones in view, so there shouldn't be any difference. For interpretation yes the SA tag is better.

Do you have a pointer to a good spec or description of the SA tag? The SAM spec has very little to say about it.

amwenger commented 4 years ago

RE grouping by readname vs SA tag, for visual use the only alignments that matter are the ones in view, so there shouldn't be any difference. For interpretation yes the SA tag is better.

That makes sense in general. Specifically for this feature - coloring alignment ends based on orientation of the adjacent aligned segment - I think it does not matter whether the adjacent alignment is within the view window, which is why I would rely on the SA tag. In my opinion, it is similar to how paired end reads are colored now - it does not matter whether the mate is also within the window.

good spec or description of the SA tag

I recommend the SAM Tags spec. Let me know if you need more information.

SA:Z:(rname,pos,strand,CIGAR,mapQ,NM;)

Other canonical alignments in a chimeric alignment, formatted as a semicolon-delimited list. Each element in the list represents a part of the chimeric alignment. Conventionally, at a supplementary line, the first element points to the primary line. Strand is either ‘+’ or ‘-’, indicating forward/reverse strand, corresponding to FLAG bit 0x10. Pos is a 1-based coordinate.

jrobinso commented 3 years ago

@amwenger Apologies for such a long delay, but I am picking this up with a goal to finish by the end of the year. Unfortunately the test bams you created, with examples of each SV type, are no longer available at the URLs below. Those would be most helpful. Do you still have them? These are the URLs I am referring to

BAM (97 MB): https://dl.dnanex.us/F/D/xzz9fGXBgx6Bqpbypf6KxqZqy5zg1k1VyKykVKg8/GRCh38.SV-examples.bam BAM index: https://dl.dnanex.us/F/D/K4qkpQyvF2k6YG2K8815jG266q3KB1kbKkbpjk2y/GRCh38.SV-examples.bam.bai

Thanks

amwenger commented 3 years ago

Hi @jrobinso. I uploaded to a new location and updated the URLs above:

BAM (97 MB): https://pbsharing.blob.core.windows.net/igv/GRCh38.SV-examples.bam
BAM index: https://pbsharing.blob.core.windows.net/igv/GRCh38.SV-examples.bam.bai

Thanks for working on the feature. Let me know when I could be useful to test / work on it.

igvteam / igv

[Feature request] please include functionality for colouring reads with supplementary alignments #731

Deletion

Insertion

Tandem duplication

Intra-chromosomal rearrangement

Inter-chromosomal rearrangement

Inversion

Inverted duplication

Signatures of structural variation in supplementary alignments

Deletion

Insertion

Tandem Duplication

Inversion

Inverted duplication

Pseudocode implementation of "Color by Supplementary Alignment" logic