igvteam / igv

Integrative Genomics Viewer. Fast, efficient, scalable visualization tool for genomics data and annotations
https://igv.org
MIT License
629 stars 378 forks source link

[Feature request] please include functionality for colouring reads with supplementary alignments #731

Open oneillkza opened 4 years ago

oneillkza commented 4 years ago

Creating this ticket per the discussion in #632.

For calling SVs from long read sequence, these are mainly detected by reads having supplementary alignments. It would be very useful to be able to see at a glance which reads have supplementary alignments. Colouring reads by their direction where the supplementary alignment is opposite would also potentially be helpful (similar to "color alignments by mate orientation").

The ability to group reads by whether they have a supplementary alignment would also be very helpful. A stretch goal would be to group them by the edge of the read/alignment, so that they line up at the breakpoint, but I don't know how tricky that would be to implement.

amwenger commented 4 years ago

I also would find this to be a useful feature. To help with development, a PacBio dataset with reads aligned to GRCh38/hg38 showing different types of structural variants is available at:

The structural variants are at:

The current IGV flags the structural variant breakpoints by coloring clipped ends as magenta. Having a more meaningful color coding based on the type of evidence would be great.

Deletion

DEL

Insertion

INS

Tandem duplication

DUP

Intra-chromosomal rearrangement

image

Inter-chromosomal rearrangement

TRA

Inversion

image

Inverted duplication

INVDUP INVDUP-dotplot

amwenger commented 4 years ago

Let me know if you need any help with implementing this feature. A few suggestions to start:

Similar information to what is inferred from paired-end distances and orientations can be inferred from the relative orientation and distance (in reference and query coordinates) between alignments of adjacent segments of a long read (i.e. primary and supplementary alignments of adjacent portions of a read).

Signatures of structural variation in supplementary alignments

I prepared diagrams of how primary+supplementary alignments look at different types of structural variants. The signatures apply for both short and long reads, though are more common (and easier to interpret) with long reads. The draft diagrams are at https://bit.ly/2RmHnIZ.

Deletion

DEL

Insertion

INS

Tandem Duplication

DUP

Inversion

INV

Inverted duplication

INVDUP

Pseudocode implementation of "Color by Supplementary Alignment" logic

Parse primary and supplementary alignments for the read from the alignment and its SA tag
Sort alignments by query (read) start coordinate

If this end is the "left" end of the read in query coordinates:
    A := previous alignment in query coordinates (if any)
    B := This
Else:
    A := This
    B := next alignment in query coordinates (if any)

# Clipped sequence does not align
If A or B does not exist:
    Return default color (currently Magenta)

# Inter-chromosomal translocation
If A.chrom is different than B.chrom:
    Return color used for the other chromosome

# Calculate distance between alignments in query and reference coordinates
QueryDist := B.QueryStart - A.QueryEnd
RefPosA := A.RefEnd if (A is on + strand) else A.RefStart
RefPosB := B.RefStart if (B is on + strand) else B.RefEnd
RefDist := (RefPosB - RefPosA) if (A is on + strand) else (RefPosA - RefPosB)

# Calculate overlap of alignments in reference coordinates
RefOverlap := max(0, min(A.RefEnd,B.RefEnd) - max(A.RefStart,B.RefStart))

# Alignments are very far apart on the same chromosome (intra-chromosomal translocation)
If RefDist > 1Mb:
    Return color to indicate intra-chromosomal translocation

# Threshold for a "large" coordinate jump that is interesting to annotate.
T = 250bp

If A and B are on the same strand:
    # Duplication: B aligns upstream of A
    If RefDist < T:
        Return the duplication color (currently Green)
    # Deletion: B aligns downstream of A
    ElIf RefDist > T:
        Return the deletion color (currently Red)
    # Insertion: B aligns next to A but has a jump in query coordinates
    ElIf QueryDist > T:
        Return the insertion color (currently Blue)
    # Should not happen since alignment would just continue if there were no jump in query or reference coordinates
    Else:
        Return default color
Else: # A and B are on different strands
    # INVDUP
    If RefOverlap > T:
        Return the inverted duplication color (does not exist in current color scheme but would be useful)
    # INV
    Else:
        If A is on the + strand:
            Return light blue, the color used for "LL pair orientation" in current scheme
        Else:
            Return dark blue, the color used for "RR pair orientation" in the current scheme
jrobinso commented 4 years ago

@oneillkza @amwenger I'm starting on this now, apologies for the delay. Implementation will likely be in steps. One thought I had, which I will try immediately, is automatically group alignments that have supplemental alignments when they are linked, and place that group at the top.

jrobinso commented 4 years ago

@amwenger Before interpreting supplementary alignments I need to order them by the read sequence (currently they are just grouped together by common read name). My thought is to do that based on soft clipping in the cigar string. All All but the first should have a left soft clip, and the size of the clip should increase as you progress down the read. Do you see any potential issues with this, or have another suggestion?

amwenger commented 4 years ago

Thanks, @jrobinso.

One thought I had, which I will try immediately, is automatically group alignments that have supplemental alignments when they are linked, and place that group at the top.

I like that idea.

Before interpreting supplementary alignments I need to order them by the read sequence (currently they are just grouped together by common read name). My thought is to do that based on soft clipping in the cigar string. All but the first should have a left soft clip, and the size of the clip should increase as you progress down the read. Do you see any potential issues with this, or have another suggestion?

I agree that using clipping information in the CIGAR is the right approach to determining order of the aligned segments for a read. For details:

As an example with one read that shows an inversion:

$ paste <(bedtools bamtobed -i GRCh38.SA-example.HARDCLIP.bam | awk '{ print $1 ":" $2+1 "-" $3 "\t" $6; }') <(samtools view GRCh38.SA-example.HARDCLIP.bam | awk '{ print "CIGAR:"substr($6,0,20) "..." substr($6,length($6)-20); }') <(samtools view GRCh38.SA-example.HARDCLIP.bam | tr '\t' '\n' | grep '^SA:') | awk '{ print $1 "\t" $2; print "    " $3; print "    " $4; }'
chr1:197785807-197787658    +
    CIGAR:10=1X19=1X375=1X20=1...30=1X112=1X211=11509H
    SA:Z:chr1,197788855,+,3047S10314M2D,60,65;chr1,197787659,-,10312S1196M1853S,60,5;
chr1:197787659-197788854    -
    CIGAR:10312H283=1X476=1X12...1X65=1X20=1X227=1853H
    SA:Z:chr1,197788855,+,3047S10314M2D,60,65;chr1,197785807,+,1852M11509S,60,10;
chr1:197788855-197799170    +
    CIGAR:3047S169=1X68=1X44=1...19=1I136=1X186=1X135=
    SA:Z:chr1,197785807,+,1852M11509S,60,10;chr1,197787659,-,10312S1196M1853S,60,5;

The read has three alignments, two to the + strand and one to -. For each alignment, the other two are listed in the SA tag. The alignments are done with minimap2 using hard clipping. By convention, minimap2 writes a short version of the CIGAR string for supplementary alignments in the SA tag, providing only the clipping (indicated as soft clipping), the number of query bases aligned, and the number of reference bases spanned. Taking the first alignment and parsing it and its SA records:

chr1:197785807-197787658    +
    CIGAR:10=1X19=1X375=1X20=1...30=1X112=1X211=11509H
    SA:Z:chr1,197788855,+,3047S10314M2D,60,65;chr1,197787659,-,10312S1196M1853S,60,5;

# Read length is 13,361 bp from summing the query-consuming CIGAR operations.
#
# Alignment is to the "+" strand.  Left clipping is 0.  So, the read start coordinate is 0.  Right clipping is 11,509; so, the read end coordinate = 13361-11509=1852.
# The first supplementary alignment listed also is to the "+" strand.  Left clipping is 3047.  So, the read start coordinate is 3047.  Right clipping is 0; so, the read end coordinate is 13361-0=13361.
# The second supplementary alignment listed is to the "-" strand.  Right clipping is 1853.  So, the read start coordinate is 1853.  Left clipping is 10312; so, the read end coordinate is 13361-10312=3049.
#
# So, the alignments are:
# NAME       STRAND  [QSTART  QEND)
# Record     +       0       1852
# First_SA   +       3047    13361
# Second_SA  -       1853    3049
jrobinso commented 4 years ago

@amwenger Thanks for the input. RE grouping by readname vs SA tag, for visual use the only alignments that matter are the ones in view, so there shouldn't be any difference. For interpretation yes the SA tag is better.

Do you have a pointer to a good spec or description of the SA tag? The SAM spec has very little to say about it.

amwenger commented 4 years ago

RE grouping by readname vs SA tag, for visual use the only alignments that matter are the ones in view, so there shouldn't be any difference. For interpretation yes the SA tag is better.

That makes sense in general. Specifically for this feature - coloring alignment ends based on orientation of the adjacent aligned segment - I think it does not matter whether the adjacent alignment is within the view window, which is why I would rely on the SA tag. In my opinion, it is similar to how paired end reads are colored now - it does not matter whether the mate is also within the window.

good spec or description of the SA tag

I recommend the SAM Tags spec. Let me know if you need more information.

SA:Z:(rname,pos,strand,CIGAR,mapQ,NM;)

Other canonical alignments in a chimeric alignment, formatted as a semicolon-delimited list. Each element in the list represents a part of the chimeric alignment. Conventionally, at a supplementary line, the first element points to the primary line. Strand is either ‘+’ or ‘-’, indicating forward/reverse strand, corresponding to FLAG bit 0x10. Pos is a 1-based coordinate.

jrobinso commented 2 years ago

@amwenger Apologies for such a long delay, but I am picking this up with a goal to finish by the end of the year. Unfortunately the test bams you created, with examples of each SV type, are no longer available at the URLs below. Those would be most helpful. Do you still have them? These are the URLs I am referring to

BAM (97 MB): https://dl.dnanex.us/F/D/xzz9fGXBgx6Bqpbypf6KxqZqy5zg1k1VyKykVKg8/GRCh38.SV-examples.bam BAM index: https://dl.dnanex.us/F/D/K4qkpQyvF2k6YG2K8815jG266q3KB1kbKkbpjk2y/GRCh38.SV-examples.bam.bai

Thanks

amwenger commented 2 years ago

Hi @jrobinso. I uploaded to a new location and updated the URLs above:

Thanks for working on the feature. Let me know when I could be useful to test / work on it.