virajbdeshpande / AmpliconArchitect

AmpliconArchitect (AA) is a tool to identify one or more connected genomic regions which have simultaneous copy number amplification and elucidates the architecture of the amplicon. In the current version, AA takes as input next generation sequencing reads (paired-end Illumina reads) mapped to the hg19/GRCh37 reference sequence and one or more regions of interest. Please "watch" this repository for improvements in runtime, accuracy and annotations for GRCh38 human reference genome coming up soon.
Other
134 stars 42 forks source link

cycle output #9

Closed wisekh6 closed 7 years ago

wisekh6 commented 7 years ago

Hello,

Sorry for asking many questions. I am very interested in applying your tool to my work, hoping to better understand your tool's output.

I am looking at a "cycles.txt" file which (I believe) may help to reconstruct an ordered list of segments. Shown below are three cycles in my "cycles.txt" file. Q1. What does "-" sign represent in the below cycles? I guess it may represent a reverse complement of the segment. Q2. Cycle 2 has only one segment. Does this mean that 5' and 3' of the segment is likely to be connected? Q3. Does my cycles.txt suggest there would be three circular forms?

# Outputs in my cycles.txt ### Cycle=1;Copy_count=34.2273186344;Segments=3+,2+,7-,8- Cycle=2;Copy_count=2.63516498752;Segments=4+ Cycle=3;Copy_count=2.45192540644;Segments=0+,9-,5+,0-

Thank you,

virajbdeshpande commented 7 years ago

1) That is right, "-" means reverse complement. 2) That is right, the 5' and 3' ends are connected 3) The idea behind AA's cycle representation is that NGS is not always sufficient to determine the entire structure and that there are multiple possible structures, one or more of these can be present in the genome. So if two cycles produced by AA have any overlapping segments, then can be merged to form a larger cycle by switching the path traversal across the overlap.

For example, if segment "4" from cycle 2 overlaps a segment from cycle 1, then cycles 1 and 2 can be combined to represent a larger cycle such that the overlapping part is repeated. But note that their copy numbers are 34 and 2.6. So in case of combining, it could mean that only a small fraction of the amplicons had some heterogeneity consisting of the combined structure. You can visit genomequery.ucsd.edu:8800, upload the cycles file and visualize and interactively combine the structures. Cycles which contain a "0" are actually connected to a "source vertex", meaning either that they have unknown segments, or that these are linear structures in the chromosomes.

In this particular case, since cycle 1 has such a large copy number as compared to the other cycles, I would rather only consider cycle 1. The other cycles can either be artifacts of noisy data or they can be very low copy structures that are not really amplified.

virajbdeshpande commented 7 years ago

In the spirit of Github, questions are definitely encouraged. These will help to create a good documentation of the tool. If it is alright, I will adapt these questions into FAQs section.

wisekh6 commented 7 years ago

Thank you very much for your quick response.

By the way, would you answer my questions about what "Sequence edge" and "Breakpoint edge", respectively represent? In other words, Q1. I believe that "Sequence edge" represents "segmented copy numbers" predicted by your mean field approach. Is it correct? Q2. "Breakpoint edge" may represent edges connecting the above segmented copy numbers. Is my understand correct? Q3. The last columns in the "Sequence edge" and "Breakpoint edge" section do not have a corresponding column name. what do the last columns represent?

image

virajbdeshpande commented 7 years ago

1) The sequence edge refers to the actual genomic interval, so the sequence edge 5:116210033- 5:116299741+ represents the genomic interval of size 89708 base pairs. The PredictedCopyCount is assigned by an optimization procedure on the breakpoint graph and not the meanshift approach. The meanshift approach is only used as an intermediate step and for displaying the graphic.

2) Right, breakpoint edges represent connections between end points of segments.

3) Thanks for point this out, last column for sequence edges shows number of reads mapping to the sequence edge. Last two columns of breakpoint edge encode information for the homology if AA found any split reads. Homology refers to a repeated sequence that is common to both the end points of the connected sequence edges, but only appears once in the amplicon. The 2nd last column represents the size of homology (negative if insertion), the last column represents the sequence of the homology or insertion.

wisekh6 commented 7 years ago

Sorry for my ignorance, and I have additional follow-up questions.

The sequence edge refers to the actual genomic interval, so the sequence edge 5:116210033- 5:116299741+ represents the genomic interval of size 89708 base pairs.

So sequences are indicated in "amplicon figure" with the horizontal lines located on corresponding coverage (Y axis). Correct?

Right, breakpoint edges represent connections between end points of segments.

These edge are indicated with the curved lines (whose thickness may represent # of supporting reads) in "amplicon figure". Correct? What do different colors represent? I guess they are different "cycles"?

Thank you, and sorry for my questions during weekend.

virajbdeshpande commented 7 years ago

There are three levels of analysis:

  1. Low level analysis: Looks at coverage histogram, segmentation of the histogram, and breakpoint edges. The png/pdf figure produces by AA shows these three features.

    • The horizontal bars show the meanshift segmentation of the coverage histogram.
    • The colored arcs or vertical lines show breakpoint edges. The color of the arcs represents the strands of the edge. Red : length discordant (e.g. chr1:1000+->chr1:2000-), brown : everted (chr1:1000-->chr1:2000+), magenta : forward (chr1:1000+->chr1:2000+), teal : reverse (chr1:1000-->chr1:2000-). Vertical edges are blue if they connect to a different chromosome. Thickness simply shows the number of read pairs supporting each arc.
  2. Breakpoint graph: Consisting of breakpoint edges and sequence edges. Breakpoint edges are same as in (1) above. However, sequence edges are produced by a combination of the coverage segmentation and endpoints of breakpoints. We assign copy counts to breakpoint edges and sequence edges here. However, this information is not shown in any figures as it is considered an intermediate step. As you can see, the number of sequence edges can be much larger than the number of meanshift segments and it can be difficult to visualize them individually.

  3. Cycles: The graph is decomposed into cycles in the cycle file. This is the final output. You can upload the png file from (1) and the cycles file from (3) to genomequery.ucsd.edu:8800 to visualize the cycles in the context of the low-level data.

In general, step 1 is most reliable and step 3 is least reliable because each successive step depends on the output of the previous step.

wisekh6 commented 7 years ago

Sorry. I still have several things that I haven't fully understood regarding your output.

The sequence edge refers to the actual genomic interval, so the sequence edge 5:116210033- 5:116299741+ represents the genomic interval of size 89708 base pairs.

Would you briefly explain about what "+" and "-" mean in the sequence edge? I thought that the sequence edge (5:116210033- 5:116299741+) starts from 5' of 5:116210033 and ends at 3' of 5:116299741. Is this correct?

Thank you,

virajbdeshpande commented 7 years ago

Yes, you are correct. The '+' and '-' just reflects the naming convention of the breakpoint vertices used for the breakpoint edges. For a sequence edge, the 5' end will always have a '-' and 3' end will always have a '+'.

wisekh6 commented 7 years ago

Hi Viraj,

I have another question. I am looking at a "cycles.txt" file. I think that the cycles forming a "true" circle with a high level of copy count are likely to be a real candidate for ecDNA. "True" means the cycle does not have a 0+ or 0-. Do you think my interpretation is reasonable?

Thanks,

wisekh6 commented 7 years ago

Sorry that I still have more questions in addition to the above.

Q1:

Low level analysis: Looks at coverage histogram, segmentation of the histogram, and breakpoint >edges. The png/pdf figure produces by AA shows these three features. The horizontal bars show the meanshift segmentation of the coverage histogram.

Which file contains the horizontal bar locations and their meanshift segmentation values? I guess that start/end positions and PredictedCopyCount of individual sequences in "Sequence edge" of the "graph.txt" correspond to individual horizontal bar locations and their meanshift segmentation values. Is it correct?

Q2: Are all locations based on 1-based coordinate system?

Q3: Would you give me a brief explanation about PredictedCopyCount and AverageCoverage in "graph.txt", and Copy_count in "cycles.txt"?

By the way, this tool works very well in my data set; it correctly re-detected a majority of my previous findings.

virajbdeshpande commented 7 years ago

Hi, Sorry I was a bit occupied. I will respond tomorrow.

wisekh6 commented 7 years ago

Thank you.

wisekh6 commented 7 years ago

I have further questions.

  1. What is a best way of interpreting a linear list of amplicons?
  2. Can we estimate a fraction of circular amplicons relative to non-circular amplicons within a region?

Thank you in advance,

virajbdeshpande commented 7 years ago

Hi, Sorry for the slow reply.

Under perfect conditions, one would expect ecDNA to form circular cycles. But the answer is more nuanced and will require a more structured analysis. Here's why:

  1. Missing or false edges due to repeats or noisy sequence data can break the circles.
  2. If the amplicon has a repeat, AA flexibly separates the circles into 2 or more smaller circles. These may be merged through the web interface in different orders to obtain one or more larger circles representing possible candidate structures.
  3. If the amplicon consists of multiple similar structures which differ slightly from each other and have varying copy numbers, such a signal needs to be evaluated carefully.
  4. Tandem repeats on the chromosomes can also result in circular structures.

In the current commit, I do report the information for horizontal bars, but I will have this in the next commit. The next commit also has some improvements in performance. I will push it this weekend.

I am reporting the coordinates using pysam which uses a 0-based coordinate system. But I have not verified this independently. A simple test is that if you find a discordant edge with split reads, you can look at the sequence of the split read and compare its location on the reference.

I have an optimization routine which predicts copy counts for edges in the breakpoint graph as a final step. (Note, these counts may differ slightly from the horizontal bars reported by meanshift in coverage). The PredictedCopyCount reports these estimates and these are used to decompose the graph into cycles. The copy count of the cycles may not add up to the PredictedCopyCount, because I am not reporting cycles with low copy numbers. AverageCoverage is an estimate for the average number of reads mapping per base pair for that sequence edge.

Copy counts for cycles are simply the predicted copy numbers for each cycle.

A linear amplicon which contains a 0 can be interpreted as an amplified region where we do report where the end points are connected. These may be connected within the amplicon or at other locations in the genome. For example, if a circular ecDNA contains a highly repetitive region, bwa will not map any reads to the highly repetitive region, and AA might report a linear amplicon where the end points are the out ends of the highly repetitive region.

Estimating fraction of circular amplicons to non-circular within a region is tricky. For example, an HSR with tandemly duplicated repeats will appear as circular from sequencing data.

I think the takeaways here are that even though we can push the limit of exploiting the sequencing data, there are inherent limitations and there are open questions about how can you make sense of it in a systematic fashion.

wisekh6 commented 7 years ago

Thank you so much for your comprehensive explanations while you would be busy in your work.

My comments are (, although you may not agree with mine):

I am not reporting cycles with low copy numbers

I believe such cycles with low copy numbers will be informative as well. It would be great if a new version of your tool comes with an option about whether or not having such cycles.

Estimating fraction of circular amplicons to non-circular within a region is tricky. For example, an HSR with tandemly duplicated repeats will appear as circular from sequencing data.

I agree it could be misleading in some risky regions, but careful interpretation of those fractions will be also very helpful in biological aspects.

If you don't mind, I would open this ticket until I don't have any question about the output.

Thanks a lot,