virajbdeshpande / AmpliconArchitect

AmpliconArchitect (AA) is a tool to identify one or more connected genomic regions which have simultaneous copy number amplification and elucidates the architecture of the amplicon. In the current version, AA takes as input next generation sequencing reads (paired-end Illumina reads) mapped to the hg19/GRCh37 reference sequence and one or more regions of interest. Please "watch" this repository for improvements in runtime, accuracy and annotations for GRCh38 human reference genome coming up soon.
Other
131 stars 41 forks source link

Run stalled #21

Open asalcedo31 opened 5 years ago

asalcedo31 commented 5 years ago

Hello,

I've been trying to run amplicon architect for several days now but the run reaches a certain point and does not progress further without giving any errors or terminating. I am running it with these settings:

ref: hg19 downsample 0 sensitivems True

I get the following warnings:

/oicr/local/boutroslab/sw/Python-BL/2.7.13/lib/python2.7/site-packages/numpy/core/fromnumeric.py:3194: RuntimeWarning: Degrees of freedom <= 0 for slice **kwargs) /oicr/local/boutroslab/sw/Python-BL/2.7.13/lib/python2.7/site-packages/numpy/core/_methods.py:127: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount)

The last line on the log is: discordant edges: fetch discordant chr2 33031296 33241971 15 2018 103552

It has been five days since anything was written to the log and I don't see any other signs of progress. I tried repeating the run and it stalls at the same point. Do you know what might be the problem or how I might be able to investigate it further?

Thank you!

virajbdeshpande commented 5 years ago

Hello asalcedo31,

Thank you for raising the issue. It will help me improve the software and documentation.

I suspect this is a performance issue. The warning is not the source of the problem. Here are a few things to try. 1) Is this a sample where you are trying to detect a viral integration? If not, I will strongly recommend the default setting for sensitivems=False. If yes, make sure the seed interval consists only of the viral genome and nothing else. 2) Do you have a rough estimate of the copy number of the amplicon you are trying to reconstruct? If so, you can set downsample to 10/CN. 3) I will release a new version of AA in the next 2-3 days which includes improvement in performance. 4) If you have a prior notion of the set of intervals within each amplicon, you can split the bed file into multiple files and run AA on each amplicon in parallel with --extendmode CLUSTERED. 5) Do you at least see the output messages like: "Loading libraries", "Initiating bam_to_breakpoint object", "Exploring interval"? If not, then, when you kill AA, at what line in the source code does it exit?

It will be great to hear from you and I will be happy to answer follow-up questions. Best, Viraj

asalcedo31 commented 5 years ago

Hi Viraj,

Many thanks for your suggestions. I was able to run the latest version with just intervals covering the focal amplification and it finished successfully. It has not yet finished running copy number segments across the whole genome but is still making progress after ~12 hours. However, I am not able to reach the web interface to visualize the outputs, going to genomequery.ucsd.edu:8800 on my browser says "genomequery.ucsd.edu refused to connect." .

Thanks again,

Adriana

virajbdeshpande commented 5 years ago

Hello Adriana,

You should be able to access the web interface now, but I working on a more permanent configuration of the server.

It is fantastic you were able to run the latest version. It should not make a difference from the user perspective, but are you running AA using docker or from the source code like before?

Can you clarify "not yet finished running copy number segments across the whole genome"? Is this the ReadDepth segmentation or are you providing the whole genome bed file to AA? If it is the second, then AA is only designed to look at relatively small intervals on the genome (up to 10Mbp) and not perform segmentation of the whole genome.

asalcedo31 commented 5 years ago

Hi Viraj,

Thank you for getting back to me so quickly. I am running from the source code (I just did a "git pull") and its working well. Sorry for the unclear comment, I was trying to run it with a bed file containing CN calls from across the whole genome, not just the focal amplification. I will try to see if I can get it to complete when I provide just the focal amplification but use -extendmode explore. Also I'm not totally clear on how to interpret the output, I'm specifically interested in seeing if there is evidence that my amplicon is circular ecDNA (eg a double minute). What would you consider strong evidence of that? I see some edges connecting the beginning and end of my amplicon with good read support. Is that sufficient?

virajbdeshpande commented 5 years ago

If you see the edge going from the end of the amplicon to the beginning and the intermediate segment appears roughly intact (or at most some deletions) that should be evidence of a simple circular amplicon. As a further check, when you visualize the cycles files, do you see a closed cycle (as opposed to an open path) with high copy number that connects the beginning of the amplicon to the end of the amplicon?

virajbdeshpande commented 5 years ago

Generally speaking, if it is a high copy focal amplification, even if AA fails to reconstruct a complete cycle, it is highly likely to be of ecDNA origin. By ecDNA origin, I mean, it could be either an ecDNA or HSR which was formed by reintegration of ecDNA into the chromosome.

alvinwt commented 4 years ago

Hi @virajbdeshpande I met the same issue with the run time stalling. Is there a big difference in the output, if I run amplicon architect in CLUSTERED mode and the usual full mode? I tried the different modes for one or 2 segments and in the full mode is see a lot of links to hs37d5, but the clustered mode might miss some interchromosomal events. I am working on some genomes with complex rearrangement and I will be glad to get your help.

virajbdeshpande commented 4 years ago

Hi Alvin,

Thanks for trying AA. Your interpretation is correct.

In the clustered mode, AA treats all seed intervals in the input bed file as part of the same amplicon. If the list of seed intervals consists of multiple amplicons, you will simple get a graph with more than 1 connected component.

There is one additional thing the EXPLORE mode does, it is that it searches for new intervals in addition to the seeds provided by you (occasionally detecting some false positive intervals). If you run in CLUSTERED mode and miss some intervals, AA should still be able to reconstruct the partial structure of the amplicon.

In general, most amplicons should be single interval amplicons and CNV calling tools should predict the large intervals of the multi-interval amplicons.

I will add some parallelization and optimizations to the explore mode eventually. But if you want, you can try the following hack to parallelize the computation: 1) Run AA in EXPLORE mode separately on each seed (split your bed file into multiple bedfiles with 1 seed in each file). Let it run until the exploration stage is done (or run in SVVIEW mode). 2) Manually look at the extended set of intervals computed by AA for each seed and compute the connected component of all intervals from all seeds. 3) Create new bed files with each connected component in a separate bed file. 4) Run AA in FULL + CLUSTERED mode on each connected component.

LiuSXin commented 4 years ago

Hello Viraj, Recently I ran amplicon architect and come across exactly the same issue like @asalcedo31 . I used the latest version of AA from docker image and tried separating the bed files and running in EXPLORE mode, but neither of them worked. The procedure stalled with the last several lines of the log file like : DEBUG:root:#TIME 2653.620 discordant edges: fetch discordant chr2 33040880 33251637 3 118791 2113 DEBUG:root:#TIME 2653.713 discordant edges: fetch discordant chr2 33040880 33251637 3 118747 2080 The arguments of AA I ran is : $AA --bam ${bam} --bed ${bed} --out ${out}/${i} --ref hg19 --downsample 0 --extendmode EXPLORE I don't know how @asalcedo31 solved this problem and I am anxious to get your help. Best wishes, Shangxin

virajbdeshpande commented 4 years ago

Hello Shangxin, Can you try this branch and let me know if this shows an improvement in runtime?

Best, Viraj

alvinwt commented 4 years ago

@virajbdeshpande Which is the branch that would be a fix? I want to try this out as there are some stalled runs for my project as well.

virajbdeshpande commented 4 years ago

@alvinwt, thanks for pointing this out. I forgot to include the link in my previous message. Here: https://github.com/virajbdeshpande/AmpliconArchitect/tree/runtime_discordant_clustering

LiuSXin commented 4 years ago

Hello viraj, I tried your new branch but it still didn't work. So far the log file was ended as : DEBUG:root:#TIME 2683.294 discordant edges: fetch discordant chr2 33030880 33040880 2 3 DEBUG:root:#TIME 2683.295 discordant edges: discordant filter neighborhood chr2 33030880 33040880 2 3 DEBUG:root:#TIME 2683.295 discordant edges: merge and sort discordant chr2 33030880 33040880 0 0 DEBUG:root:#TIME 2683.295 discordant edges: local edges done chr2 33030880 33040880 0 0 0 DEBUG:root:#TIME 2683.295 discordant edges: external edges done chr2 33030880 33040880 0 0 DEBUG:root:#TIME 2683.428 discordant edges chr2 33251638 33261637 DEBUG:root:#TIME 2683.428 discordant edges chr2 33030880 33040879 DEBUG:root:#TIME 2748.064 discordant edges chr2 33040880 33251637 3 DEBUG:root:#TIME 2810.526 discordant edges: fetch discordant chr2 33040880 33251637 3 118791 2113 DEBUG:root:#TIME 2810.617 discordant edges: fetch discordant chr2 33040880 33251637 3 118747 2080 And now it is stalled. Do you know how to solve this? Thanks for your reply.

LiuSXin commented 4 years ago

Is it possible that this situation happened because I use the annotation file (gtf) from GENCODE instead of yours in AA_DATA_REPO? And I didn't use ReadDepth to call CNV. I used CNVkit. Could they result in the faults I came across? I am looking forward to hear from you soon. Thanks.

alvinwt commented 4 years ago

Hi Viraj and team, the new branch has solved some of the stalled runs and I have come across another reason for the stalled run and crashes.

In my crashed cases, ne is None and the line with resulting in the error is in the link. I looked at IGV for the edge and it seems like Amplicon architect is doing the right thing but I am not sure why add_edge is returning None.

https://github.com/virajbdeshpande/AmpliconArchitect/blob/3834a8732b9dff9c03e211f80b6f7c0918e8fe70/src/bam_to_breakpoint.py#L2006