iprada / Circle-Map

A method for circular DNA detection based on probabilistic mapping of ultrashort reads
MIT License

How to speed up the last step of Circle-Map Realign? #35

Open junchend opened 4 years ago

junchend commented 4 years ago

Dear iprada,

Recently, I have been trying to use Circle-Map to call eccDNA. I followed the tutorial you posted here, using the Realign module. My issues are:

  1. What's the difference between the Realign and Repeats modules, and which one is best for calling eccDNA from WGS data?

  2. How can I speed up the last step of Realign? I used the following command:

/public/software/Anaconda3/bin/Circle-Map Realign \
  -t 20 \
  -i S1_sort_circular_read_candidates.bam \
  -qbam S1_qname_unknown_circle.bam \
  -sbam S1_sorted_unknown_circle.bam \
  -fasta /public/genomes/Hsapiens/hg38/seq/hg38.fa \
  -o S1_unknown_circle.bed \
  &> S1_Circle-Map_detect.log

I am running on an HPC cluster and requested a single node with "nodes=1:ppn=20,mem=64g". I do get the BED files for eccDNA, but the last step always needs 5 days or more to run.

Should I downsample the bam file?

Thanks so much!

Juncheng

iprada commented 4 years ago

Dear Juncheng,

Thanks a lot for using Circle-Map.

Regarding:

What's the difference between the Realign and Repeats modules, and which one is best for calling eccDNA from WGS data?

The Realign module is designed to detect circles based on split reads, and it should be the primary module for detecting circles. Use this one.

The repeats module was designed for part of my PhD on the yeast genome. In yeast, many circles form recurrently from repetitive parts of the genome and we were not able to detect them using split read based methods. I developed Repeats to detect those coming from the hard-to-align regions. You can read the details here (https://www.biorxiv.org/content/10.1101/2020.02.11.943357v1.abstract)

Regarding:

How to speed up the last step of Realign?

I need to know more to be able to help you. Could you post the last 10 lines of the log file?

junchend commented 4 years ago

Dear iprada,

Thanks so much for the quick response.

For the second issue, from the log file, I got the following message:

118it [11:02:20, 377.03s/it]
4%|▍ | 119/3000 [11:08:26<299:08:36, 373.80s/it]
119it [11:08:26, 373.81s/it]
4%|▍ | 120/3000 [11:14:56<302:55:08, 378.65s/it]
120it [11:14:56, 378.64s/it]
4%|▍ | 121/3000 [11:48:13<691:04:38, 864.15s/it]
121it [11:48:13, 864.14s/it]
4%|▍ | 122/3000 [11:49:50<506:57:23, 634.14s/it]
122it [11:49:50, 634.14s/it]
4%|▍ | 123/3000 [11:52:07<387:28:17, 484.84s/it]
123it [11:52:07, 484.84s/it]
4%|▍ | 124/3000 [11:53:16<287:44:04, 360.17s/it]
124it [11:53:16, 360.17s/it]
4%|▍ | 125/3000 [11:57:55<268:13:34, 335.87s/it]
125it [11:57:55, 335.86s/it]
4%|▍ | 126/3000 [11:58:03<189:34:34, 237.47s/it]
126it [11:58:03, 237.46s/it]
4%|▍ | 127/3000 [12:10:27<310:53:40, 389.56s/it]
127it [12:10:27, 389.58s/it]
4%|▍ | 128/3000 [12:26:36<449:15:19, 563.13s/it]
128it [12:26:36, 563.13s/it]
4%|▍ | 129/3000 [12:50:12<653:20:50, 819.24s/it]
129it [12:50:12, 819.25s/it]
4%|▍ | 130/3000 [12:55:59<540:11:11, 677.59s/it]
130it [12:55:59, 677.59s/it]
4%|▍ | 131/3000 [13:13:28<628:36:58, 788.78s/it]
131it [13:13:28, 788.78s/it]
4%|▍ | 132/3000 [13:29:14<665:58:51, 835.96s/it]
132it [13:29:14, 835.96s/it]
4%|▍ | 133/3000 [13:33:22<525:16:09, 659.56s/it]
133it [13:33:22, 659.56s/it]
4%|▍ | 134/3000 [13:37:06<421:01:51, 528.86s/it]
134it [13:37:06, 528.86s/it]
4%|▍ | 135/3000 [13:37:25<299:12:40, 375.97s/it]
135it [13:37:25, 375.97s/it]
5%|▍ | 136/3000 [13:37:59<217:31:44, 273.43s/it]
136it [13:37:59, 273.43s/it]
5%|▍ | 137/3000 [13:43:32<231:44:04, 291.39s/it]
137it [13:43:32, 291.39s/it]
5%|▍ | 138/3000 [13:46:31<204:47:57, 257.61s/it]
138it [13:46:31, 257.61s/it]

The percentage moved slowly, and some log files showed:

849it [46:50, 1.34s/it]
28%|██▊ | 850/3000 [46:51<34:52, 1.03it/s]
850it [46:51, 1.03it/s]
28%|██▊ | 851/3000 [46:55<1:11:36, 2.00s/it]
851it [46:55, 2.00s/it]
28%|██▊ | 852/3000 [46:59<1:33:01, 2.60s/it]
852it [46:59, 2.60s/it]
28%|██▊ | 853/3000 [47:01<1:23:35, 2.34s/it]
853it [47:01, 2.34s/it]
28%|██▊ | 854/3000 [47:06<1:55:53, 3.24s/it]
854it [47:06, 3.24s/it]
28%|██▊ | 855/3000 [47:06<1:25:22, 2.39s/it]
855it [47:06, 2.39s/it]
29%|██▊ | 856/3000 [47:11<1:52:41, 3.15s/it]
856it [47:11, 3.15s/it]
29%|██▊ | 857/3000 [47:12<1:24:13, 2.36s/it]
857it [47:12, 2.36s/it]
29%|██▊ | 858/3000 [47:12<1:00:28, 1.69s/it]
858it [47:12, 1.69s/it]
29%|██▊ | 859/3000 [47:12<44:47, 1.26s/it]
29%|██▊ | 859/3000 [47:13<1:57:41, 3.30s/it]
2020-05-13 17:53:26: An error happenend during execution. Exiting

or

2934it [2:16:33, 1.12it/s]
98%|█████████▊| 2935/3000 [2:16:37<01:43, 1.59s/it]
2935it [2:16:37, 1.59s/it]
98%|█████████▊| 2936/3000 [2:16:43<03:12, 3.01s/it]
2936it [2:16:43, 3.01s/it]
98%|█████████▊| 2937/3000 [2:16:46<03:06, 2.96s/it]
2937it [2:16:46, 2.96s/it]
98%|█████████▊| 2938/3000 [2:16:49<03:14, 3.13s/it]
2938it [2:16:49, 3.13s/it]
98%|█████████▊| 2939/3000 [2:16:50<02:15, 2.23s/it]
2939it [2:16:50, 2.23s/it]
98%|█████████▊| 2940/3000 [2:16:55<03:03, 3.06s/it]
2940it [2:16:55, 3.06s/it]
98%|█████████▊| 2941/3000 [2:16:57<02:50, 2.90s/it]
98%|█████████▊| 2941/3000 [2:16:58<02:44, 2.79s/it]
2020-05-13 19:15:05: An error happenend during execution. Exiting
2940it [2:17:01, 2.80s/it]

I could not find any other message.

Could you please help me check the process and tell me how to solve the error or speed up the last step?
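For anyone trying to read progress logs like the ones above, here is a minimal Python sketch that pulls the numbers out of the tqdm-style progress fragments. The helper names (`parse_progress`, `to_seconds`, `mean_rate_estimate`) are made up for this illustration and are not part of Circle-Map; the regular expression only assumes the `done/total [elapsed<remaining, rate]` layout visible in the log.

```python
import re

# Matches progress-bar fragments such as
# "119/3000 [11:08:26<299:08:36, 373.80s/it]".
PROGRESS = re.compile(
    r"(\d+)/(\d+) \[([\d:]+)<([\d:]+), ([\d.]+)(s/it|it/s)\]"
)


def to_seconds(clock):
    """Convert 'MM:SS' or 'H:MM:SS' (hours may exceed 24) to seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_progress(log_text):
    """Return one (done, total, elapsed_s, seconds_per_iteration) per update."""
    rows = []
    for done, total, elapsed, _eta, rate, unit in PROGRESS.findall(log_text):
        rate = float(rate)
        if rate == 0:
            continue  # avoid dividing by zero on a degenerate "0.00it/s"
        sec_per_it = rate if unit == "s/it" else 1.0 / rate
        rows.append((int(done), int(total), to_seconds(elapsed), sec_per_it))
    return rows


def mean_rate_estimate(rows):
    """Estimate the total runtime in hours from the average pace so far."""
    done, total, elapsed, _ = rows[-1]
    return (elapsed / done) * total / 3600
```

Applied to the first excerpt, the last update (138/3000 after 13:46:31) works out to roughly 360 s per iteration on average, or about 300 hours for all 3000 iterations if that pace held.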

Thanks so much!

Juncheng

iprada commented 4 years ago

Dear Juncheng,

I assume the log you are showing is the log output of the command you show above:

/public/software/Anaconda3/bin/Circle-Map Realign -t 20 -i S1_sort_circular_read_candidates.bam -qbam S1_qname_unknown_circle.bam -sbam S1_sorted_unknown_circle.bam -fasta /public/genomes/Hsapiens/hg38/seq/hg38.fa -o S1_unknown_circle.bed &> S1_Circle-Map_detect.log

Could you execute your command as follows and send me the resulting files, S1_stdout.txt and stderr.txt:

/public/software/Anaconda3/bin/Circle-Map Realign -t 20 -i S1_sort_circular_read_candidates.bam -qbam S1_qname_unknown_circle.bam -sbam S1_sorted_unknown_circle.bam -fasta /public/genomes/Hsapiens/hg38/seq/hg38.fa -o S1_unknown_circle.bed > S1_stdout.txt 2> stderr.txt

Then I will have a look at the errors and try to help you.
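If the job is launched from a Python wrapper rather than directly from the shell, the same separation of the two output streams can be sketched as below. This is only an illustration: `run_with_separate_logs` is a hypothetical helper name, nothing in it is specific to Circle-Map, and the commented call simply reuses the arguments from the command above.

```python
import subprocess
import sys


def run_with_separate_logs(cmd, stdout_path, stderr_path):
    """Run cmd, writing stdout and stderr to two different files."""
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        # Equivalent to the shell form: cmd > stdout_path 2> stderr_path
        completed = subprocess.run(cmd, stdout=out, stderr=err)
    # A non-zero return code means the program reported a failure.
    return completed.returncode


# Example call (commented out so that importing this file runs nothing):
# run_with_separate_logs(
#     ["/public/software/Anaconda3/bin/Circle-Map", "Realign",
#      "-t", "20",
#      "-i", "S1_sort_circular_read_candidates.bam",
#      "-qbam", "S1_qname_unknown_circle.bam",
#      "-sbam", "S1_sorted_unknown_circle.bam",
#      "-fasta", "/public/genomes/Hsapiens/hg38/seq/hg38.fa",
#      "-o", "S1_unknown_circle.bed"],
#     "S1_stdout.txt", "stderr.txt")
```

Keeping the streams apart matters here because the progress bar and the error message would otherwise end up interleaved in a single file.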

Regarding speed:

Can you elaborate a bit on your setup? Number of reads, type of data, organism... If you are not comfortable writing that here, you can send me an email. This will allow me to give you some hints about speeding up your search.

Best,

Inigo

junchend commented 4 years ago

Dear Prof. Inigo,

I have sent a mail ..., is it your E-mail address?

Thanks !

Juncheng

iprada commented 4 years ago

Dear Junchend,

I usually do not read my emails on the weekend. I will answer your email in a few minutes. I have also edited your post to remove my email address.

Best,

Inigo

junchend commented 4 years ago

Dear Prof. Inigo,

Thanks so much for the response. I will carefully read the suggestions in your email.

I hope a new version of Circle-Map that handles deep-coverage data will be available in the near future.

It would help many researchers find eccDNAs.

Best,

Juncheng

iprada commented 4 years ago

Dear Junchend,

Thanks a lot for the nice words. I have been working on a new version for a while, and I hope to have it ready within the next couple of months.

Best,

Inigo

panxiaoguang commented 4 years ago


I have the same problem as you. How did you solve it?

iprada commented 4 years ago

Hi @panxiaoguang,

I am copying and pasting from issue #37

"I have been looking into the code to see what is going on, and I have narrowed it down to a few lines of code that perform poorly.

Briefly described, this is what is going on:

Circle-Map collects the realigned intervals (n of them) and the discordant read intervals (m of them) independently. It then assigns each discordant interval that lies within a reasonable distance (mean insert size + 3 standard deviations) of a realigned interval to that realigned interval.

The computation described above has a complexity of O(nm) ~ O(n**2). This has never been a problem for me with any of the data that has gone through Circle-Map. However, it seems reasonable that it can become a problem in some corner cases where n and m are very large.

I should of course fix this. I am submitting my PhD on the 1st of October, and I will look at this afterwards."
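The pairing step described above can be illustrated with a short Python sketch. It is not Circle-Map's actual code: the function names `assign_naive` and `assign_sorted`, the `(start, end)` tuple layout, and the restriction to a single chromosome are all assumptions made for the example. The first function compares every realigned interval with every discordant interval, which is the O(nm) pattern. The second sorts the discordant intervals once and uses binary search to narrow the candidates, which is one possible way to cut the work when n and m are large.

```python
from bisect import bisect_left, bisect_right


def assign_naive(realigned, discordant, max_dist):
    """O(n*m): test every discordant interval against every realigned one."""
    result = {}
    for r_start, r_end in realigned:
        hits = []
        for d_start, d_end in discordant:
            # Keep discordant intervals that overlap the realigned interval
            # once it is extended by max_dist on both sides. In the thread,
            # max_dist is the mean insert size + 3 standard deviations.
            if d_start <= r_end + max_dist and d_end >= r_start - max_dist:
                hits.append((d_start, d_end))
        result[(r_start, r_end)] = hits
    return result


def assign_sorted(realigned, discordant, max_dist):
    """Sort once, then binary-search a candidate window per realigned interval."""
    disc = sorted(discordant)
    starts = [start for start, _ in disc]
    # The longest discordant interval bounds how far left a hit can start.
    max_len = max((end - start for start, end in disc), default=0)
    result = {}
    for r_start, r_end in realigned:
        lo = bisect_left(starts, r_start - max_dist - max_len)
        hi = bisect_right(starts, r_end + max_dist)
        # Only the candidates in disc[lo:hi] need the exact end-point check.
        result[(r_start, r_end)] = [
            (s, e) for s, e in disc[lo:hi] if e >= r_start - max_dist
        ]
    return result
```

Both functions return the same assignments (the second in sorted order), but the sorted variant costs roughly O((n + m) log m) plus the size of the candidate windows, instead of n times m comparisons.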

Best,

Iñigo

cchd0001 commented 2 years ago


Dr. Prada-Luengo,

I am also affected by this issue. Could you provide more details about the "few lines of code"? For example, which function they are in, or the filename and line numbers of that code.

Best wishes,
Lidong Guo