marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Is it possible to roughly estimate time left for correction? #1457

Closed hungweichen0327 closed 5 years ago

hungweichen0327 commented 5 years ago

Dear community,

I would like to ask how I can tell which step Canu is currently working on. Is it possible to roughly estimate the time Canu still needs for read correction?

So far I see the Crystal_all_1kb.corStore.WORKING folder in the correction folder, but I don't know what the corStore means. Is that one of the last few steps of the Canu correction process?

--                            (tag)Concurrency
--                     (tag)Threads          |
--            (tag)Memory         |          |
--        (tag)         |         |          |     total usage     algorithm
--        -------  ------  --------   --------  -----------------  -----------------------------
-- Local: meryl     24 GB    8 CPUs x  10 jobs   240 GB   80 CPUs  (k-mer counting)
-- Local: hap       12 GB   20 CPUs x   4 jobs    48 GB   80 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   32 GB   16 CPUs x   5 jobs   160 GB   80 CPUs  (overlap detection with mhap)
-- Local: obtovl    16 GB   16 CPUs x   5 jobs    80 GB   80 CPUs  (overlap detection)
-- Local: utgovl    16 GB   16 CPUs x   5 jobs    80 GB   80 CPUs  (overlap detection)
-- Local: ovb        4 GB    1 CPU  x  80 jobs   320 GB   80 CPUs  (overlap store bucketizer)
-- Local: ovs       16 GB    1 CPU  x  80 jobs  1280 GB   80 CPUs  (overlap store sorting)
-- Local: red       12 GB    8 CPUs x  10 jobs   120 GB   80 CPUs  (read error detection)
-- Local: oea        4 GB    1 CPU  x  80 jobs   320 GB   80 CPUs  (overlap error adjustment)
-- Local: bat      256 GB   16 CPUs x   1 job    256 GB   16 CPUs  (contig construction with bogart)
-- Local: gfa       16 GB   16 CPUs x   1 job     16 GB   16 CPUs  (GFA alignment and processing)

If possible, I would also like to know which steps take the most time and disk space. Thank you!

skoren commented 5 years ago

You're more than halfway through correction. The only steps left are filtering to select which reads will be used for correction (the step running now, which is single-core so may take a couple of days) and then calling a consensus for the corrected reads. The actual time will depend on your disk I/O performance and how many cores you have available for the jobs to run.
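
A rough way to see which correction step is currently active, as a sketch: the *.WORKING naming follows what you already see for the corStore, while the canu.out filename is only an assumption and may simply be wherever canu's stdout/stderr was redirected.

find correction -maxdepth 2 -name "*.WORKING"   # stage directories still in progress
ls -lt correction | head                        # most recently modified items first
tail -f canu.out                                # or whichever file canu's output is going to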

In terms of space, you've passed the peak usage; from here forward your space usage should decrease. After correction, Canu will remove the Crystal_all_1kb.ovlStore folder, which is the biggest user of space in the assembly.
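
If you want to watch where the space is going, a simple sketch (run from the top of the -d assembly directory; correction/ is the standard canu subdirectory shown in your listing above):

du -sh correction/* 2>/dev/null | sort -h | tail   # largest items, e.g. the ovlStore
df -h .                                            # free space left on this filesystem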

hungweichen0327 commented 5 years ago

Dear Sergey Koren, thank you for the reply. The filtering step to select reads finished within one day; now it's calling a consensus for the corrected reads. One more thing I want to confirm: "the longest 40X of reads will be corrected (by default)" means that all the raw reads, including the longest 40X of raw reads, are used to correct the longest 40X (or slightly over 40X) of raw reads, and in the end about 40X of the longest corrected reads are generated in a FASTA file. Is that right?

skoren commented 5 years ago

Yes, all data is used to select/correct the longest 40x based on coverage as well. That is, a 100kb read that only has support for 2kb of its sequence from other reads is not considered a long read. Canu will also correct more than 40x if it identifies read sequences which aren't represented by the longest subset. This often happens with mitochondria or plasmids.
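
For context, that 40x target corresponds to canu's corOutCoverage parameter (default 40, as far as I know from the canu documentation). A hedged sketch of raising it on a correction-only run; the input file name here is purely illustrative, not from this thread:

canu -correct -p Crystal_all_1kb -d correction_80x \
  genomeSize=508m corOutCoverage=80 \
  -nanopore-raw raw_reads.fasta   # illustrative input name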

If the filtering step is complete, you can check your report file which will have a table summarizing how many reads were selected and their statistics (expected coverage, expected read n50, etc).
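
For example, assuming the report file is named after the same run prefix used elsewhere in this thread and sits at the top of the -d directory:

less Crystal_all_1kb.report                    # full report
grep -in "corrected" Crystal_all_1kb.report    # find the read-selection summaries (exact wording varies by canu version)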

hungweichen0327 commented 5 years ago

Thank you for the explanation, it's very clear! I have no more questions related to this issue.

hungweichen0327 commented 5 years ago

Dear Dr. Sergey,

Is it possible to roughly estimate the time needed for assembly? Correction took about 2 weeks. The assembly will use the 40X coverage of corrected reads (580m genome size).

The script for assembly is below:

canu -assemble -p Crystal -d Crystal-ErRate0.144 genomeSize=508m correctedErrorRate=0.144 -nanopore-corrected Crystal_all_1kb.correctedReads.fasta

PS: I didn't use HAC basecalling in Guppy, so I set correctedErrorRate=0.144 (hence the Crystal-ErRate0.144 directory name).

And the summary of corrected reads is below:

Assembly                    Crystal_all_1kb.correctedReads
# contigs (>= 0 bp)         809565                        
# contigs (>= 1000 bp)      805099                        
# contigs (>= 5000 bp)      626721                        
# contigs (>= 10000 bp)     598880                        
# contigs (>= 25000 bp)     567057                        
# contigs (>= 50000 bp)     99846                         
Total length (>= 0 bp)      23977791799                   
Total length (>= 1000 bp)   23973889473                   
Total length (>= 5000 bp)   23461933413                   
Total length (>= 10000 bp)  23256483983                   
Total length (>= 25000 bp)  22764687731                   
Total length (>= 50000 bp)  6426913098                    
# contigs                   809452                        
Largest contig              245030                        
Total length                23977749846                   
GC (%)                      33.67                         
N50                         38764                         
N75                         31178                         
L50                         227410                        
L75                         400597    
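
(A quick sanity check of the corrected coverage from the totals above, using both genome-size figures mentioned in this thread; a sketch, not canu output:)

awk 'BEGIN {
  total = 23977749846;                                      # "Total length" from the table above
  printf "coverage at 508 Mb: %.1fx\n", total / 508000000;  # genomeSize used in the command
  printf "coverage at 580 Mb: %.1fx\n", total / 580000000;  # figure quoted in the prose
}'

Both land in the roughly 41-47x range, consistent with the "longest 40X (or slightly over)" behavior discussed above.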

skoren commented 5 years ago

I expect it will take a similar amount of time to assemble as it did to correct, perhaps a bit longer (2-4 weeks). There tend to be a few reads that take longer to find overlaps for due to their length, which means all jobs but one finish fast but you are stuck waiting for the last job to finish.
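
A rough way to tell when you're down to those last long-running jobs, as a sketch: it assumes canu's usual unitigging/ layout under the -d directory, local execution, and that overlapInCore is the overlapper binary name.

find Crystal-ErRate0.144/unitigging -maxdepth 2 -name "*.WORKING"   # sub-steps still in progress
ps -ef | grep -c "[o]verlapInCore"                                  # overlap jobs still running on this machine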

hungweichen0327 commented 5 years ago

Thank you.