Closed hungweichen0327 closed 5 years ago
You're more than halfway through correction. The remaining steps are filtering to select which reads will be used for correction (the step running now, which is single-core and so may take a couple of days) and then calling a consensus for the corrected reads. The actual time will depend on your disk I/O performance and how many cores you have available for the jobs to run.
In terms of space, you've passed the peak usage; from here forward your space usage should decrease. After correction, canu will remove the Crystal_all_1kb.ovlStore folder, which is the biggest user of space in the assembly.
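To see which folders currently dominate disk usage, something like the following could be run against the run directory. This is only a minimal sketch: the helper names `dir_size_bytes` and `largest_subdirs` are made up for illustration and are not part of canu, and the script simply sums file sizes on disk.

```python
import os
import sys


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # The file may have been removed while a job was running.
                pass
    return total


def largest_subdirs(parent, top=5):
    """Return (name, bytes) for the immediate subdirectories of parent, largest first."""
    sizes = []
    for entry in os.scandir(parent):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((entry.name, dir_size_bytes(entry.path)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    # Hypothetical usage: pass the folder to inspect, e.g. the correction folder.
    parent = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, size in largest_subdirs(parent):
        print(f"{size / 1e9:10.2f} GB  {name}")
```

Running it again after correction finishes should show the ovlStore folder gone from the list.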
Dear Sergey Koren, Thank you for the reply. The filtering step that selects reads finished within one day. Now it's calling a consensus for the corrected reads. One more thing I want to confirm: "the longest 40X reads will be corrected (by default)" means that all the raw reads, including the longest 40X raw reads, are used to correct the longest 40X (or slightly over 40X) raw reads, and in the end about 40X of the longest corrected reads are written to a FASTA file. Is that right?
Yes, all data is used to select and correct the longest 40x, and the selection takes coverage into account as well. That is, a 100kb read that only has support for 2kb of its sequence from other reads is not considered a long read. Canu will also correct more than 40x if it identifies read sequences which aren't represented by the longest subset. This often happens with mitochondria or plasmids.
If the filtering step is complete, you can check your report file which will have a table summarizing how many reads were selected and their statistics (expected coverage, expected read n50, etc).
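The idea of "the longest 40x" can be illustrated with a simplified sketch. This is not canu's actual selection code: it ranks reads by raw length only, whereas, as noted above, canu also accounts for how much of each read is supported by other reads. The function name `select_longest` is hypothetical.

```python
def select_longest(read_lengths, genome_size, target_coverage=40):
    """Pick reads from longest to shortest until the target coverage is reached.

    Returns the selected lengths and the coverage they add up to. The last
    read added may push the total slightly over the target.
    """
    budget = genome_size * target_coverage  # total bases wanted
    selected = []
    total = 0
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break
        selected.append(length)
        total += length
    return selected, total / genome_size


if __name__ == "__main__":
    # Toy numbers only, not taken from the run discussed in this thread.
    chosen, coverage = select_longest([10, 8, 6, 4, 2], genome_size=10, target_coverage=2)
    print(chosen, coverage)
```

Because whole reads are added until the budget is met, the result is usually slightly over the requested coverage rather than exactly 40x.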
Thank you for the explanation. It's very clear! I have no more questions related to this issue.
Dear Dr. Sergey,
Is it possible to roughly estimate how long assembly will take? Correction took about 2 weeks. About 40X coverage (580m genome size) will go into assembly.
The command for assembly is below:
canu -assemble -p Crystal -d Crystal-ErRate0.144 genomeSize=508m correctedErrorRate=0.144 -nanopore-corrected Crystal_all_1kb.correctedReads.fasta
PS: I didn't use HAC for basecalling with Guppy, so I set correctedErrorRate=0.144 (hence the directory name Crystal-ErRate0.144).
And the summary of corrected reads is below:
Assembly Crystal_all_1kb.correctedReads
# contigs (>= 0 bp) 809565
# contigs (>= 1000 bp) 805099
# contigs (>= 5000 bp) 626721
# contigs (>= 10000 bp) 598880
# contigs (>= 25000 bp) 567057
# contigs (>= 50000 bp) 99846
Total length (>= 0 bp) 23977791799
Total length (>= 1000 bp) 23973889473
Total length (>= 5000 bp) 23461933413
Total length (>= 10000 bp) 23256483983
Total length (>= 25000 bp) 22764687731
Total length (>= 50000 bp) 6426913098
# contigs 809452
Largest contig 245030
Total length 23977749846
GC (%) 33.67
N50 38764
N75 31178
L50 227410
L75 400597
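As a rough cross-check of the summary above, the coverage and N50 can be recomputed with a few lines. This is a sketch under stated assumptions: the total length is copied from the table, both genome sizes that appear in this thread (580m in the text, 508m in the command) are tried, and `n50_l50` is an illustrative helper that uses the usual definition of N50 and L50, not a canu or QUAST function.

```python
def coverage(total_bases, genome_size):
    """Expected coverage: total bases divided by genome size."""
    return total_bases / genome_size


def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths.

    N50 is the length of the sequence at which the running total, taken from
    longest to shortest, first reaches half of all bases. L50 is how many
    sequences that takes.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no lengths given")


if __name__ == "__main__":
    total_bases = 23977749846  # "Total length" from the summary above
    for genome_size in (580_000_000, 508_000_000):
        print(f"genomeSize={genome_size}: {coverage(total_bases, genome_size):.2f}x")
    # Toy lengths to show the N50/L50 helper; real input would be the read lengths.
    print(n50_l50([10, 8, 6, 4, 2]))
```

With either genome size, the corrected reads come out somewhat above 40x, which matches the note earlier in the thread that canu may correct slightly more than the target.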
I expect assembly will take a similar time to correction, perhaps a bit longer (2-4 weeks). There tend to be a few reads that take longer to find overlaps for due to their length, which means all jobs but one finish fast and you are stuck waiting for the last job to finish.
Thank you.
Dear community,
I would like to ask how to tell which step canu is currently working on. Is it possible to roughly estimate how much time is left in the canu run for read correction?
So far I see the Crystal_all_1kb.corStore.WORKING folder in the correction folder, but I don't know what corStore means. Is this one of the last few steps in the canu correction process? I would also like to know which steps take the most time and disk space, if possible. Thank you!