marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Feasibility of running Canu to assemble 80 million PacBio reads #186

xiwang2405 closed this issue 8 years ago

xiwang2405 commented 8 years ago

Hi,

We have an assembly project for a fairly large plant genome, for which we are generating about 80 million PacBio reads (40-50x coverage). My question is: has anyone had experience running this much data through Canu? Any suggestions?

Based on a test with a small set of PacBio data (5 million reads), it took me ~1 week to finish the assembly using ~400 CPUs. Since computational time grows quadratically with the number of reads, due to the all-against-all comparison in both the error-correction and assembly steps, I estimated that running Canu on 80 million reads with the same number of CPUs would take (80 million / 5 million)² × 1 week = 256 weeks, which is a very, very long time. Right now I am considering cloud computing as a solution: in theory, if the job could be run in parallel on 20,000 CPUs, it could finish in roughly a month.
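As a sanity check on that arithmetic, here is a small sketch (my assumptions: the 1-week/400-CPU baseline from the test run above, purely quadratic scaling in read count, and perfect linear scaling across CPUs):

```python
# Back-of-envelope projection: overlap time assumed quadratic in
# read count and inversely linear in CPU count.
baseline_weeks = 1.0       # observed: 5 million reads on ~400 CPUs
baseline_reads = 5e6
baseline_cpus = 400

target_reads = 80e6
scale = (target_reads / baseline_reads) ** 2   # 16**2 = 256

weeks_same_cpus = baseline_weeks * scale
print(weeks_same_cpus)     # 256.0 weeks on the same 400 CPUs

# Spread the same CPU-hours over 20,000 CPUs instead:
weeks_on_20k = weeks_same_cpus * baseline_cpus / 20000
print(weeks_on_20k)        # 5.12 weeks, i.e. roughly 1.2 months
```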

Does this maths make sense? I'd really appreciate it if anyone could offer suggestions or comments on this.

Another question I have: if the maths is correct, is Canu able to scale up to run on that many CPUs? Is there a limit there, say, where adding CPUs makes no difference (or makes things worse) above 1,000 CPUs?

Many thanks in advance for any suggestions/comments! Xi

brianwalenz commented 8 years ago

Do you have the logging from this run? Or at least remember where the time was spent? That seems excessively slow. There was a significant optimization made to the 3-overlapErrorAdjustment step on May 20th. It could also be a single CPU or disk bottleneck somewhere.

We've been able to run on around 6000 cores. The only scaling issue will be with disk bandwidth; all the large computations are embarrassingly parallel.

Our human assemblies (~25 million reads) take around 15k CPU hours.
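For scale, a rough comparison against the test run reported above (a sketch only; it assumes ~400 CPUs busy around the clock for the full week):

```python
# Compare CPU-hours: the reported 5M-read test vs. a typical
# ~25M-read human assembly, assuming full utilization.
test_cpu_hours = 400 * 7 * 24        # ~1 week on ~400 CPUs = 67,200
human_cpu_hours = 15_000             # ~25 million reads

print(test_cpu_hours)                      # 67200 CPU-hours for 5M reads
print(test_cpu_hours / human_cpu_hours)    # ~4.5x the CPU-hours for 5x fewer reads
```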

I don't know what the largest number of reads canu has assembled. The code is derived from an assembler that supported up to 2 billion.

Do you know how repetitive the genome is? A difficulty with large assemblies is organizing the overlap output into the ovlStore. This needs lots of disk, lots of bandwidth, a fair amount of memory, and OS support for lots of processes and open files.

xiwang2405 commented 8 years ago

Many thanks for your reply, brianwalenz!

The job is still on the 1-overlapper step, which has been running for 20 days. It has already produced ~7000 canu-logs/*.overlapInCore and unitigging/1-overlapper/overlap.*.out files and is still generating new ones. For your information, the job is a test using ~20 million reads on 60 cores, which I think is too few, but that is what we have here. Also, the genome to be assembled is about 15 Gb, ~80% of which is estimated to be repeats.

Thanks, Xi

skoren commented 8 years ago

We've run on 30-40m reads. A quick optimization would be to lower the error rate from the default of 0.025 for PacBio data to 0.013; I would expect that to significantly speed up this step. You could also try the experimental ovbOvlFilter=1 utgOvlFilter=1, which is only available in the latest code, not in a release; note that this option has not yet been fully tested.
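As a minimal sketch, the settings might be passed on the canu command line like this (asm, asm-run, and reads.fastq are placeholder names; genomeSize matches the ~15 Gb estimate above):

```
canu -p asm -d asm-run \
  genomeSize=15g \
  errorRate=0.013 \
  ovbOvlFilter=1 utgOvlFilter=1 \
  -pacbio-raw reads.fastq
```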