marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
660 stars 179 forks source link

v1.5 significantly slower than v1.4 #537

Closed kevfengler227 closed 7 years ago

kevfengler227 commented 7 years ago

Hello, I've noticed a significant slowdown with v1.5 versus v1.4 when running the exact command for assembling Pacbio corrected reads on a lsf cluster during the overlap jobs in 1-overlapper. The jobs are created and submitted to the cluster with the same resources, but with v1.5 these jobs take many hours instead of a few minutes to run. I am not sure how to pinpoint or troubleshoot the issue. I do see a difference in the 0-mercounts /CANU.ms22.estMerThresh.err file. Could the increase in maxCount below make the overlap jobs run a lot slower?

Thanks, KF

v1.4 Set maxCount to 155, which will cover 99.00% of distinct mers and 63.66% of all mers

v1.5 Set maxCount to 570 (3550 kmers), which will cover 99.75% of distinct mers and 72.00% of all mers. Average number of kmers between count 545 and 596 is 2081955 Found break in kmer coverage at 50866 Set maxCount to 12574 (7 kmers), which will cover 99.99% of distinct mers and 90.00% of all mers

canu-1.5/Linux-amd64/bin/canu -assemble -p CANU -d out genomeSize=1.0g -pacbio-corrected v3_10k.fasta useGrid=true minReadLength=3000 minOverlapLength=3000 contigFilter="5 20000 0.75 0.75 5"

brianwalenz commented 7 years ago

Nice detective work, that's exactly the problem.

This is deciding how repetitive kmer seeds can be when finding overlaps. Set it too low, and the assembler won't find overlaps in repeats; too high and it'll spend a long time trying to align repeats that don't align. I'm probably setting this way too high for long reads.

An alternate method is to use ovlMerDistinct=0.99 to use the 99% least common kmers for seeds. This is the same as what v1.4 picked, but it's not guaranteed to be the same. Your example is showing why I increased it - the old method picked a threshold of 155 occurrences; if you've got 50x coverage, this means that a 3-copy repeat won't have any overlaps. It's probably not a significant issue as the repeat needs to be exact and longer than a read, but when we're getting up to chromosome size contigs, even one break makes the stats terrible.

I'll leave this open to remind me to do something, but I've got nothing in particular planned.

morgancolp commented 7 years ago

My 1-overlapper in the unitigging step is also taking days on two of the 13 processes it plans to run. I'm using raw nanopore reads to assemble an ~40-45 Mbp eukaryotic genome with the input read file allowing about 55x coverage.

/scratch2/software/canu/canu-jun-2017/Linux-amd64/bin/canu \ -p Ac_canu2k -d Ac_canu2k \ genomeSize=41m \ maxMemory=45 \ maxThreads=16 \ useGrid=false \ -nanopore-raw ac_1d_2000K_cut.fq

In my Ac_canu2k.ms22.estMerThresh.err I am getting the following:

RAW MER COUNTS: distinct: 237068133 (different kmer sequences) unique: 158004535 (single-copy kmers) total: 1204286545 (kmers in sequences)

Set maxCount to 100 (1768 kmers), which will cover 99.75% of distinct mers and 78.48% of all mers. Average number of kmers between count 75 and 126 is 180914 Found break in kmer coverage at 6614 Set maxCount to 4419 (19 kmers), which will cover 99.98% of distinct mers and 90.01% of all mers.

Could the high count be a reason why it is taking so long? It only took a couple days to go through 6/13 processes but 7 and 8 have already been running for 5 days now. Are each of the processes supposed to take approximately the same amount of time?

Thanks, Morgan Colp

skoren commented 7 years ago

The steps don't take the same amount of time and vary with read length. Your issue is likely not the k-mer threshold but some long reads falling in the two lagging partitions (issue #521) and is a known issue with very long reads. You can try the suggestions there (including decreasing error rate, increasing the threads for the last two partitions) or you could try the faster overlapper using the options overlapper=mhap utgReAlign=true. You can do this in a separate assembly using the -assemble option and the trimmed reads from this run.

morgancolp commented 7 years ago

Thanks! I'll try decreasing the error rate for now and see if that helps. If not, I'll look further at that issue you referenced.

skoren commented 7 years ago

Closing, inactive. Please open a new issue if you encounter other errors with this run/assembly.

We should probably take the minimum of the 1.4 and 1.5 ways to estimate the threshold.