marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Canu takes >7 days at cormhap for a highly repetitive genome (~1G) #1329

Closed YuanwenGuo closed 5 years ago

YuanwenGuo commented 5 years ago

Thank you for developing such a great tool to facilitate genome assembly!

I am trying to use all available nodes on our university Slurm cluster, but the cormhap step has been running for more than seven days and still has not finished. I checked the mhap.*.out files, and it seems that only 164 of the 597 batch jobs have finished. Is there any way to accelerate the assembly?

Our genome is from a highly repetitive plant species, with an estimated size of ~1G. We have about 15X coverage of Nanopore data.
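For a rough sense of scale, the figures above (164 of 597 jobs done after seven days) can be extrapolated. This is only a sketch: it assumes throughput stays constant, which the thread does not confirm, and `estimate_remaining_days` is a hypothetical helper, not part of Canu.

```shell
#!/bin/bash
# Rough ETA for the remaining cormhap jobs.
# Assumption: jobs keep finishing at the same average rate as so far.
# Usage: estimate_remaining_days FINISHED TOTAL ELAPSED_DAYS
estimate_remaining_days() {
    awk -v finished="$1" -v total="$2" -v elapsed="$3" 'BEGIN {
        if (finished <= 0) {
            # No completed jobs means no rate to extrapolate from.
            print "unknown"
            exit 1
        }
        # remaining jobs divided by (finished / elapsed) jobs per day
        printf "%.1f\n", (total - finished) * elapsed / finished
    }'
}

estimate_remaining_days 164 597 7   # prints 18.5
```

At that rate the step would need roughly another two and a half weeks, so changing parameters or adding coverage is likely to matter more than waiting.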

I am using canu-1.8, and my command is:

#!/bin/bash

/canu-1.8/Linux-amd64/bin/canu gridOptions="--time=80:00:00 --partition=killable.q" gridOptionsJobName=canu -p canu_Nano -d ($) genomeSize=1g correctedErrorRate=0.154 -nanopore-raw ($.fastq)

I would appreciate any suggestions or comments about this issue!

Best, Yuanwen

skoren commented 5 years ago

First off, 15x is not really enough to assemble, and that is part of why it's taking so long. At low coverage the overlap parameters are automatically turned up, which slows down the computation.

There are some suggestions for repetitive genomes on the FAQ: https://canu.readthedocs.io/en/latest/faq.html#my-assembly-is-running-out-of-space-is-too-slow but I expect this will degrade the assembly at your lower coverage. You could try the parameters suggested there but leave out --threshold 0.80 --num-hashes 512 --num-min-matches 3. We typically recommend at least 20x coverage.
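A minimal sketch of what that command might look like follows. The option values are reproduced from the linked FAQ section as it stood around Canu 1.8 and should be checked against the FAQ before use; `OUT_DIR` and `READS` are hypothetical placeholders for the output directory and read file. The script only prints the command so it can be reviewed first.

```shell
#!/bin/bash
# Sketch only: prints a candidate canu command instead of running it.
# OUT_DIR and READS are hypothetical placeholders, not values from the thread.
OUT_DIR="${OUT_DIR:-canu_out}"
READS="${READS:-reads.fastq}"

# FAQ options for repetitive genomes, with --threshold, --num-hashes and
# --num-min-matches left out as suggested above.
# Assumption: values as recalled from the FAQ; verify against the link.
COR_MHAP_OPTIONS="--ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50"

build_canu_cmd() {
    # echo joins its arguments with single spaces into one command line.
    echo "canu -p canu_Nano -d ${OUT_DIR}" \
         "genomeSize=1g correctedErrorRate=0.154" \
         "gridOptions=\"--time=80:00:00 --partition=killable.q\"" \
         "gridOptionsJobName=canu" \
         "corMhapFilterThreshold=0.0000000002" \
         "corMhapOptions=\"${COR_MHAP_OPTIONS}\"" \
         "mhapMemory=60g mhapBlockSize=500 ovlMerDistinct=0.975" \
         "-nanopore-raw ${READS}"
}

build_canu_cmd
```

Printing rather than executing keeps the sketch safe to run anywhere; once the options are confirmed, the printed line can be pasted into the submission script.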

YuanwenGuo commented 5 years ago

Thank you for the prompt reply! I will try running Canu with the parameters you suggested.

Right now we only have 15X Nanopore long-read data for the assembly. We also have about 50X Illumina short-read data. We plan to use Pilon for further polishing after the Canu assembly, as suggested by the Canu quick start tutorial. Would this combined strategy produce a reasonable assembly?

Best, Yuanwen

skoren commented 5 years ago

The issue with 15x isn't so much the base quality but that you might not have enough coverage to assemble the full genome. In that case Pilon isn't going to help. If you're not planning to get more than 20x coverage, I'd suggest trying a hybrid assembler instead.

YuanwenGuo commented 5 years ago

Thank you for the suggestion! We tried a hybrid assembler, MaSuRCA, but it is taking too long to finish the assembly. I will probably try the parameters you suggested and see how it works.

Much appreciated! Yuanwen

skoren commented 5 years ago

Closing since the initial issue is explained by the very low coverage. Feel free to post updates on how the assembly turns out if you get one.