marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Some questions about canu #126

Closed; danshu closed this issue 8 years ago

danshu commented 8 years ago

Dear All,

I'm new to PacBio assembly and I have several questions about canu that I could not find answers to. I'm posting them here; any ideas are welcome.

  1. I have heard it's recommended to use about 25x of corrected reads for assembly. Will canu use all the corrected reads, or will it by default use only the longest 25x? I also failed to find any parameter that sets the coverage of corrected reads used for assembly.
  2. Canu outputs three fasta files: contig.fasta, bubble.fasta, and unassembled.fasta. Is the final assembly the sum of contig.fasta and bubble.fasta? My unassembled.fasta is about 4 times the size of contig.fasta; is this normal? Are there any ways to make further use of these files?
  3. It was recommended to run a parameter sweep to optimize the assembly. When should this be done: after the assembly has finished? I saw loops in the sweep script; is it assembling with different combinations of minOverlapLength and errorRate? Would the running time then be proportional to the number of combinations?
  4. I saw people always suggesting running canu as a single command. Is that because running canu manually in three steps (correction, trimming, assembly) differs from running it in a single command? If so, what is the difference?

Best, Quan

brianwalenz commented 8 years ago
  1. We've seen that assembling more than 25x doesn't give any improvement, and sometimes degrades the assembly; I suspect that 'noise' events become significant. There isn't a strict 25x filter in place. Rather, we correct the longest reads (by expected corrected length) to give some output coverage: corOutCoverage. The correction is lossy, and the default is 40x.
  2. Bubbles are isolated because they are, in some sense, redundant. But, yes, the output is contigs + bubbles. There might be useful stuff in unassembled, and it is worth checking, but generally this is stuff that is basically a single read. There will likely be second-best hits in here, so you then need to compare the results against any contig hits.

2b. One to two times genome size is a typical size of unassembled. These are high error reads, reads that were trimmed poorly, 'contigs' that are almost entirely spanned by a single read, and probably other stuff. Some repeats might end up in here.

  3. The usual parameter sweep involves just the contig construction, and is run outside canu. bogart-sweep.pl in the source tree tries far more parameters than I'd use for a production assembly. It's also out of date: most of the parameters it sweeps over don't exist any more.
  4. The two methods are exactly the same. The pipeline was split into three pieces to allow running just correction, or assembling reads corrected by something else (and, optimistically, for when reads don't need correction).
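The three-step form also makes a manual sweep cheap, since only the assembly stage needs rerunning per combination. A rough sketch, assuming the stage flags (-correct, -trim, -assemble) and corrected/trimmed read filenames from the canu docs; genome size and parameter values here are hypothetical:

```shell
# One command runs all three stages:
canu -p asm -d run1 genomeSize=500m -pacbio-raw reads.fastq.gz

# Equivalent three-step form; correction is where corOutCoverage (default 40) applies:
canu -correct  -p asm -d run1 genomeSize=500m corOutCoverage=40 -pacbio-raw reads.fastq.gz
canu -trim     -p asm -d run1 genomeSize=500m -pacbio-corrected run1/asm.correctedReads.fasta.gz
canu -assemble -p asm -d run1 genomeSize=500m -pacbio-corrected run1/asm.trimmedReads.fasta.gz

# A sweep reruns only the assembly stage, so total time grows with
# the number of parameter combinations:
for ovl in 500 1000 2000; do
  for er in 0.025 0.035 0.045; do
    canu -assemble -p asm -d "sweep-${ovl}-${er}" genomeSize=500m \
         minOverlapLength=${ovl} errorRate=${er} \
         -pacbio-corrected run1/asm.trimmedReads.fasta.gz
  done
done
```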
danshu commented 8 years ago

@brianwalenz Thanks for your detailed answers! So if I have 30x corrected reads, would you suggest trying a canu assembly with the longest 25x? For the bubbles, I'm thinking about merging them into our assembly because they are just heterozygous sequences from a diploid genome. Has canu recorded the sequences or reads of their neighbouring nodes?

danshu commented 8 years ago

Hi,

I have some more questions.

  1. How do I set the repeat detection parameters? Would setting the genome size to half the best estimate be equivalent to setting 2x for the second repeat-detection parameter?
  2. What is the suggested corMaxEvidenceErate for a repetitive genome smaller than 500 Mb?
  3. I found that the guessed X coverage is usually about twice the real coverage. As I'm assembling a diploid genome, is this because canu cannot distinguish between haploid and diploid genomes?

Thanks!

brianwalenz commented 8 years ago

For your earlier question, give it all the reads you have.

  1. Those parameters have been removed in the latest version.
  2. That depends more on the quality of the reads you have. I can't recall ever using this parameter, except in early testing. Leave it unset unless you know you have abnormally bad reads.
  3. Canu is expecting the haploid size. I've never seen the guessed coverage (from kmers) overestimate coverage; it usually badly underestimates it.
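To illustrate why the haploid size matters, a back-of-envelope coverage calculation (numbers hypothetical): the expected coverage is roughly total input bases divided by genomeSize, so supplying the diploid figure halves the reported coverage.

```python
# Hypothetical numbers: expected coverage from raw read bases and genome size.
total_read_bases = 9_000_000_000   # 9 Gbp of input reads
haploid_size     = 300_000_000     # 300 Mbp haploid estimate
diploid_size     = 2 * haploid_size

print(total_read_bases / haploid_size)   # 30.0x with the haploid size
print(total_read_bases / diploid_size)   # 15.0x if the diploid size is given
```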
JohnUrban commented 8 years ago

Hi @brianwalenz As for your response to "3" -- what about issue #63 ?

brianwalenz commented 8 years ago

Do the kmer histograms look sane? This is using a very simple method to find the expected hump in the kmer histogram. The peak of the hump is the reported coverage.

This coverage isn't used anywhere, but the histograms (and possibly the hump-finding method) are used to pick a kmer occurrence threshold for the overlapper used for trimming and unitigging. A higher threshold will use more repetitive seeds (and thus more CPU).
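The "very simple method" described here can be sketched as follows (an illustration only, not canu's actual implementation): walk past the descending error tail at low kmer counts, then take the peak of the remaining hump as the guessed coverage.

```python
def guess_coverage(hist):
    """hist[i] = number of distinct kmers occurring exactly i times (hist[0] unused).

    Sketch of hump-finding in a kmer histogram: skip the monotonically
    falling error tail, then report the occurrence count at the hump's peak.
    """
    i = 1
    while i + 1 < len(hist) and hist[i + 1] < hist[i]:
        i += 1  # still descending through the error-kmer tail
    # past the trough; the peak of the remaining hump is the guessed coverage
    return max(range(i, len(hist)), key=lambda j: hist[j])
```

On a clean histogram (error spike near count 1, single hump at the true coverage) this returns the hump position; a diploid's split hump or a flat histogram would mislead it, consistent with the underestimation discussed above.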

JohnUrban commented 8 years ago

Thanks for the explanations. I no longer have that data to check the sanity of the kmer histograms. That issue has largely been addressed already (thanks again); I was just thinking that there was at least one time on record where the guessed coverage was overestimated (and the genome size underestimated). Nonetheless, as you point out, underestimating coverage seems more likely for diploid genomes. I would imagine the underestimate gets worse as the heterozygosity rate of a diploid genome increases - is that a fair assumption?

eugenio commented 8 years ago

What are the minimal hardware requirements to run canu? What is the minimum amount of RAM for a genome with an estimated size of 4.9 Mbp?

brianwalenz commented 8 years ago

eugenio, please don't hijack issues, your question is totally unrelated.

The smallest machine we have anymore is 16 GB. It will probably run OK on 8 GB; 4 GB might be tight. CPU time depends on coverage. I test against E. coli and Drosophila with 4-12 CPUs.