marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Falcon vs Canu correction #1287

Closed: YingHu-jlau closed this issue 5 years ago

YingHu-jlau commented 5 years ago

Hi,

This is not an issue but a question about Canu/FALCON correction.

I am working on a plant genome assembly with PacBio data. I am curious about the difference between Canu correction and FALCON correction. Do you have any assemblies showing that FALCON correction gives better corrected reads for the subsequent trimming and assembly steps?

Thanks,

skoren commented 5 years ago

Falcon correction tends to smash together haplotypes and repeats more aggressively than Canu does. In some cases we've seen (some plants) this gives a larger N50, but you'd be collapsing haplotypes rather than separating them. In other cases (some fish) it gives a worse N50, probably because the Falcon correction made the repeats too similar to each other.

The other downside to the Falcon correction is that it will only take the longest reads, as set by your specified coverage or length cutoff. Canu will select reads to make sure it still has representation of short sequences (like plasmids) when doing the correction, and it will select the "longest" reads based on their predicted corrected length, not their input length.
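To illustrate the last point, here is a minimal Python sketch of how ranking by raw length versus predicted corrected length can pick different reads. It is not Canu's or Falcon's actual code: the `Read` fields, the `select_longest` helper, the toy read lengths, and the greedy coverage budget are all assumptions made for illustration, and it leaves out the short-sequence (plasmid) handling entirely.

```python
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    raw_length: int        # length of the uncorrected input read
    predicted_length: int  # hypothetical estimate of the length after correction


def select_longest(reads, genome_size, target_coverage, key):
    """Greedily keep the longest reads (ranked by `key`) until the kept
    bases reach target_coverage * genome_size."""
    budget = genome_size * target_coverage
    selected, total = [], 0
    for read in sorted(reads, key=key, reverse=True):
        if total >= budget:
            break
        selected.append(read)
        total += key(read)
    return selected


# Toy reads (made up): read "a" is the longest on input but is predicted
# to shrink a lot during correction.
reads = [
    Read("a", raw_length=20000, predicted_length=9000),
    Read("b", raw_length=15000, predicted_length=14000),
    Read("c", raw_length=12000, predicted_length=11500),
    Read("d", raw_length=8000, predicted_length=7800),
]

by_raw = select_longest(reads, genome_size=10000, target_coverage=3,
                        key=lambda r: r.raw_length)
by_predicted = select_longest(reads, genome_size=10000, target_coverage=3,
                              key=lambda r: r.predicted_length)

print([r.name for r in by_raw])        # ranked by input length
print([r.name for r in by_predicted])  # ranked by predicted corrected length
```

In this toy case, ranking by input length fills the coverage budget with reads "a" and "b", while ranking by predicted corrected length puts "b" and "c" first and needs "a" as well to reach the same budget.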

YingHu-jlau commented 5 years ago

Hi, skoren,

Thanks for your quick response. You said Falcon correction is more aggressive in merging haplotypes and repeats. Does that mean Falcon may introduce some misassemblies compared to Canu?

Another question: if I am not interested in keeping plasmids but want a longer assembly from Canu, can I set corOutCoverage to 60 (higher than my genome coverage) and minReadLength=5000 to keep more of the long reads instead of only the longest 40X of data? Would I get a larger N50 this way?

Thanks,

skoren commented 5 years ago

Regular Falcon may collapse more, yes, but falcon-unzip tries to fix some of this by going back to the raw reads and un-collapsing parts of the assembly.

Adding more coverage is unlikely to help the assembly unless you have a very heterozygous genome (see FAQ). Canu already selects the longest reads based on their estimated corrected length, so all you would do by setting minReadLength=5000 is remove some shorter corrected reads. It's unlikely to change the assembly much, since most of those short reads will be contained in larger ones anyway.
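As a rough illustration of the arithmetic, the Python sketch below selects the longest reads up to 40X and up to 60X from a made-up read set and reports what share of the selected bases sits in reads shorter than 5000 bp. The read lengths, genome size, and the `longest_to_coverage` helper are assumptions for this example only; Canu's real selection works on estimated corrected lengths and also keeps some short sequences, which this sketch ignores.

```python
def longest_to_coverage(lengths, genome_size, coverage):
    """Keep the longest reads until their summed length reaches
    coverage * genome_size."""
    budget = genome_size * coverage
    kept, total = [], 0
    for length in sorted(lengths, reverse=True):
        if total >= budget:
            break
        kept.append(length)
        total += length
    return kept


def share_below(kept, min_length):
    """Fraction of the kept bases that sit in reads shorter than min_length."""
    return sum(l for l in kept if l < min_length) / sum(kept)


# Toy data (made up): a 100 kb "genome" covered by reads of 2 kb to 30 kb.
genome_size = 100_000
lengths = ([30_000] * 50 + [20_000] * 100 + [10_000] * 200
           + [4_000] * 300 + [2_000] * 400)

total_coverage = sum(lengths) / genome_size
kept_40 = longest_to_coverage(lengths, genome_size, coverage=40)
kept_60 = longest_to_coverage(lengths, genome_size, coverage=60)

print(f"input coverage: {total_coverage:.0f}X")
print(f"40X selection: shortest read {min(kept_40)} bp, "
      f"share below 5000 bp = {share_below(kept_40, 5_000):.3f}")
print(f"60X selection: shortest read {min(kept_60)} bp, "
      f"share below 5000 bp = {share_below(kept_60, 5_000):.3f}")
```

With these toy numbers, the 40X selection already contains no reads under 5000 bp, so a 5000 bp minimum removes nothing from it. Raising the target to 60X only pulls in shorter reads, and those are the ones the length filter would discard again.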