Number of Contigs and Canu Version

ekokrek commented 5 years ago

Hello,

I am trying to learn Canu by reproducing published studies. Currently, I focus only on Nanopore data, with no polishing. The publications I went through investigate bacteria (F. columnare, A. hydrophila, K. oxytoca, E. coli) and a eukaryote (C. elegans). I work on a single Linux workstation with 63 GB of RAM.

Generally they use earlier versions of Canu, such as 1.4 and 1.5. I want to reproduce their results and then move to the latest version of Canu, 1.8. I noticed that, although the maximum contig size and NG50/LG50 don't change much with the latest version, the number of contigs, and hence the assembly size, increases a lot and sometimes doubles (for C. elegans: 106 to 162; for A. hydrophila: 22 to 55) when I go from v1.4/v1.5 to v1.8.

So I wonder how to interpret these results and how to identify the source of this difference.

Also, when I assemble from multiple read files from the same organism, I again see huge differences in many metrics such as contig size, assembly size, and NG50/LG50. Maybe this last one is a general assembly question, but I wanted to ask here.

Thank you in advance

skoren commented 5 years ago

In general, 1.8 has some adjusted parameters to speed up assembly on recent nanopore data. If you're using old datasets without re-calling, they may not be very representative of current data. Canu 1.8 also tries to keep more sequences in the assembly to rescue plasmids/short sequences that would have been lost in 1.4 and 1.5. That should increase the contig size and assembly size slightly but shouldn't impact the NG50, which is consistent with what you are seeing.
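
To see why extra short contigs can raise the contig count without moving NG50, here is a minimal Python sketch. The contig lengths and the 5 Mbp genome size are made-up values for illustration, and `assembly_stats` is a hypothetical helper, not part of Canu.

```python
def assembly_stats(contig_lengths, genome_size):
    """Return contig count, total assembly size, NG50 and LG50.

    NG50 is the length of the contig at which the cumulative length of
    contigs, sorted longest first, reaches half of the genome size (not
    half of the assembly size). LG50 is how many contigs that takes.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half = genome_size / 2
    running = 0
    ng50, lg50 = None, None  # stay None if the assembly covers < 50% of the genome
    for rank, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            ng50, lg50 = length, rank
            break
    return {"contigs": len(lengths), "size": sum(lengths), "NG50": ng50, "LG50": lg50}


# Hypothetical contig lengths (bp) for a 5 Mbp genome; not real Canu output.
older = [2_000_000, 1_500_000, 900_000, 400_000]
# The same large contigs plus a few short rescued sequences (e.g. plasmids).
newer = older + [60_000, 45_000, 20_000, 15_000]

print("older:", assembly_stats(older, 5_000_000))
print("newer:", assembly_stats(newer, 5_000_000))
```

In this toy case the contig count doubles from 4 to 8 and the assembly size grows by about 3%, while NG50 and LG50 are unchanged because the short contigs sort after the point where half the genome size is reached.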

The assemblies where the genome size doubles are suspicious; I'd guess the error rate tolerance is insufficient for the older data. If you can re-call it with a more recent basecaller and see similar effects, or can share a link to the raw fast5 data, we could take a look as well.

I'm not sure what you mean by "multiple read files from the same organism". Do you mean different strains of the same organism, or exactly the same DNA prepped multiple times in multiple runs?

ekokrek commented 5 years ago

Pardon my English; what doubles is the number of contigs, not the assembly size. The assembly size increases but does not double. I usually skip the basecalling step and start from fastq files, but a difference between new and old data, or between new and old basecallers, makes sense.

For the last part, by "multiple read files from the same organism" I meant the same organism, say "Aeromonas hydrophila", but different DNA from different runs. Would a less contiguous assembly be expected in that case? Do strains differ that much? Sorry, maybe these are not exactly Canu-related issues.

Thank you very much!

skoren commented 5 years ago

So just to clarify, the c.elegans: 106 to 162; for a.hydrophila: 22 to 55 increase is in contig count not assembly size? That isn't so surprising.

As for mixing strains, it depends on the organism, but essentially yes, mixing strains is very bad for assembly. The issue is not that they are so different but the strains are very similar. In fact, a mix of unrelated genomes will probably assemble fine. Strain resolution is one of the major complications in metagenomic assembly, since you have large stretches that look identical with a bunch of variants connected to them. To an assembler, this looks like a repeat and causes assembly breaks. In those cases you could try the metagenomic parameters suggested in the FAQ, which will likely improve your assembly a bit, but I wouldn't expect a mix of strains to assemble as well as individual clonal samples.
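
A toy Python sketch of the point about similar strains looking like repeats. It is a deliberate simplification: it counts forks in a k-mer graph, whereas Canu works on read overlaps, and the sequences are random strings rather than real genomes. All function names here are hypothetical.

```python
import random


def count_branches(sequences, k=21):
    """Count (k-1)-mer contexts that are followed by more than one base.

    Each such context is a fork in a k-mer graph built from all sequences,
    which is where a graph-based assembler would have to stop or choose.
    """
    next_bases = {}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            context, base = seq[i:i + k - 1], seq[i + k - 1]
            next_bases.setdefault(context, set()).add(base)
    return sum(1 for bases in next_bases.values() if len(bases) > 1)


def random_genome(length, rng):
    """Return a random DNA string (stand-in for a genome)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def with_snp(seq, pos):
    """Return seq with the base at pos swapped for a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return seq[:pos] + swap[seq[pos]] + seq[pos + 1:]


rng = random.Random(0)
strain_a = random_genome(500, rng)
strain_b = with_snp(strain_a, 250)    # near-identical second strain
unrelated = random_genome(500, rng)   # independent genome

print("two similar strains:", count_branches([strain_a, strain_b]))
print("unrelated genomes:  ", count_branches([strain_a, unrelated]))
```

The two near-identical strains share long identical stretches, so the single variant creates a fork in the graph. Two unrelated sequences are very unlikely to share any 20-base context, so they stay on separate paths and can be assembled independently.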

ekokrek commented 5 years ago

So just to clarify, the c.elegans: 106 to 162; for a.hydrophila: 22 to 55 increase is in contig count not assembly size?

Yes, exactly.

The issue is not that they are so different but the strains are very similar.

Got that. Thank you very much.

marbl / canu

Number of Contigs and Canu Version #1404