marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

CANU output interpretation #748

Closed phbrito closed 6 years ago

phbrito commented 6 years ago

Hi, I am using CANU 1.6 on Linux to assemble a bacterial genome from 1D Nanopore reads. The program runs fine, but I have many questions about interpreting the outputs and about strategies for producing a good assembly.

I followed the suggestion to run CANU one step at a time, doing several rounds of the correction step. Across the correction rounds, the main change was in the distribution of kMers. However, after 6 rounds I was still not able to eliminate the low-frequency kMers, and I wonder how they influence the final assembly.

CANU's final result indicates 15 edges and 8 nodes, one of which is a closed/circular chromosome. I know this genome has plasmids; I just do not know how many yet. Coverage does not help here, as I do not understand how to interpret the values in the headers of the fasta file: covStat ranges from 31806.45 to 0.00.

Unicycler, run on the same data plus Illumina reads, reports 4 contigs, all circular, but with a slightly smaller chromosome: 2771129 bp vs 2773519 bp with CANU. Coverages in Unicycler are 1.00x (chromosome), 0.79x, 1.10x and 4.81x.

My strategy to run CANU was as follows:

5 rounds of -correct:
genomesize=2.9m
corOutcoverage=500 → due to the presence of plasmids
corMinCoverage=0
CorMhapsensitivity=high

1 round of -correct:
corOutcoverage=500
corMinCoverage=4 → I was trying to eliminate low-frequency kMers, but didn't work
CorMhapsensitivity=high

-trim:
correctederrorrate=0.05 → because I had a large genome coverage ~160x

-assemble:
correctederrorrate=0.05

I attached the final output (canu_run12.report.txt) and would be very happy if you could give me some hints on how to produce a better assembly using CANU. I am planning to use Pilon to correct the CANU assembly using Illumina reads, but I want to make sure that I have done the best I can with CANU before moving on to downstream analysis.

Thanks! Patrícia

skoren commented 6 years ago

The multiple rounds of correction are only necessary for really old (R7) 1D data. For recent data you can just run Canu with defaults. Canu won't trim/circularize the chromosome or plasmids, so the chromosome contig probably has some self-similarity that makes it longer than a trimmed version. In your case, I'd suggest running the latest tip from the repository, as it has some new code to capture short plasmids in the sequencing data that might otherwise be missed, and it does a better job of flagging circular elements.
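To put the size difference in perspective, here is a small sketch using only the two chromosome lengths quoted in the question. It shows how much extra sequence the Canu contig carries relative to the Unicycler one. Whether all of that extra sequence is untrimmed overlap is an assumption and would need to be checked by aligning the contig against itself.

```python
# Chromosome lengths as reported in the question.
canu_len = 2773519       # bp, Canu contig (not trimmed/circularized)
unicycler_len = 2771129  # bp, Unicycler circular contig

# Extra sequence in the Canu contig relative to the Unicycler one.
extra = canu_len - unicycler_len
print(extra)  # 2390

# The same difference as a percentage of the Unicycler length.
print(f"{100 * extra / unicycler_len:.3f}%")  # 0.086%
```

A difference of a few kilobases out of roughly 2.77 Mb is small and is at least compatible with leftover overlap at the contig ends rather than missing or extra genomic content.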

The coverage stat is a log-odds ratio (see: http://canu.readthedocs.io/en/latest/tutorial.html#outputs). You can get coverage estimates per contig using the tgStoreDump command (http://canu.readthedocs.io/en/latest/commands/tgStoreDump.html); run it without options to get usage.
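As a rough illustration of pulling values such as covStat out of the contig headers, here is a minimal sketch that splits a FASTA header into its name and its key=value fields. The example header is hypothetical: only covStat is mentioned in this thread, and the other field names (len, reads, suggestCircular) and the read count are assumptions about the header layout, so check them against your own fasta file.

```python
def parse_fasta_header(header: str) -> dict:
    """Split a FASTA header into the contig name and its key=value fields."""
    name, *fields = header.lstrip(">").split()
    info = {"name": name}
    for field in fields:
        key, sep, value = field.partition("=")
        if sep:  # keep only tokens that really look like key=value
            info[key] = value
    return info


# Hypothetical header; field names other than covStat are assumptions.
example = ">tig00000001 len=2773519 reads=9000 covStat=31806.45 suggestCircular=yes"
info = parse_fasta_header(example)
print(info["name"], int(info["len"]), float(info["covStat"]), info["suggestCircular"])
```

Values are kept as strings and converted only where needed, so an unexpected field in a header does not break the parse. Note that covStat is not a depth of coverage, so the per-contig estimates from tgStoreDump are still the numbers to compare against the Unicycler values.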

skoren commented 6 years ago

Closing, inactive.