marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Trying to assemble small, segmented genome; number of reads per contig very low #2230

Closed · clmattson closed this 1 year ago

clmattson commented 1 year ago

Hello,

I'm working on assembling small, segmented viral genomes with Canu. The total genome is ~13 kb, and the three segments range from ~2.5 kb to ~7 kb. I have a multiplexed ONT dataset where the samples (barcodes, each corresponding to a different viral isolate) range from ~500 to ~20,000 reads.

Running Canu with the command below, about one third of the samples yield accurate-looking assemblies (3 contigs of appropriate length). The header for each contig looks something like this: `>tig00000001 len=6317 reads=70 class=contig suggestRepeat=no suggestBubble=no suggestCircular=no trim=0-6317`, with never more than ~100 or so reads. I believe this means that only 70 reads were merged together to form this contig. However, if I map the reads (from the same original .fasta files I input into Canu) back to their respective assemblies using minimap2, I find average depths of ~200-1000x depending on the assembly and contig, with thousands of reads mapping to any given contig. I'm having trouble figuring out the issue. Why are so few reads being incorporated into each contig? And any ideas what flag I could tweak to help?
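For reference, a minimal version of that depth check might look like the sketch below (file names are placeholders for the assembly and read set):

```
# Map the original reads back to the Canu assembly (ONT preset),
# then sort and index the alignments.
minimap2 -ax map-ont assembly.contigs.fasta reads.unmapped.fasta \
  | samtools sort -o mapped.bam
samtools index mapped.bam

# Average depth across all positions (-a includes zero-coverage bases).
samtools depth -a mapped.bam | awk '{sum += $3} END {print sum / NR}'

# Number of reads mapped to each contig.
samtools idxstats mapped.bam
```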

Here's the command I used (it runs inside a loop over barcodes). I set the minimum read length and minimum overlap length to be small due to the small genome size, thinking that would help, but it hasn't appeared to make much of a difference.

```
canu -p ${bn}_small_ovl -d ${barcode}/${bn}_canu_assembly_small_ovl \
  genomeSize=5k \
  minReadLength=300 minOverlapLength=60 \
  -nanopore ${barcode}/${bn}.unmapped.fasta
```

I will add that, to my surprise, slightly increasing rawErrorRate to 0.6 and correctedErrorRate to 0.2 actually caused even fewer reads to be incorporated.

version: canu 2.2

machine: Linux

Thanks so much in advance!!

Best, Courtney

skoren commented 1 year ago

By default, Canu subsamples your coverage both before correction and after it. This is because string graph assemblers all have a sweet spot in coverage: higher coverage introduces more noise and hurts assembly continuity. I wouldn't increase those defaults, though, because 70 reads should be more than sufficient to get an accurate consensus. Each read is itself a consensus of multiple reads (from correction), so really more than 70 reads are going into this sequence. If you wanted the best possible assembly, you could run a Nanopore-aware polishing tool like Medaka.
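For context, the subsampling described above is controlled by Canu's coverage options, and a Medaka polish is a one-liner. The sketch below shows both; the defaults and the model name are assumptions based on Canu 2.2 and recent Medaka releases, so check them against your installed versions:

```
# Coverage knobs (shown with their usual defaults; values are illustrative):
#   maxInputCoverage - raw reads are randomly subsampled down to this coverage
#   corOutCoverage   - coverage of corrected reads carried into assembly
canu -p asm -d asm_dir genomeSize=13k \
  maxInputCoverage=200 corOutCoverage=40 \
  -nanopore reads.fasta

# Polish the Canu contigs with Medaka; the -m model must match the
# basecaller used to produce the reads (model name here is an example).
medaka_consensus -i reads.fasta -d asm_dir/asm.contigs.fasta \
  -o medaka_out -m r941_min_sup_g507 -t 8
```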

clmattson commented 1 year ago

Oh, OK, I did not realize that was not a simple count of the raw reads. Thanks for the swift reply; I'll try it out! Courtney

skoren commented 1 year ago

Idle