Assembly more fragmented than expected

dmacguigan commented 3 months ago

First off, thank you for writing such an easy-to-use pipeline!

I'm using a combination of ONT reads (R10.4.1 flow cell, Q20 chemistry, error-corrected with HERRO) and HiC data for a de novo genome assembly. Our ONT coverage is not amazing (~30x) and we don't have a ton of ultra-long reads (>100 kb). But the read distribution is quite good (read N50 ~20 kb). Here's an example of the ONT read distribution from one flow cell run.

I've used several different assembly methods with this dataset. Flye and hifiasm produce fairly contiguous assemblies, with contig N50s ranging from ~5-21 Mb.

However, when I used the same dataset with verkko, my contiguity was much lower, with N50s ranging from 0.4-1.1 Mb. BUSCO scores were also dramatically worse. See the table below.
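For reference, the contig N50 values quoted here follow the usual definition: the contig length at which the cumulative sum of contigs, sorted longest first, reaches half of the total assembly length. A minimal sketch, assuming the contig lengths have already been extracted from the assembly FASTA (the function name `n50` and the example lengths are illustrative, not taken from this dataset):

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)   # longest contigs first
    half_total = sum(ordered) / 2             # half of the assembly size
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # First contig at which the cumulative length reaches half
            return length
    return 0


# Illustrative lengths only (not from the assemblies in this issue)
example_lengths = [10, 8, 6, 4, 2]
print(n50(example_lengths))
```

Because N50 depends on the total assembly size, a combined diploid assembly and a single-haplotype assembly are not directly comparable by this number alone.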

For verkko, I supplied the HERRO-corrected ONT reads using --hifi and the uncorrected ONT reads using the --nano option.
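The invocation described above could look roughly like the sketch below, which only assembles the argument list and prints it. The `-d`, `--hifi`, `--nano`, `--hic1` and `--hic2` options are verkko's documented flags; the output directory, all file names, and the helper `build_verkko_command` are hypothetical placeholders, not the actual paths used for this run.

```python
import shlex


def build_verkko_command(out_dir, hifi_reads, nano_reads, hic1=None, hic2=None):
    """Build a verkko argument list; Hi-C reads are added only if both mates are given."""
    cmd = ["verkko", "-d", out_dir, "--hifi", *hifi_reads, "--nano", *nano_reads]
    if hic1 and hic2:
        cmd += ["--hic1", *hic1, "--hic2", *hic2]
    return cmd


# Hypothetical file names: HERRO-corrected reads go to --hifi,
# uncorrected ONT reads go to --nano, as described in the post.
cmd = build_verkko_command(
    "asm",
    ["herro_corrected.fastq.gz"],
    ["ont_raw.fastq.gz"],
    hic1=["hic_R1.fastq.gz"],
    hic2=["hic_R2.fastq.gz"],
)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` on a machine where verkko is installed.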

Do you have any suggestions on how we might improve our assembly contiguity with verkko? Happy to provide more details about our datasets or analyses.

Thank you for your time, Dan

skoren commented 3 months ago

First, a quick note: you can't directly compare primary assemblies from Flye or hifiasm with hifiasm hap1/hap2 or verkko assemblies, since the former introduce switch errors to increase continuity while the latter do not.

I suspect the default parameters didn't properly phase much of the assembly, given that the hap1/hap2 assemblies are shorter than expected. The combined assembly is comparable in completeness to the other assemblies and is about twice the size, and its largest contigs are on par with the other haplotype assemblies. Do you know the divergence between your haplotypes? Can you share the noseq.gfa and the colors file for your assembly here?
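Before sharing, a quick look at the graph size can already hint at fragmentation. The sketch below counts segments (`S` lines) and links (`L` lines) in a GFA and sums segment lengths. It assumes the sequence-free graph stores `*` in the sequence column with the length in an `LN:i:` tag, which is standard GFA but should be checked against the actual file; the function name `gfa_summary` and the example records are made up for illustration.

```python
def gfa_summary(lines):
    """Count segments and links in GFA text and sum the segment lengths."""
    segments = links = total_length = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments += 1
            length = None
            # Prefer the LN:i: tag (assumed present in a sequence-free GFA)
            for tag in fields[3:]:
                if tag.startswith("LN:i:"):
                    length = int(tag[len("LN:i:"):])
            # Fall back to the sequence itself when it is not '*'
            if length is None and len(fields) > 2 and fields[2] != "*":
                length = len(fields[2])
            total_length += length or 0
        elif fields[0] == "L":
            links += 1
    return {"segments": segments, "links": links, "total_length": total_length}


# Made-up records, only to show the expected input shape
example = [
    "H\tVN:Z:1.0",
    "S\tnode1\t*\tLN:i:1000",
    "S\tnode2\tACGT",
    "L\tnode1\t+\tnode2\t-\t0M",
]
summary = gfa_summary(example)
print(summary)
```

For a real file, pass an open file handle, for example `gfa_summary(open(path))` with `path` pointing at the noseq.gfa.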

dmacguigan commented 3 months ago

Good point about comparing phased vs unphased assemblies.

I don't know exactly how divergent the haplotypes are for this genome. The data come from a wild-caught individual and the species has a very large range, so we don't expect it to be inbred.

Thank you for taking a closer look. The GFA and colors table are available here. https://drive.google.com/drive/folders/1PFw99xPmPj7wY7WAJdaWpUpBGlARFsKD?usp=sharing

Also, we are expecting 2N=48.

skoren commented 3 months ago

The graph looks very fragmented, with many more nodes than we normally see. The coverage is pretty low at 25-30x total, so only about 13x per haplotype. I think verkko is just not designed to work with such low-coverage data, especially since there aren't many long reads to resolve the haplotypes. The fragmented assembly means that the Hi-C phasing doesn't work well, leading to the poor continuity. For verkko, I think you'd want more long reads and/or higher coverage.
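The per-haplotype figure is just the total coverage divided by the ploidy. A small sketch of that arithmetic, assuming a diploid genome and reads split evenly between the two haplotypes (the helper name `per_haplotype_coverage` is illustrative):

```python
def per_haplotype_coverage(total_coverage, ploidy=2):
    """Expected coverage per haplotype, assuming reads split evenly across haplotypes."""
    return total_coverage / ploidy


# Total coverage range quoted in this thread: 25-30x
low = per_haplotype_coverage(25)
high = per_haplotype_coverage(30)
print(f"{low:.1f}x to {high:.1f}x per haplotype")
```

So the 25-30x total quoted here corresponds to roughly 12.5-15x on each haplotype.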

skoren commented 2 months ago

Idle. We recommend higher than 25-30x total coverage for assembly.

marbl / verkko

Assembly more fragmented than expected #261