Better assembly with partial reads than full dataset

stachyris commented 4 months ago

Hi there!

We recently assembled a bird genome from about 80 GB of raw PromethION data (read N50 of ~11 kb). With the full dataset (quality trimmed), we got a pretty good assembly with an N50 of 13 Mb and ~1500 contigs. But when the same dataset was assembled on a different machine with --asm-coverage 40 (due to a RAM bottleneck on that machine), Flye (v2.9.3) produced a better assembly with an N50 of 19 Mb and ~1200 contigs.

I looked through the literature but could not figure out why. I understand that --asm-coverage uses only part of the dataset for the initial steps and the complete dataset for the later steps, but the jump from 13 Mb to 19 Mb still seems like a substantial improvement to us. We wanted to ask whether this has been observed before and what the reason might be.

Thank you, Vinay
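
For readers following along: the coverage option discussed above restricts the initial assembly step to a subset of the longest reads, up to a target coverage of the estimated genome size. The sketch below only illustrates that selection idea on a toy list of read lengths. It is not Flye's actual code, and the function name `select_longest_reads` and its parameters are hypothetical.

```python
def select_longest_reads(read_lengths, genome_size, target_coverage):
    """Pick the longest reads until their total length reaches
    target_coverage * genome_size (illustrative only, not Flye's code)."""
    target_bases = genome_size * target_coverage
    selected = []
    total = 0
    # Walk the reads from longest to shortest.
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        selected.append(length)
        total += length
    return selected


# Toy example: a 10 kb "genome" and a 4x coverage target (40 kb of reads).
reads = [12_000, 3_000, 25_000, 8_000, 1_500, 18_000, 5_000]
subset = select_longest_reads(reads, genome_size=10_000, target_coverage=4)
print(subset)       # lengths of the reads kept for the initial step
print(min(subset))  # shortest read that still made it into the subset
```

One consequence of this kind of selection is that the shortest reads are left out of the initial step, so the subset has a higher read N50 than the full dataset.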

mikolmogorov commented 4 months ago

Hi Vinay,

Although this seems surprising, there could be several reasons for it. Longer reads often also have better quality, and shorter reads are more likely to contain contamination, if there is any. Are the sizes of the two assemblies any different? N50 is biased by assembly size, so NG50 would be a better metric to use. Finally, contiguity is not the only metric of assembly quality.
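
To make the N50 versus NG50 point concrete, here is a minimal sketch. N50 is computed against half of the total assembly length, while NG50 is computed against half of an estimated genome size, so NG50 does not change just because one assembly is smaller or larger than another. The function name `nx50` and the toy contig lengths are made up for illustration.

```python
def nx50(contig_lengths, reference_size=None):
    """Return N50 when reference_size is None, otherwise NG50.

    N50 uses half of the summed contig lengths as the threshold;
    NG50 uses half of the supplied (estimated) genome size.
    """
    if reference_size is None:
        reference_size = sum(contig_lengths)
    half = reference_size / 2
    running = 0
    # Accumulate contigs from longest to shortest until half is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    # The contigs cover less than half of the reference size.
    return 0


contigs = [50, 40, 30, 20, 10]            # toy contig lengths, total 150
print(nx50(contigs))                      # N50
print(nx50(contigs, reference_size=200))  # NG50 for an assumed genome size of 200
```

In this toy case the N50 is 40 but the NG50 is 30, because the contigs add up to less than the assumed genome size.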

stachyris commented 4 months ago

Hi,

Thank you for the reply. Actually, no, the assembly size is the same for both: 1.056 Gb (the Meryl + GenomeScope estimate is 1.15 Gb). Got it. We will do some more inspections once we finish polishing and the other steps in the pipeline.

Thank you. Best,

mikolmogorov / Flye

Better assembly with partial reads than full dataset #672