rrwick / Unicycler

hybrid assembly pipeline for bacterial genomes
GNU General Public License v3.0

Unicycler result inconsistent with high coverage #112

Open marinococcus opened 6 years ago

marinococcus commented 6 years ago

Dear Unicycler team,

I chose Unicycler for de novo assembly of bacterial genomes from ONT reads because, of the several assemblers I tried, it gave me the best results. However, I have started to observe something strange that I could not find any information about: I think Unicycler can fail during its computation at high coverage, perhaps because of a memory issue, yet the log file still says "success", so the failure goes unnoticed.

More details on the history: we recently sequenced a genome and obtained 400X of data. The expected genome size was nearly 9 Mb, yet the Unicycler assembly was only 4 Mb in 132 contigs! The log file shows only "success", "complete", etc. If I subsample the FASTQ to 200X, the assembly is 8 Mb in 50 contigs; at 100X it is 9 Mb in 11 contigs.

I therefore tested Unicycler at several sequencing depths (200X, 100X, 50X, via random subsampling) for two of our sequenced genomes. I expected the "assembly accuracy vs. sequencing depth" curves to look the same for both genomes, but instead each genome gave a curve on a different scale, which surprised me a little. See here (BUSCO is an assembly QC tool that searches for expected orthologous genes): accuracy_according_to_depth
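For the record, the sweep I ran is along these lines (a minimal sketch, not my exact commands: long-read-only mode, and the subsampled FASTQ names, thread count and BUSCO lineage/CLI are placeholders):

```bash
# assumes reads_50x.fastq etc. were produced by random subsampling beforehand
for depth in 50 100 200; do
    unicycler -l reads_${depth}x.fastq -o unicycler_${depth}x -t 8
    busco -i unicycler_${depth}x/assembly.fasta -l bacteria_odb10 \
          -m genome -o busco_${depth}x
done
```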

Hence the ultimate test: I ran Unicycler again on the same 400X data. Again everything is "success" and "complete" in the log file, but this time the output assembly is 1.8 Mb in 49 contigs, which is not the same result at all.

It clearly appears that something is failing somewhere, but I cannot see any clue of it in the log file. Do you have any idea what is happening? Why is it not reported in the log file? Or did I miss an obvious failure message somewhere?

Here is the log file: unicycler.log

Thank you very much for your attention.

stevenjdunn commented 6 years ago

Have you tried manually assembling with minimap/miniasm to check the output and log? How are you subsampling your reads?
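For reference, a manual minimap2/miniasm run is roughly this (a minimal sketch; file names are placeholders):

```bash
# all-vs-all overlap of the ONT reads
minimap2 -x ava-ont reads.fastq reads.fastq | gzip -1 > overlaps.paf.gz
# string-graph layout of the overlaps
miniasm -f reads.fastq overlaps.paf.gz > assembly.gfa
# pull the unitig sequences out of the GFA for inspection/QC
awk '/^S/ {print ">"$2"\n"$3}' assembly.gfa > assembly.fasta
```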

If you only have long reads, Canu occasionally handles high-coverage assemblies better (at least in my experience). Or you can subsample your reads based on a number of metrics using rrwick's Filtlong: when we get the odd barcode with higher-than-anticipated coverage, I usually run it through Filtlong with the --target_bases flag to normalise for depth. Works a treat!
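For the Filtlong route, the command is along these lines (a sketch; the 900 Mbp target corresponds to ~100x of a 9 Mb genome, so adjust for your organism):

```bash
# keep the best ~900 Mbp of reads, i.e. ~100x for a 9 Mb genome
filtlong --target_bases 900000000 reads.fastq.gz | gzip > reads_100x.fastq.gz
```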

angelovangel commented 5 years ago

I am observing similar results: with excessive coverage of Nanopore-only reads, the assemblies just crash. I am sampling (seqtk sample) from a set of ~700k reads and running an assembly on each sample. From 40x to 160x coverage everything looks good; above that, the assemblies are a mess. Apart from a memory issue, could it be that the higher read count increases the chance of (very) long chimeric reads, which cause the problems?
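For what it's worth, the subsampling is along these lines (a sketch; genome size and target depth are illustrative, and the seed is fixed so a run can be repeated):

```bash
GENOME=5000000    # assumed genome size in bp (illustrative)
DEPTH=100         # target coverage
# total bases in the read set (column 2 of `seqtk comp` is the read length)
TOTAL=$(seqtk comp reads.fastq | awk '{sum += $2} END {print sum}')
FRAC=$(awk -v g="$GENOME" -v d="$DEPTH" -v t="$TOTAL" 'BEGIN {print g * d / t}')
# -s fixes the random seed so the same subsample can be regenerated
seqtk sample -s42 reads.fastq "$FRAC" > reads_${DEPTH}x.fastq
```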