rrwick / Unicycler

hybrid assembly pipeline for bacterial genomes
GNU General Public License v3.0

weird choice of "best assembly" #285

Open jflot opened 2 years ago

jflot commented 2 years ago
(screenshot attached in the original issue)

For this assembly, although the score increases with each increase in k, the choice made seems particularly poor. Maybe the way the score is calculated is not suited to this type of situation?

rrwick commented 2 years ago

Unicycler chooses the 'best' assembly using contig count and dead end count, so I see why it made the choice it did. I don't, however, understand what's going on with SPAdes! I.e. why do k-mers 63+ result in such small assemblies?
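To make the selection logic concrete, here is a minimal sketch of a score that rewards low contig and dead-end counts. The exact formula is not stated in this thread; the form `1 / (contigs * (dead_ends + 1)^2)` is an assumption, and the candidate numbers are hypothetical rather than taken from the screenshot. It shows how a tiny but tidy graph can win, because assembly size is not part of the score.

```python
def assembly_score(contigs: int, dead_ends: int) -> float:
    """Assumed score (higher is better): fewer contigs and fewer dead ends win."""
    if contigs == 0:
        return 0.0  # an empty graph should never be chosen
    return 1.0 / (contigs * (dead_ends + 1) ** 2)


# Hypothetical per-k results; total_bp is listed but deliberately unused by the score.
candidates = {
    27: {"contigs": 400, "dead_ends": 10, "total_bp": 5_000_000},
    63: {"contigs": 50, "dead_ends": 2, "total_bp": 600_000},
    71: {"contigs": 20, "dead_ends": 0, "total_bp": 300_000},
}

best_k = max(
    candidates,
    key=lambda k: assembly_score(candidates[k]["contigs"], candidates[k]["dead_ends"]),
)
print(best_k)  # the smallest assembly wins because it has the fewest contigs and dead ends
```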

If this is with the current version of Unicycler (v0.5.0), the raw SPAdes graphs should be in the output (prefixed with 001). They might shed some light on this. My hunch is that there is something weird/wrong with this read set - contamination maybe? But I don't know!
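One way to inspect those graphs is to summarise each GFA file by segment count, total length and dead ends. The sketch below is an illustration, not Unicycler's own code: `summarise_gfa` is a hypothetical helper, it reads only `S` and `L` lines, and it counts a segment end as dead when no link touches it, which may differ in detail from Unicycler's definition.

```python
from collections import namedtuple

GraphStats = namedtuple("GraphStats", "segments total_bp dead_ends")


def summarise_gfa(lines):
    """Summarise a GFA graph given an iterable of its lines (hypothetical helper)."""
    seg_lengths = {}
    connected_ends = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name, seq = fields[1], fields[2]
            length = len(seq)
            if seq == "*":  # sequence omitted: fall back to the LN tag if present
                length = 0
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        length = int(tag[5:])
            seg_lengths[name] = length
        elif fields[0] == "L" and len(fields) >= 5:
            from_name, from_orient, to_name, to_orient = fields[1:5]
            # A link leaves the end of a '+' segment (start of a '-' one)
            # and enters the start of a '+' segment (end of a '-' one).
            connected_ends.add((from_name, "end" if from_orient == "+" else "start"))
            connected_ends.add((to_name, "start" if to_orient == "+" else "end"))
    all_ends = {(name, side) for name in seg_lengths for side in ("start", "end")}
    return GraphStats(
        segments=len(seg_lengths),
        total_bp=sum(seg_lengths.values()),
        dead_ends=len(all_ends - connected_ends),
    )


# Tiny made-up graph: segments 1 and 2 form a circle, segment 3 is isolated.
example = [
    "S\t1\tACGTACGT",
    "S\t2\tGGCC",
    "S\t3\tTTTTT",
    "L\t1\t+\t2\t+\t0M",
    "L\t2\t+\t1\t+\t0M",
]
stats = summarise_gfa(example)
print(stats)
```

For real data, the same function can be fed an open file handle for each graph in the output directory. A sharp drop in `total_bp` between one k and the next would point to the step where the graph collapses.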

Ryan

tamascogustavo commented 2 years ago

I am facing the same issue. The final assembly selected by Unicycler is quite small (~300 kbp).

Is there a way to use other metrics to select the best assembly?
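As an illustration of what a size-aware metric could look like, the sketch below first discards candidates much smaller than the largest one and only then ranks the rest by contig and dead-end counts. Nothing in this thread says Unicycler offers such an option; `pick_best`, `min_size_frac`, the assumed score formula and the numbers are all hypothetical, and this would be applied by hand to the per-k graphs.

```python
def pick_best(candidates, min_size_frac=0.8):
    """Return the k whose graph scores best among candidates of plausible size.

    candidates maps k -> dict with 'contigs', 'dead_ends' and 'total_bp'.
    min_size_frac is a hypothetical cut-off relative to the largest assembly.
    """
    largest = max(c["total_bp"] for c in candidates.values())
    eligible = {
        k: c
        for k, c in candidates.items()
        if c["contigs"] > 0 and c["total_bp"] >= min_size_frac * largest
    }

    def score(k):
        c = eligible[k]
        # Assumed score: fewer contigs and fewer dead ends are better.
        return 1.0 / (c["contigs"] * (c["dead_ends"] + 1) ** 2)

    return max(eligible, key=score)


# Hypothetical per-k results for illustration only.
per_k = {
    27: {"contigs": 400, "dead_ends": 10, "total_bp": 5_000_000},
    47: {"contigs": 150, "dead_ends": 4, "total_bp": 4_900_000},
    63: {"contigs": 50, "dead_ends": 2, "total_bp": 600_000},
    71: {"contigs": 20, "dead_ends": 0, "total_bp": 300_000},
}
chosen = pick_best(per_k)
print(chosen)
```

The size filter is the design choice that matters here: the collapsed graphs at k = 63 and k = 71 are excluded before their low contig counts can make them look attractive.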

asl commented 2 years ago

> I don't, however, understand what's going on with SPAdes! I.e. why do k-mers 63+ result in such small assemblies?

The answer is quite simple. The input reads are likely only 75 bp long. Once k goes beyond about half the read length, the graph is just a pile of barely connected reads, which the graph simplification algorithms then remove.
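The arithmetic behind this can be sketched quickly. A read of length L contains L - k + 1 k-mers, so k-mer coverage is roughly base coverage times (L - k + 1) / L. The 75 bp read length comes from the comment above; the 50x base coverage and the list of k values are assumptions chosen for illustration.

```python
def kmers_per_read(read_len: int, k: int) -> int:
    """Number of k-mers contained in a single read of length read_len."""
    return max(read_len - k + 1, 0)


def kmer_coverage(base_coverage: float, read_len: int, k: int) -> float:
    """Approximate k-mer coverage given base coverage and read length."""
    return base_coverage * kmers_per_read(read_len, k) / read_len


READ_LEN = 75   # from the comment above
BASE_COV = 50   # hypothetical sequencing depth

for k in (27, 47, 63, 71):
    print(k, kmers_per_read(READ_LEN, k), round(kmer_coverage(BASE_COV, READ_LEN, k), 1))
# 27 49 32.7
# 47 29 19.3
# 63 13 8.7
# 71 5 3.3
```

At k = 63 each 75 bp read contributes only 13 k-mers, and the effective coverage falls to about a sixth of the base coverage, which is consistent with a graph that fragments and is then mostly removed.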