rrwick / Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
GNU General Public License v3.0

Question about purpose of subsampling #73

Closed: aharring83 closed this issue 5 months ago

aharring83 commented 5 months ago

Hi Dr. Wick,

Why do we need to subsample ONT data for monoisolate genome assembly? I ran some benchmarks using rasusa and found that subsampling my trimmed reads to various fractions made no significant difference to my Flye assembly. The only exception was when I subsampled to 20% or less of my trimmed reads. I looked for an explanation in the literature but could not find one, and was hoping you could help me understand the purpose of this step. Your response would be greatly appreciated.

Anthony Harrington

rrwick commented 5 months ago

Hi Anthony,

I don't think you need to do the read subsampling step. But when running Trycycler, it's good to have lots of input assemblies that are as independent as possible, and read subsampling can help with that in a few ways. Some examples:

I pretty much always use subsampling unless I'm starting with a read set that has not-great depth, e.g. 30x.
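As a rough illustration of the idea (not Trycycler's or rasusa's actual implementation), here is a minimal Python sketch that estimates read depth and draws several random read subsets with different seeds. The function names, the 5 Mbp genome size and the 50% fraction are all made up for the example; the 30x figure is just the example depth mentioned above, not a hard threshold.

```python
import random


def read_depth(read_lengths, genome_size):
    """Approximate depth: total sequenced bases divided by genome size."""
    return sum(read_lengths) / genome_size


def subsample(read_ids, fraction, seed):
    """Pick a random fraction of the read IDs, reproducibly for a given seed."""
    rng = random.Random(seed)
    k = round(len(read_ids) * fraction)
    return sorted(rng.sample(read_ids, k))


# Hypothetical read set: 1000 reads of 150 kbp on a 5 Mbp genome (30x depth).
read_ids = [f"read_{i}" for i in range(1000)]
read_lengths = [150_000] * len(read_ids)
depth = read_depth(read_lengths, genome_size=5_000_000)

# Different seeds give different subsets, so each assembly sees different reads.
subsets = [subsample(read_ids, fraction=0.5, seed=s) for s in range(3)]
```

Each subset would then go to a separate assembler run, so the resulting assemblies do not all share the same input.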

When you say 'there was no significant difference on my genome assembly' - I suspect there were still some differences, i.e. the assemblies weren't base-for-base identical. And when there are differences between assemblies, that creates bubbles in the graph (see step 6 here) where Trycycler can choose the best option.
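To check whether two assemblies of the same replicon really are base-for-base identical, something like the following sketch could be used. It relies on Python's standard `difflib` and is only meant for short toy sequences; it is not how Trycycler compares contigs, and the function name and example sequences are invented for illustration.

```python
from difflib import SequenceMatcher


def differing_regions(seq_a, seq_b):
    """Return (tag, a_start, a_end, b_start, b_end) for every non-matching span."""
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Toy 'assemblies' that differ by a single base.
assembly_1 = "ACGTACGTAC"
assembly_2 = "ACGTTCGTAC"

same = differing_regions(assembly_1, assembly_1)   # no differences expected
diffs = differing_regions(assembly_1, assembly_2)  # at least one differing span
```

Each differing span is loosely analogous to a place where the assemblies disagree and a choice has to be made between the alternatives.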

Ryan