rrwick / Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
GNU General Public License v3.0

Maximum insertion/deletion sizes #16

Closed: termithorbor closed this issue 3 years ago

termithorbor commented 3 years ago

Hi,

All my contigs have problems with the maximum insertion size.

Maximum insertion/deletion sizes:
  A_contig_1:      0   417   481   304   300   310   846   691   598
  B_utg000001l:  417     0   481   298   298   298  1647   298   598
  C_Utg574:      481   481     0   481   481   481  1647   481   598
  D_contig_2:    304   298   481     0     9     2  1647    20   598
  E_utg000001l:  300   298   481     9     0     9  1647    20   598
  G_contig_2:    310   298   481     2     9     0  1647    20   598
  H_utg000001l:  846  1647  1647  1647  1647  1647     0   483   598
  J_contig_2:    691   298   481    20    20    20   483     0   598
  K_utg000001c:  598   598   598   598   598   598   598   598     0

I therefore just used --max_indel_size 1650 for trycycler reconcile. Is it okay to do so? What is the problem with my sequences?

Thanks in advance.

rrwick commented 3 years ago

This is happening because the assemblies are suffering from some larger-than-normal-scale errors. I've encountered this too, and it's not always clear to me why some genomes have these bigger indels while others don't. Ideally, nearly all of these values would be small (<100), which would indicate that the assemblies are in good agreement with each other over their full lengths, but that's not always possible.

I too have found that I often need to increase the --max_indel_size value to get a cluster reconciled, and so I've just pushed a change to Trycycler's main branch which increases the default to 1000.

In your case, I think that running with --max_indel_size 1650 is probably okay, though you might want to toss out H_utg000001l, as that one seems worse than the rest. But don't stress about it too much - I suspect you'd get a similar result whether or not you discarded that contig.
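One way to see why H_utg000001l stands out is to summarise each contig's row of the matrix. The following is only a rough sketch and not part of Trycycler: the helper names (`parse_matrix`, `median_vs_others`, `max_indel`) are hypothetical, the matrix is simply pasted from the output quoted above, and using the median as the "worse than the rest" measure is an assumption made here for illustration.

```python
from statistics import median

# Matrix pasted from the reconcile output quoted earlier in this thread.
MATRIX_TEXT = """\
A_contig_1: 0 417 481 304 300 310 846 691 598
B_utg000001l: 417 0 481 298 298 298 1647 298 598
C_Utg574: 481 481 0 481 481 481 1647 481 598
D_contig_2: 304 298 481 0 9 2 1647 20 598
E_utg000001l: 300 298 481 9 0 9 1647 20 598
G_contig_2: 310 298 481 2 9 0 1647 20 598
H_utg000001l: 846 1647 1647 1647 1647 1647 0 483 598
J_contig_2: 691 298 481 20 20 20 483 0 598
K_utg000001c: 598 598 598 598 598 598 598 598 0
"""


def parse_matrix(text):
    """Hypothetical helper: split 'name: v1 v2 ...' lines into names and integer rows."""
    names, rows = [], []
    for line in text.strip().splitlines():
        name, values = line.split(":")
        names.append(name.strip())
        rows.append([int(v) for v in values.split()])
    return names, rows


def median_vs_others(rows):
    """Median indel size of each contig against all other contigs (diagonal excluded)."""
    return [median(v for j, v in enumerate(row) if j != i) for i, row in enumerate(rows)]


def max_indel(rows, skip=None):
    """Largest pairwise value, optionally ignoring one contig (by index)."""
    return max(
        v
        for i, row in enumerate(rows)
        for j, v in enumerate(row)
        if i != skip and j != skip
    )


names, rows = parse_matrix(MATRIX_TEXT)
medians = median_vs_others(rows)

# The contig with the highest median disagrees most with the others.
worst = max(range(len(names)), key=lambda i: medians[i])
print("Most divergent contig:", names[worst])
print("Largest indel, all contigs:", max_indel(rows))
print("Largest indel, without it:", max_indel(rows, skip=worst))
```

With the numbers above, the largest value over all contigs is 1647, and it drops to 691 once H_utg000001l is left out, which is below the new default of 1000.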

Thanks for bringing this up!