rrwick / Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
GNU General Public License v3.0

Small-scale errors after polishing #3

Closed: Jeepee8820 closed 3 years ago

Jeepee8820 commented 4 years ago

Dear Ryan,

Thanks for developing this great tool. My problem probably comes not from Trycycler itself but from step 7, "Polishing after Trycycler", so apologies for bothering you with this. I followed your walk-through for step 7 exactly, repeating the polishing until no changes were reported. However, on the handful of genomes I have assembled with Trycycler so far, I have noticed that the polished bacterial genomes still have a lot of small-scale errors. This results in disrupted CDSs, i.e. premature stop codons. Do you have any suggestions on how to improve this? For example, should I perform more stringent QC of the Illumina reads with fastp? Are there any parameters that could be tweaked? Any tips would be really appreciated! Many thanks in advance.

rrwick commented 4 years ago

It's hard to answer this question generally because there could be a number of causes. When you say 'a lot of small-scale errors', roughly how many are you talking about?

One issue I know of is genomic repeats - they make polishing difficult. In brief, an error in a repeat is hard to fix with short-read polishing because the short reads will preferentially align to the other instance(s) of the repeat. You can mitigate this problem by getting your genome as clean as possible before short-read polishing. Assuming you're using ONT reads, that means re-basecalling with the latest version of Guppy (v3.6 or later is best as I write this) and polishing with Medaka. But I wouldn't expect there to be 'a lot' of such errors in repeats - in my experience it's often just one or two in the whole genome. If your genome is highly repetitive, this would be a bigger issue.
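To make the repeat problem concrete, here is a toy sketch in Python (not how any real aligner or polisher works internally). It assumes a made-up genome with two identical repeat copies, a draft assembly with a single substitution error in the first copy, and a simple ungapped best-match placement standing in for read alignment. All sequences and function names are hypothetical.

```python
def mismatches(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(read: str, ref: str) -> tuple[int, int]:
    """Return (position, mismatches) of the best ungapped placement of read on ref.

    Ties are broken in favour of the leftmost position.
    """
    hits = [
        (mismatches(read, ref[i:i + len(read)]), i)
        for i in range(len(ref) - len(read) + 1)
    ]
    mm, pos = min(hits)
    return pos, mm


# Hypothetical toy genome: two identical repeat copies separated by unique flanks.
REPEAT = "ACGTTGCAAGGCTTAC"
FLANK_A, FLANK_B, FLANK_C = "TTTTTTTT", "GGGGGGGG", "CCCCCCCC"
truth = FLANK_A + REPEAT + FLANK_B + REPEAT + FLANK_C

# Draft assembly: the first repeat copy carries one substitution error (A -> T).
bad_repeat = REPEAT[:8] + "T" + REPEAT[9:]
draft = FLANK_A + bad_repeat + FLANK_B + REPEAT + FLANK_C

# A short read that truly originates from the first repeat copy (position 10).
read = truth[10:22]

pos, mm = best_hit(read, draft)
print(f"true origin: 10, placed at: {pos}, mismatches: {mm}")
print(f"mismatches at true origin: {mismatches(read, draft[10:22])}")
```

In this toy case the read matches the error-free second copy perfectly, so it is placed there and the erroneous first copy receives no corrective evidence from it. A cleaner long-read consensus before short-read polishing leaves fewer such errors to begin with.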

Poor quality/depth of your Illumina reads might cause problems too, so it's certainly worth trying more QC.

I'm also curious how you're quantifying the errors. Comparing your assembly to a reference genome? Using ideel or something similar?
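One rough way to put a number on disrupted CDSs without a reference genome is to compare each predicted protein's length to the length of its best database hit, since indel errors and premature stop codons tend to produce truncated proteins. The sketch below is an illustration of that idea only, not ideel's actual code. The three-column tab-separated layout (protein ID, query length, hit length), the 0.95 threshold, and the function name are all assumptions made for this example.

```python
import csv


def full_length_fraction(tsv_lines, threshold=0.95):
    """Fraction of proteins whose length is at least `threshold` of their best hit's length.

    Assumed (hypothetical) input layout per line: protein_id <TAB> query_len <TAB> hit_len
    """
    ratios = []
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query_len, hit_len = int(row[1]), int(row[2])
        ratios.append(query_len / hit_len)
    if not ratios:
        return 0.0
    return sum(r >= threshold for r in ratios) / len(ratios)


# Made-up example data: prot2 looks truncated, e.g. by a premature stop codon.
example = [
    "prot1\t300\t300",
    "prot2\t120\t310",
    "prot3\t295\t300",
    "prot4\t450\t452",
]
print(full_length_fraction(example))
```

Running the same calculation before and after each polishing round gives a comparable figure for whether the number of truncated proteins is actually going down.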

Ryan

rrwick commented 3 years ago

I'm going to close this issue now due to inactivity, but please let me know if you're still having issues. Happy to discuss further!