nf-core / bacass

Simple bacterial assembly and annotation pipeline
https://nf-co.re/bacass
MIT License

List of ideas to improve assemblies #57

Open d4straub opened 3 years ago

d4straub commented 3 years ago

This is a collection of ideas that should be considered after the DSL2 conversion #56 is finished. The list is subject to change. Any ideas or discussions are welcome.

Preprocessing (check out nf-core/mag, any other examples out there?)

Assemblers:

Assembly QC:

Structural:

Defaults

Daniel-VM commented 10 months ago

Working on Flye and Pilon!

erinyoung commented 10 months ago

Add an option to down-sample reads, because this can sometimes improve the assembly.

Filtlong can down-sample reads to the longest/highest quality reads and rasusa can downsample randomly.

I know there are more papers about the ideal depth for assembly, but I can only find this old one for now (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0060204).

In my own experience, there are a lot more sequencing artifacts once you get above 100X.
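As a rough illustration of the depth arithmetic behind down-sampling, here is a minimal sketch. The function names, the example numbers, and the 100X default target are assumptions for illustration only; they are not part of bacass, Filtlong, or rasusa.

```python
def estimated_depth(total_bases: int, genome_size: int) -> float:
    """Mean sequencing depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def keep_fraction(total_bases: int, genome_size: int, target_depth: float = 100.0) -> float:
    """Fraction of bases to keep so the depth does not exceed target_depth."""
    depth = estimated_depth(total_bases, genome_size)
    if depth <= target_depth:
        return 1.0  # already at or below the target: keep everything
    return target_depth / depth


# Made-up example: 1.5 Gbp of reads for a 5 Mbp genome is 300X,
# so roughly one third of the bases would be kept to reach 100X.
depth = estimated_depth(1_500_000_000, 5_000_000)
fraction = keep_fraction(1_500_000_000, 5_000_000)
```

The genome size has to be supplied or estimated beforehand, which is an extra input a pipeline option would need.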

erinyoung commented 10 months ago

Another idea I recommend adding is a rotation step. This ensures all bacterial chromosomes at least start at dnaA.

A case in point: these are two chromosomes from a clonal outbreak. They are actually very similar, but one wasn't rotated correctly.


There are a few tools that rotate circular sequences. I think circlator fixstart (abandonware) and dnaapler are the ones that I use most.
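To make the idea of rotation concrete, here is a minimal sketch of rotating a circular sequence so that it starts at a chosen position or motif. The function names are hypothetical, and the exact-match motif search is only a stand-in: it does not locate dnaA by homology and ignores the reverse strand, so it is not how circlator or dnaapler work.

```python
def rotate_circular(seq: str, start: int) -> str:
    """Rotate a circular sequence so 0-based position `start` becomes the first base."""
    if not seq:
        return seq
    start %= len(seq)
    return seq[start:] + seq[:start]


def rotate_to_motif(seq: str, motif: str) -> str:
    """Rotate so the sequence begins with `motif` (forward strand, exact match only)."""
    if not motif:
        raise ValueError("motif must not be empty")
    # Extend the sequence so a motif spanning the end/start junction is still found.
    extended = seq + seq[: len(motif) - 1]
    pos = extended.find(motif)
    if pos == -1:
        raise ValueError("motif not found on the forward strand")
    return rotate_circular(seq, pos)


# Two records of the same circular molecule with different start points
# become identical once both are rotated to the same motif.
a = "ATGCCGTA"
b = rotate_circular(a, 3)
same = rotate_to_motif(a, "ATGCC") == rotate_to_motif(b, "ATGCC")
```

Rotation only makes sense for contigs known to be circular, so such a step would need circularity information from the assembler.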

erinyoung commented 10 months ago

For Assembly QC, I'm a fan of gfastats for metrics about the created GFA files, and of NanoPlot. They have a lot of overlapping features, but gfastats does indicate whether a sequence is circular. NanoPlot already has a module in MultiQC.
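To show what such metrics amount to, here is a minimal sketch that reads GFA1 text and reports segment lengths, N50, and a circularity guess. It assumes a segment is circular when it has a link to itself in the same orientation, which is only a heuristic, and the function names are hypothetical. This is not how gfastats is implemented.

```python
def summarize_gfa(gfa_text: str) -> dict:
    """Return {segment: {"length": int, "circular": bool}} from GFA1 text."""
    lengths = {}
    circular = set()
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name, seq = fields[1], fields[2]
            if seq != "*":
                lengths[name] = len(seq)
            else:
                # Sequence omitted: fall back to the optional LN:i:<length> tag.
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        lengths[name] = int(tag[5:])
        elif fields[0] == "L" and len(fields) >= 5:
            # Heuristic: a self-link in the same orientation suggests a circular segment.
            if fields[1] == fields[3] and fields[2] == fields[4]:
                circular.add(fields[1])
    return {n: {"length": l, "circular": n in circular} for n, l in lengths.items()}


def n50(lengths) -> int:
    """Smallest length such that segments at least that long cover half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Made-up example graph: one circular chromosome and one linear segment.
example = "H\tVN:Z:1.0\nS\tchr\t*\tLN:i:5000\nS\tplasmid\tACGTACGT\nL\tchr\t+\tchr\t+\t0M\n"
summary = summarize_gfa(example)
```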

d4straub commented 10 months ago

I actually had a very good experience with dragonflye for nanopore assembly (in nf-core modules: https://nf-co.re/modules/dragonflye). The results were close to identical to the Trycycler results, but dragonflye ran very fast (a few minutes), while Trycycler was a chore with many manual interventions.

Daniel-VM commented 10 months ago

Those are really good points @erinyoung and @d4straub 🙌🏾 🙌🏾 .

Downsample step

Yep, a downsampling step is indeed necessary. We could try random subsampling with rasusa. De Maio N et al., 2019 mention that the random strategy generates better assemblies than the filtering strategy, but it always depends on the input data and the goal. Nevertheless, we could think about adding Filtlong or NanoFilt in the quality filtering step (after adapter trimming with porechop?).
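The difference between the two strategies can be sketched on a list of read lengths. This is a simplified illustration with hypothetical function names: the "filtering" variant here ranks by length only, whereas a real filter would also weigh read quality, and neither function reflects the actual implementation of rasusa or Filtlong.

```python
import random


def _take_until(order, read_lengths, target_bases):
    """Keep read indices in the given order until target_bases is reached."""
    kept, total = [], 0
    for i in order:
        if total >= target_bases:
            break
        kept.append(i)
        total += read_lengths[i]
    return kept


def subsample_random(read_lengths, target_bases, seed=0):
    """Random strategy: shuffle the reads, then keep them until the target is met."""
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)
    return _take_until(order, read_lengths, target_bases)


def subsample_longest(read_lengths, target_bases):
    """Filtering strategy (length only): keep the longest reads first."""
    order = sorted(range(len(read_lengths)), key=lambda i: read_lengths[i], reverse=True)
    return _take_until(order, read_lengths, target_bases)


# Made-up read lengths; the target of 10,000 bases is arbitrary.
lengths = [100, 5000, 200, 8000, 300]
longest = subsample_longest(lengths, 10_000)
randomly = subsample_random(lengths, 10_000, seed=1)
```

The random variant preserves the original read length distribution, while the length-based variant shifts it toward long reads, which is one reason the two can produce different assemblies.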

Rotation step

Sure, but I think that Circlator is not supported anymore either... What do you suggest? Adding circlator together with dnaapler, or just dnaapler?

dragonflye - Longreads assembly

Interesting, I haven't tried this tool yet. But if it avoids the manual intervention of Trycycler, then it would be great to add this module. I know that Flye supports not only ONT but also PacBio reads. dragonflye works with ONT reads only, doesn't it?

Daniel-VM commented 10 months ago

I have found these two papers that may help us to decide. Both include a detailed flowchart with some of the tools we already have included and additional tools/strategies:

Molina-Mora J.A. et al., 2020

LaSarre B. et al., 2022

d4straub commented 10 months ago

Trycycler will require a large effort to automate; see for example https://github.com/rrwick/Trycycler/issues/47. So I think dragonflye is the way to go for now.

erinyoung commented 10 months ago

Here's a blog post from Dr. Wick about depth and quality: https://rrwick.github.io/2023/11/06/accuracy-vs-depth-update.html

You can see in the plot that accuracy improved up to ~100× depth, after which additional reads brought no benefit. In fact, some of the genomes got a bit worse with higher depth, which was surprising.