jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Dereplication step after sequential mode with long reads #881

Closed Selucote82 closed 2 months ago

Selucote82 commented 2 months ago

Dear SqueezeMeta developers,

I know the issue I am opening is not strictly about SqueezeMeta, but I would like to hear your suggestions, if possible.

I ran SM in sequential mode on five separate samples sequenced with ONT. As I have seen in the wiki docs, in this mode, besides producing a shorter set of results, it is necessary to dereplicate the generated bins. At this point, I want to follow your suggestion and use dRep. However, I have read that DAS Tool is also used for individual assemblies, and SM already uses DAS Tool as part of its process.

So, why is it necessary to dereplicate in an external step (e.g. with dRep) if the tool integrated in SM, DAS Tool, (presumably) does the same thing?

Thank you!

fpusan commented 2 months ago

When using the sequential mode, each sample gets assembled and processed individually.

The problem with this is that if the same organism is present in two different samples, it will generate two different sets of contigs (one per assembly). If you were doing a co-assembly, it would generate only one set of contigs (since all the reads are assembled together, reads coming from the same organism get assembled into the same contigs even if they come from different samples).

Let's imagine a wonderful world in which we managed to assemble every organism into a single contig. Let's also imagine that we have two samples (1 and 2), and both of them contain the same organisms A, B, and C.

With a coassembly we would get only three contigs (A, B, and C). When assembling every sample individually we would get two sets of contigs, one per assembly (A1, B1, C1 and A2, B2, C2). These are actually redundant (A1 and A2 are the same, etc.), but we have no easy way to know which ones are redundant when working with a large, complex dataset.

So how do we group these together if we want to track the abundance of features across different samples?

The redundancy is perfectly fine if you are only interested in taxonomic / functional profiles, since redundant contigs will be annotated similarly. So, e.g. for taxonomy, A1 and A2 will both be annotated as "Escherichia", and at this point we can just track the abundance of the "Escherichia" feature across our samples regardless of whether we were originally working with a coassembly or with a lot of individual assemblies.
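To make this concrete, here is a minimal sketch of tracking an annotated feature instead of individual contigs. The contig names, taxa, and abundance values are made-up toy data, and `taxon_profile` is a hypothetical helper, not part of SqueezeMeta:

```python
from collections import defaultdict

# Hypothetical toy data: (contig, sample, taxon annotation, abundance).
# A1 and A2 are redundant contigs from two individual assemblies.
contigs = [
    ("A1", "sample1", "Escherichia", 10.0),
    ("B1", "sample1", "Bacillus", 5.0),
    ("A2", "sample2", "Escherichia", 7.0),
    ("B2", "sample2", "Bacillus", 2.0),
]


def taxon_profile(records):
    """Sum contig abundances per taxon and sample, ignoring contig identity."""
    profile = defaultdict(lambda: defaultdict(float))
    for _contig, sample, taxon, abundance in records:
        profile[taxon][sample] += abundance
    return {taxon: dict(samples) for taxon, samples in profile.items()}


profile = taxon_profile(contigs)
print(profile["Escherichia"])
```

Because the grouping key is the annotation rather than the contig, it does not matter whether "Escherichia" came from one coassembled contig or from A1 and A2 separately.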

But it becomes trickier for bins, since in this case we are working directly with nucleotide sequences. In a coassembly it is fine because we don't have redundant contigs, so we will (for the most part) get one bin per "species" (quoted here since the species definition in prokaryotes is tricky/controversial, but nonetheless we need some sort of operational unit). But with individual assemblies the same "species" will generate a different bin for each sample in which it is present.

dRep (or mOTUlizer, which I personally prefer) is used to cluster all your bins based on their nucleotide similarity, so you can identify those that come from the same "species" and take this into account in your analysis/discussion. Bins coming from different individual assemblies that belong to the same cluster would likely have produced a single bin had you done a coassembly, so you can treat them as coming from the same "species". This may even be advantageous, as coassemblies tend to produce a consensus bin for the whole "species", ignoring the accessory genome (which may or may not be interesting to you).
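As a rough illustration of the clustering idea only (this is not how dRep or mOTUlizer are implemented), here is a toy sketch that groups bins whose pairwise similarity passes a threshold. The similarity values and the 0.95 cutoff are assumptions chosen for the example, and `cluster_bins` is a hypothetical helper:

```python
def cluster_bins(bins, similarity, threshold=0.95):
    """Single-linkage clustering of bins using a union-find structure.

    `similarity` maps (bin_a, bin_b) pairs to a nucleotide similarity
    between 0 and 1. Pairs that are missing are treated as dissimilar.
    """
    parent = {b: b for b in bins}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), value in similarity.items():
        if value >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for b in bins:
        clusters.setdefault(find(b), []).append(b)
    return sorted(sorted(members) for members in clusters.values())


# Toy bins from two individual assemblies (samples 1 and 2).
bins = ["A1", "B1", "C1", "A2", "B2", "C2"]
# Made-up pairwise similarities, for illustration only.
similarity = {
    ("A1", "A2"): 0.99,
    ("B1", "B2"): 0.98,
    ("C1", "C2"): 0.97,
    ("A1", "B1"): 0.80,
    ("A2", "C2"): 0.78,
}

clusters = cluster_bins(bins, similarity)
print(clusters)
```

Each resulting cluster plays the role of one "species": A1 and A2 end up together, so their abundances can be tracked as a single unit across samples.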

Selucote82 commented 2 months ago

Thank you for your quick response and your thorough, detailed explanation. In my case, although some of my samples are similar in the presence and (in some cases) abundance of microorganisms, I'm analyzing nearby soils with different uses, so there are many differences beyond, so to speak, the "core" of communities that are more or less present in every sample. That's why I decided to analyze each sample separately. Anyway, I think I'm going to try the co-assembly mode, joining the samples that are closest (in terms of their microbial communities), and check the results.

I'll take a look at mOTUlizer as well, thank you so much!

jtamames commented 2 months ago

Hello,

Thanks for the interesting question. Indeed, DAS Tool and dRep (or mOTUlizer) are not equivalent; their uses are rather different.

DAS Tool tries to refine the bins by reconciling different binning results. We use it to produce a kind of consensus result from different binning programs (MetaBAT2 and CONCOCT by default). Each of these will produce a different set of bins, and DAS Tool will try to unify both by moving contigs out of the bins until it optimizes completeness and contamination stats. Notice that it does not intend to identify bins that are equal; it just tries to unify results from THE SAME set of samples.

On the other hand, dRep does try to identify bins that are the same from DIFFERENT sets of samples. It does not try to refine the bins, just to dereplicate them. So the uses of the two tools are very different and can be complementary.

Regarding coassembly vs sequential binning: in my experience, by using coassembly you will increase completeness and reduce contamination for the bins corresponding to the set of organisms present in most samples. By using sequential mode, you will get more bins, but less complete ones, and it will also let you more easily recover bins present in one sample and not in the rest.

Best, J

Selucote82 commented 2 months ago

Excellent replies from you two, Fernando and Javier! After your explanations, it seems clear that the sequential mode is useful when someone wants to measure the (microbial) populations, as in my case, and coassembly is useful to get a suitable set of genomes with good quality and low contamination.

Thank you for taking the time to reply so quickly and clearly!

fpusan commented 2 months ago

Glad to have helped! Closing the issue.