Handle several metagenome samples not individually but synergistically.

d4straub commented 5 years ago

Problem

The pipeline should allow co-assembly of multiple samples instead of, or in addition to, treating them individually. For example, several metagenome samples from the same study might share genomes, and both assembly and binning can benefit from pooling these samples instead of treating them individually, as is done right now.

Possible solutions

MetaSPAdes 3.13.0 using Illumina and Nanopore data (hybrid assembly) can't handle several samples just yet, but MEGAHIT and possibly Illumina-only SPAdes should be able to handle this. Also, MetaBAT allows the use of multiple samples for improved binning.

A parameter (such as --pool_samples) could be added to allow optional pooling.
Several samples could simply be added to the parameters for assembly in process megahit and process spades.
Binning could be improved by using depth information from several samples when available (e.g. with jgi_summarize_bam_contig_depths --outputDepth depth.txt *.bam as described in https://bitbucket.org/berkeleylab/metabat). (Done.)
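
To illustrate the last two points, here is a minimal Python sketch that only builds the command lines and does not run anything. The function names and file names are hypothetical. It assumes MEGAHIT's -1/-2 options take comma-separated lists of read files and -o sets the output directory; the depth command is the one quoted above.

```python
from typing import List, Tuple


def megahit_pooled_cmd(read_pairs: List[Tuple[str, str]], outdir: str) -> List[str]:
    """Build one MEGAHIT call that co-assembles all given paired-end samples.

    Assumption: -1 and -2 accept comma-separated file lists, -o is the output dir.
    """
    forward = ",".join(r1 for r1, _ in read_pairs)
    reverse = ",".join(r2 for _, r2 in read_pairs)
    return ["megahit", "-1", forward, "-2", reverse, "-o", outdir]


def depth_cmd(bam_files: List[str], depth_file: str = "depth.txt") -> List[str]:
    """Build the depth summary call with one sorted BAM per sample."""
    return [
        "jgi_summarize_bam_contig_depths",
        "--outputDepth",
        depth_file,
        *sorted(bam_files),
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    pairs = [("s1_R1.fq.gz", "s1_R2.fq.gz"), ("s2_R1.fq.gz", "s2_R2.fq.gz")]
    print(" ".join(megahit_pooled_cmd(pairs, "pooled_assembly")))
    print(" ".join(depth_cmd(["s2.bam", "s1.bam"])))
```

Passing the explicit BAM list instead of *.bam keeps the column order of the depth file predictable.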

Feedback welcome

Is that of interest? For me that's not high priority but definitely interesting to have.

HadrienG commented 5 years ago

I like it, but can we try to release 1.0.0 first without that functionality? I feel like the release has been postponed enough (I know I'm to blame for that 😁 )

d4straub commented 5 years ago

Sure, let's postpone this feature rather than the release! Happy to hear you are working on 1.0.0!!!

d4straub commented 4 years ago

Seems like you addressed the last point in my initial post, improved binning, in commit 0f155a542f2ad707a70a6df225f5da8646847a04. Great!

Did I understand correctly that the pipeline now assembles each sample separately but then maps the reads of all samples individually to each assembly? Isn't that skewing results without doing a pooled assembly as well? Just wondering... do you have a benchmark for this?

edit: grammar

HadrienG commented 4 years ago

I realised I never answered this. What it does right now is map all the samples against each assembly and use the coverage information for binning. It doesn't skew the results, since metabat2 is able to treat coverage info independently per bam file.
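
To make the all-against-all mapping concrete, here is a minimal Python sketch of the bookkeeping. It is not the pipeline's actual code; the function names and the BAM naming scheme are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def mapping_jobs(samples: List[str]) -> List[Tuple[str, str]]:
    """Return (assembly_sample, read_sample) pairs: every read set vs. every assembly."""
    return list(product(samples, repeat=2))


def bams_per_assembly(samples: List[str]) -> Dict[str, List[str]]:
    """Group hypothetical BAM names by the assembly the reads were mapped to."""
    grouped: Dict[str, List[str]] = {}
    for assembly, reads in mapping_jobs(samples):
        # Naming scheme "<assembly>.<reads>.bam" is an assumption for illustration.
        grouped.setdefault(assembly, []).append(f"{assembly}.{reads}.bam")
    return grouped


if __name__ == "__main__":
    for assembly, bams in bams_per_assembly(["A", "B", "C"]).items():
        # Each assembly is binned with one coverage column per BAM file.
        print(assembly, bams)
```

With n samples this gives n assemblies and n BAM files per assembly, so the number of mapping jobs grows as n squared.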

I want a pooling option for co-assembly as well, since I often have metagenomes from different conditions (e.g. polluted vs non-polluted area), but that will require a manifest file as input.

d4straub commented 4 years ago

Thanks for the explanation about MetaBAT2. I am testing this right now and will see how it goes.

I recently attended a talk where it was explicitly not recommended to use data from several samples for assembly. The reasoning was that this increases complexity (e.g. more genomes), and that this is more detrimental to assembly than the benefit of higher read coverage for genomes that appear in multiple samples. From this I conclude that the approach might be suitable only if no or few new genomes are introduced by combining samples (e.g. bacterial cultures with different treatments).

If all samples/sequencing data passed to the pipeline should be combined, I do not see the need for more than a new Boolean input parameter (e.g. --pool_samples). If only subsets should be combined, additional information is needed: either one more column in the file given to --manifest, or sample names given to --pool_samples. For example, --pool_samples sample1,sample2;sample3,sample4, where sample1 and sample2 are co-assembled, and sample3 and sample4 as well. The semicolon indicates that a new group follows.
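
A minimal Python sketch of how such a value could be parsed, assuming the comma/semicolon syntax proposed above. The function name and the rule that a sample may be listed only once are my own assumptions, not existing pipeline behaviour.

```python
from typing import List


def parse_pool_samples(value: str) -> List[List[str]]:
    """Split 'a,b;c,d' into [['a', 'b'], ['c', 'd']].

    Semicolons separate co-assembly groups, commas separate samples in a group.
    Empty entries are ignored; a sample listed twice is rejected (assumption).
    """
    groups: List[List[str]] = []
    for chunk in value.split(";"):
        names = [name.strip() for name in chunk.split(",") if name.strip()]
        if names:
            groups.append(names)

    seen = set()
    for name in (n for group in groups for n in group):
        if name in seen:
            raise ValueError(f"sample {name!r} is listed more than once")
        seen.add(name)
    return groups


if __name__ == "__main__":
    print(parse_pool_samples("sample1,sample2;sample3,sample4"))
```

On a command line the semicolon would need quoting, since most shells treat it as a command separator.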

telatin commented 4 years ago

The pipeline should allow assembly of multiple samples instead/ in addition to treating them

It's a nice feature, but I recommend offering it as an option rather than replacing the current behaviour. With very similar strains involved, it can lead to misassemblies.

d4straub commented 3 years ago

This is now implemented as an optional feature in dev.

nf-core / mag