Closed d4straub closed 3 years ago
I like it, but can we try to release 1.0.0 first without that functionality? I feel like release has been postponed enough (I know I'm to blame for that 😁 )
Sure, better to postpone this feature than the release! Happy to hear you are working on 1.0.0!!!
Seems like you addressed the last point in my initial post, improved binning, in commit 0f155a542f2ad707a70a6df225f5da8646847a04. Great!
Did I understand correctly that the pipeline now assembles each sample separately, but then maps every sample's reads individually against each assembly? Doesn't that skew results without also doing a pooled assembly? Just wondering... do you have a benchmark for this?
edit: grammar
I realised I never answered this. What it does right now is map all the samples against each assembly and use the coverage information for binning. It doesn't skew the results, since MetaBAT2 is able to treat coverage info independently per BAM file.
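To make that concrete, here is a rough dry-run sketch of those steps for one assembly: map every sample's reads against it, summarize per-BAM depths, then bin. It only prints the commands it would run; the file names, sample list, and tool flags are my assumptions, not the pipeline's actual code.

```shell
#!/usr/bin/env bash
# Sketch only: build the per-sample mapping commands plus the multi-sample
# depth/binning commands for a single assembly, and print them (dry run).
ASSEMBLY=sample1_assembly.fa          # assumed name of one sample's assembly
SAMPLES=(sample1 sample2 sample3)     # assumed sample IDs

cmds=()
for s in "${SAMPLES[@]}"; do
    # Every sample's reads get mapped against this one assembly.
    cmds+=("bwa mem $ASSEMBLY ${s}_R1.fastq.gz ${s}_R2.fastq.gz | samtools sort -o ${s}.bam")
done
# One coverage column per BAM; MetaBAT2 treats each column independently.
cmds+=("jgi_summarize_bam_contig_depths --outputDepth depth.txt *.bam")
cmds+=("metabat2 -i $ASSEMBLY -a depth.txt -o bins/bin")

printf '%s\n' "${cmds[@]}"
```

Repeating this block once per assembly gives the all-samples-vs-each-assembly pattern described above.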
I want a pooling option for co-assembly as well since I often have metagenomes with different conditions (i.e. polluted vs non-polluted area) but that will require a manifest file as input
Thanks for the explanation about MetaBAT2, I am testing this right now and will see how it goes.
I recently attended a talk where it was explicitly not recommended to use data from several samples for assembly. The reasoning was that this increases complexity (e.g. more genomes), which is more detrimental to assembly than the higher read coverage of genomes that appear in multiple samples is beneficial. From this I conclude that the approach might only be suitable if no/few new genomes are introduced by combining samples (e.g. bacterial cultures with different treatments).
If all samples/sequencing data passed to the pipeline should be combined, I do not see the need for more than a new Boolean input parameter (e.g. `--pool_samples`). If only subsets should be combined, additional information is needed: either one more column in the `--manifest` file, or sample names given to `--pool_samples`. For example, `--pool_samples sample1,sample2;sample3,sample4`, where sample1 and sample2 are co-assembled and sample3 and sample4 as well; the semicolon indicates that a new group follows.
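A minimal sketch of how such a hypothetical `--pool_samples` value could be split into co-assembly groups (the parameter and its syntax are just the proposal above, not an existing pipeline option):

```shell
#!/usr/bin/env bash
# Hypothetical --pool_samples value: ';' separates groups, ',' separates samples.
pool_samples="sample1,sample2;sample3,sample4"

# Split into groups on ';', then list the members of each group.
IFS=';' read -ra groups <<< "$pool_samples"
for i in "${!groups[@]}"; do
    IFS=',' read -ra members <<< "${groups[$i]}"
    echo "group $((i + 1)): ${members[*]}"
done
```

Each printed group would then correspond to one co-assembly job.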
> The pipeline should allow assembly of multiple samples instead/ in addition to treating them individually.
It's a nice feature, but I recommend offering it as an option rather than replacing the current behaviour. With very similar strains involved, co-assembly can lead to misassemblies.
This is implemented as an option in dev.
Problem
The pipeline should allow assembly of multiple samples instead of / in addition to treating them individually. For example, several metagenome samples from the same study might share genomes, and both assembly and binning can benefit from pooling these samples instead of treating them individually, as done right now.
Possible solutions
MetaSPAdes 3.13.0 using Illumina and Nanopore data (hybrid assembly) can't handle several samples just yet, but MEGAHIT and possibly Illumina-only SPAdes should be able to handle this. MetaBAT also allows the usage of multiple samples for improved binning.
A new parameter (e.g. `--pool_samples`) could be added to allow optional pooling in `process megahit` and `process spades`.
Binning could be improved by using depth information from several samples when available (e.g. with `jgi_summarize_bam_contig_depths --outputDepth depth.txt *.bam`, as described in https://bitbucket.org/berkeleylab/metabat). Done. Feedback welcome!
Is that of interest? For me that's not high priority but definitely interesting to have.