I think any effort to co-assemble samples will have to come after first trying to support custom annotation libraries. EC annotations from Prokka are nice, but we need to add something like MetaCyc or KEGG to improve cross-sample comparisons.
@brwnj What do you think of combining the contigs from all samples and applying CONCOCT or MaxBin to them, so we can identify MAGs across all samples and generate a table of abundances for each MAG, which has been requested several times? It should also reduce the work, as we would call genes only once.
@SilasK My lab mate @jmtsuji has a prototype of a co-assembly extension for Atlas.
Hey @jmtsuji, @LeeBergstrand,
Thank you for your contribution. My idea is to create an ATLAS organisation and gather all the extensions in one place. What do you think?
Do you have any comparisons of co-assembly versus single-sample assembly? According to the authors of dRep, co-assemblies are much more fragmented. In addition, co-assembly needs more resources: I'm trying to benchmark a MEGAHIT co-assembly of the 5 CAMI samples, and it hasn't finished after 4 days. So I don't know if this is a practical option.
If I understand correctly, you want to use differential abundance binning? Do you mean using the coverage information from different samples to bin the contigs?
In one of the last updates, I integrated the approach used in the DAS Tool article: assembling each sample separately, but then mapping the reads from all samples of the same group to the assembly of each sample in that group. (Groups can be specified in the config file by adding a group: attribute to each sample.)
Of course, this requires a lot of mapping, but I expect the assemblies to be more contiguous.
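For illustration, the grouping could look something like this in the config file (a minimal sketch; the sample names are placeholders, and the exact keys around `group:` are assumptions based on the description above, not the verified ATLAS schema, so check the docs for your version):

```yaml
samples:
  sampleA:
    fastq: [sampleA_R1.fastq.gz, sampleA_R2.fastq.gz]
    group: wetland      # reads from all samples sharing a group are cross-mapped
  sampleB:
    fastq: [sampleB_R1.fastq.gz, sampleB_R2.fastq.gz]
    group: wetland
```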
What do you think?
Hi @SilasK ,
I think it could make sense to create an ATLAS organization, but I would not necessarily recommend my extensions to other users at large at the moment. The extensions were quick scripts meant for internal lab use for specific analysis projects. They are all based on version 1.0.22 (so might break on higher versions) and have some settings hard-coded (e.g., must use paired-end reads). Also, they're written as shell scripts rather than snakefiles. But I do think some of their functions are quite useful (e.g., importing ATLAS results into Anvi'o for interactive bin cleanup as a post-processing step, or estimating the relative abundances of dereplicated genome bins across all samples). If there is interest in broader usage of some of these extensions, I could potentially go back and refactor them or help to integrate their contents into ATLAS snakefiles.
You saw correctly for the co-assembly.sh script: the principle is that the script performs differential abundance binning via MetaBAT after performing the co-assembly.
I think your workflow of doing individual assembly, read mapping from individual samples, and then doing differential coverage binning makes sense. We've found internally (no hard data to go with this, however!) that you can end up with very fragmented co-assemblies, especially if the input samples share fewer populations. I suspect that the co-assembly approach could outperform individual assemblies if your input samples have a lot of shared information and lower sequencing depth, but I've never generated hard data to show that this is true. It would be interesting if ATLAS incorporated a co-assembly option that would allow users to easily make direct comparisons between the quality (e.g., N50, number of contigs, bin completeness/contamination) of assemblies from the two methods.
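For anyone wanting to run such a comparison by hand in the meantime, here is a minimal sketch using standard tools (assuming seqkit and CheckM are installed; all paths are placeholders for your own assembly and bin outputs):

```bash
# Contig-level stats (seqkit's -a flag adds N50 and related metrics):
seqkit stats -a coassembly/final_contigs.fasta single_assemblies/*/final_contigs.fasta

# Bin completeness/contamination with CheckM's lineage workflow:
checkm lineage_wf -x fasta genome_bins/ checkm_output/
```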
P.S. Thanks for all your work on ATLAS -- I think it's a great tool with a lot of potential.
Thank you for the compliment. What do you think would be the best way to stay informed about new developments of Atlas? A dedicated issue, slack or Twitter?
Check out the new binning module in Atlas with `atlas bin-genomes`.
@SilasK There's also Gitter (https://gitter.im) which provides "per repo" chat rooms. You can authenticate via your GitHub account which makes it easy for new users to join.
We're using it for one of our lab's tools: https://github.com/MetAnnotate/MetAnnotate
I am having some difficulties getting MAGs from low-depth sequencing samples, which are the ones of interest to me. Some of these projects have multiple replicates, however. Do you think co-assembly of the replicates should improve this and hopefully recover some MAGs from them? Thanks
@botellaflotante If your samples are indeed from the same environment and are expected to capture the same organism genomes, it can make sense to pool samples. You can think of a pooled assembly as setting up an environment where half the genome of an organism found in one sample can be merged with the other half found in another sample: if you pool the two samples together, the assembler can stitch the two halves together. Pooling can be potentially harmful if you have different strains of one organism across different microenvironments; when assembling samples from across environments, the assembler could end up merging strains (this is assembler- and situation-dependent). For replicates, this is less of an issue.
If you are pooling replicates, don't expect it to be a silver bullet for your depth issues. I have previously pooled shallow-depth triplicate metagenomes and assembled and binned them with Atlas, and the results went from 5 good-quality MAGs per replicate (often the same taxa) to 20 good-quality MAGs when pooled. That the increase isn't larger is not Atlas's fault; it is simply the limit of what the assembler can reconstruct from that amount of input data. The most abundant organisms in the dataset tend to soak up most of the read data, so pooling mostly makes good-quality MAGs better and lets you find a few more partially complete MAGs. One needs orders of magnitude more input data to get orders of magnitude more MAGs.
@botellaflotante In theory, there is a "quick and dirty" way to implement co-assembly in ATLAS2. Let's say you have three metagenomes. Before starting ATLAS, you can pool (concatenate) their raw read (FastQ) files into a single R1 and a single R2 FastQ file. Then, you can run ATLAS with four samples: the pooled sample and the three unpooled samples, all placed in the same binning group in the samples.tsv file. ATLAS will then perform co-assembly on the pooled sample, and it will use read mapping information from the unpooled samples to guide genome binning. This method is not perfect, because the pooled sample will also be mapped to itself to guide genome binning, but it will at least give you an idea of whether co-assembly helps in your case. Note the comment from @LeeBergstrand above that co-assembly could result in more fragmented genome bins, depending on the nature of your metagenomes. All the best!
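A minimal sketch of the pooling step (file names are placeholders; concatenated gzip files are still valid gzip streams, so the raw FastQs can simply be concatenated per read direction):

```bash
# Pool the raw reads of three samples into one "co-assembly" sample:
cat sample1_R1.fastq.gz sample2_R1.fastq.gz sample3_R1.fastq.gz > pooled_R1.fastq.gz
cat sample1_R2.fastq.gz sample2_R2.fastq.gz sample3_R2.fastq.gz > pooled_R2.fastq.gz
# Then register "pooled" plus the three unpooled samples in samples.tsv,
# all within the same binning group, and run ATLAS as usual.
```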
@jmtsuji Good points on the binning steps.
@brwnj My professor and I don't think that a co-assembly over multiple files is a good idea. He proposes that I filter out known genomes to better assemble the unknown ones. On the other hand, co-assembly is something that is done in the field (e.g., recommended by Meren for Anvi'o).
Let's try it and see if MEGAHIT can manage the computational burden.
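For reference, a co-assembly with MEGAHIT outside of ATLAS could be sketched like this (file names are placeholders; MEGAHIT accepts comma-separated lists of paired files for -1/-2):

```bash
megahit \
  -1 s1_R1.fastq.gz,s2_R1.fastq.gz,s3_R1.fastq.gz \
  -2 s1_R2.fastq.gz,s2_R2.fastq.gz,s3_R2.fastq.gz \
  -o megahit_coassembly   # output directory must not already exist
```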