Closed: slambrechts closed this issue 3 years ago
Hey @slambrechts
Ok, you are working with a soil metagenome. As far as I know, the most crucial step is the assembly. Are you happy with the assembly? E.g. check reports/assembly_report.html
Regarding the binning parameter: a colleague who works on ocean microbiomes argues the more samples the better. ref
Keep in mind that more samples take longer: you are mapping each sample in a bingroup against every other sample in that bingroup. I would argue that it only makes sense to put samples in the same bingroup if you expect the same species in these samples.
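To illustrate why more samples take longer, here is a back-of-the-envelope sketch (this assumes a simple all-vs-all mapping within each bingroup as an approximation; it is not atlas code):

```python
# Rough sketch: how the cross-mapping workload grows with bingroup size,
# assuming every sample in a bingroup is mapped against every sample's
# assembly in that same group.

def mapping_jobs(group_sizes):
    """Total sample-to-assembly mappings, assuming all-vs-all per bingroup."""
    return sum(n * n for n in group_sizes)

# 58 samples in one big bingroup:
print(mapping_jobs([58]))          # 3364 mappings
# the same 58 samples split into three groups of 19-20:
print(mapping_jobs([20, 19, 19]))  # 1122 mappings
```

So the mapping cost grows quadratically per group, which is why splitting into smaller bingroups saves a lot of compute.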
If I may ask, how did you create the PCoA?
So probably using only three groups in your case.
If atlas works as expected, the bam files are not deleted, so you could try using each sample in its own bingroup, then the smaller bingroups, and finally only three bingroups. Run atlas run binning each time and save a copy of the bin report.
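To switch groupings between runs, you would edit the BinGroup column of the atlas sample table before each atlas run binning call. A minimal sketch of that edit (the column names, file layout, and cluster labels here are invented for illustration; check your own samples.tsv for the real layout):

```python
# Hypothetical sketch: rewrite the BinGroup column of an atlas-style sample
# table between binning runs. Columns and labels are illustrative only.
import csv
import io

# e.g. cluster labels derived from a PCoA (made-up assignment)
pcoa_cluster = {"sample1": "brown", "sample2": "bluegreen", "sample3": "mixed"}

tsv_in = (
    "sample\tReads\tBinGroup\n"
    "sample1\treads1.fastq\tall\n"
    "sample2\treads2.fastq\tall\n"
    "sample3\treads3.fastq\tall\n"
)

rows = list(csv.DictReader(io.StringIO(tsv_in), delimiter="\t"))
for row in rows:
    # replace the old single-group assignment with the PCoA-based one
    row["BinGroup"] = pcoa_cluster[row["sample"]]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=rows[0].keys(), delimiter="\t",
                        lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Keeping a copy of the sample table alongside each saved bin report makes it easy to remember which grouping produced which result.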
However, for the downstream analysis, the dereplication is as important as the binning. During dereplication we take only one genome per species (default 95% identity). So you might get more bins, but if they are all from the same species, then at the end (the atlas run genomes step) you still only get one genome per species. Obviously, you can also change the parameters of the dereplication.
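As a toy illustration of the dereplication idea (a simplified greedy sketch, not the actual dereplication tool atlas uses; genome names and ANI values are made up):

```python
# Toy sketch of dereplication: greedily keep one representative genome per
# 95%-identity cluster. Inputs are invented for illustration.

def dereplicate(genomes, ani, threshold=0.95):
    """genomes: list of genome names, sorted best-quality first.
    ani: dict mapping frozenset({a, b}) -> pairwise identity."""
    representatives = []
    for g in genomes:
        # keep g only if it is below the threshold against every kept genome
        if all(ani.get(frozenset({g, r}), 0.0) < threshold
               for r in representatives):
            representatives.append(g)
    return representatives

# e.g. the same species binned from three different samples
bins = ["bin1", "bin2", "bin3"]
ani = {
    frozenset({"bin1", "bin2"}): 0.99,
    frozenset({"bin1", "bin3"}): 0.97,
    frozenset({"bin2", "bin3"}): 0.98,
}
print(dereplicate(bins, ani))  # ['bin1'] -> only one genome survives
```

This is why more bins from more bingroups do not necessarily mean more species-level genomes at the end.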
For your info: I will also soon add a new binning approach #365
Hi @SilasK
Thank you very much for your comment. I have yet to try atlas, so I don't have any stats on the individual assemblies at the moment (still trying to figure out how to use atlas on our HPC without violating the rules), but the coassembly I have is indeed fragmented (#375).
The PCoA was created from 16S amplicon data generated from the same DNA extracts as the metagenomic data. The three groups, more or less, that I'm talking about are the brown cluster on the left, the blue-green cluster at the top, and the mixed cluster in the bottom right corner:
Looking forward to trying the different bingroup sizes and comparing how many bins we end up with for each approach after dereplication. I will do as you suggested and run atlas run binning and atlas run genomes each time.
You know about the cluster profile and the section in the docs about how to run atlas on a cluster, don't you?
@SilasK yes I do, and I would really like to use it that way, but I'm afraid we are not allowed to (I'm also speaking to the sysadmin about it). We have a high-memory cluster with 16 nodes (738 GB & 36 threads each) and a cluster with 128 nodes (250 GB & 96 threads each) (Ghent University HPC in Belgium). Maximum runtime on both of them is 72 hours. So that should be sufficient to use atlas in single-machine mode, I guess?
You only have those two clusters? Or are there others with less memory?
@SilasK there are 3 other clusters, ranging from 88 to 177 GB
Ok. In my case I have many more medium-memory nodes, and I'm limited to 12 h on most of them. The advantage of using the cluster wrapper is that you could use more than one node: the medium-memory nodes for the normal jobs and the high-memory ones for the assembly. Other users can still run jobs between the atlas steps.
In any case, I suggest testing different assembly and normalization variants on a subset of samples, e.g. 3. It seems it would be best if I added the normalization step back to the atlas pipeline. But even then it would be good to know which parameters to use for the normalization.
Hi,
I'm pondering the BinGroup parameter, and wonder whether a larger number of smaller groups or rather a few large binning groups would be preferable, and what the difference would be for downstream analyses. Of course it depends on what kind of samples/environment you're working with etc., but maybe also on how atlas and this parameter were designed. I have 58 soil samples (5 billion reads total), and I wonder what the effect would be of assigning them either to 3 large BinGroups of 15-20 samples each (based on 3 big clusters on a PCoA plot), or to smaller BinGroups of 3-7 samples based on smaller groups on that same PCoA.