metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

BinGroup parameter #374

Closed: slambrechts closed this issue 3 years ago

slambrechts commented 3 years ago

Hi,

I'm pondering the BinGroup parameter and wondering whether a larger number of smaller groups or a few large binning groups would be preferable, and what the difference would be for downstream analyses. Of course it depends on what kind of samples/environment you're working with, but maybe also on how atlas and this parameter were designed. I have 58 soil samples (5 billion reads total), and I wonder what the effect would be of assigning them either to 3 large BinGroups of 15–20 samples each (based on 3 big clusters on a PCoA plot), or to smaller BinGroups of 3–7 samples based on smaller groups on that same PCoA.
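For concreteness, a minimal sketch of what the three-large-groups option would look like in the sample table, assuming the standard Atlas `samples.tsv` layout with a `BinGroup` column (sample names, read paths, and the exact read-column headers here are hypothetical and may differ between Atlas versions):

```tsv
sample	Reads_raw_R1	Reads_raw_R2	BinGroup
soil01	soil01_R1.fastq.gz	soil01_R2.fastq.gz	groupA
soil02	soil02_R1.fastq.gz	soil02_R2.fastq.gz	groupA
soil20	soil20_R1.fastq.gz	soil20_R2.fastq.gz	groupB
soil40	soil40_R1.fastq.gz	soil40_R2.fastq.gz	groupC
```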

SilasK commented 3 years ago

Hey @slambrechts

Ok, you are working with soil metagenomes. As far as I know, the most crucial step is the assembly. Are you happy with the assembly? E.g. check the reports/assembly_report.html.

For the binning parameter: a colleague who works on the ocean microbiome argues the more samples the better (ref).

Keep in mind that more samples per group take longer: you are mapping each sample in a BinGroup against every other sample in that group. I would argue that it only makes sense to put samples in the same BinGroup if you expect the same species in those samples.
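To make the cost concrete (assuming, as described above, that every sample in a group is mapped against every assembly in that group): a BinGroup of n samples costs on the order of n² mapping jobs. With 58 samples, three groups of ~19 give roughly 3 × 19² ≈ 1,080 mappings, groups of ~5 give roughly 12 × 5² ≈ 290, and one sample per group gives just 58.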

If I may ask, how did you create the PCoA?

So in your case, probably use only three groups. If atlas works as expected, the BAM files are not deleted, so you could try using each sample in its own BinGroup, then the smaller BinGroups, and finally only three BinGroups. Run `atlas run binning` each time, and save a copy of the bin report.
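A rough sketch of that comparison loop, assuming pre-edited sample tables for each grouping and the usual `atlas run` flags (`-w` for the working directory, `-c` for the config file); the report location under `reports/` is an assumption, so check where your Atlas version writes it:

```sh
# Compare grouping strategies: one BinGroup per sample,
# several small BinGroups, then three large BinGroups.
for grouping in per_sample small_groups three_groups; do
    cp samples_${grouping}.tsv samples.tsv      # hypothetical pre-edited sample tables
    atlas run binning -w . -c config.yaml
    mkdir -p saved_reports/${grouping}
    cp reports/bin_report*.html saved_reports/${grouping}/   # keep a copy per run
done
```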

However, for the downstream analysis, the dereplication is as important as the binning. During dereplication we keep only one genome per species (default: 95% identity). So you might get more bins, but if they are all from the same species, at the end (the `atlas run genomes` step) you still only get one per species. Obviously, you can also change the parameters of the dereplication.
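If you do want to adjust that species threshold, it is set in the config file; a sketch assuming the key names used by recent Atlas versions (verify against your own config.yaml, as the exact keys may differ):

```yaml
genome_dereplication:
  ANI: 0.95     # species-level clustering threshold (default 95% identity)
  overlap: 0.6  # assumed key: minimum genome overlap used in the comparison
```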

For your info: I will also soon add a new binning approach (#365).

slambrechts commented 3 years ago

Hi @SilasK

Thank you very much for your comment. I have yet to try atlas, so I don't have any stats on the individual assemblies at the moment (still trying to figure out how to use atlas on our HPC without violating its rules), but the co-assembly I have is indeed fragmented (#375).

The PCoA was created using 16S amplicon data generated from the same DNA extracts as the metagenomic data. The roughly 3 groups I'm talking about are the brown cluster on the left, the blue-green cluster at the top, and the mixed cluster in the bottom right corner:

(PCoA ordination plot: ordination_naked_2)

Looking forward to trying the different BinGroup sizes and comparing how many bins we end up with for each approach after dereplication. I will do as you suggested and run `atlas run binning` and `atlas run genomes` each time.

SilasK commented 3 years ago

You know about the cluster profile and the section in the docs about how to run atlas on a cluster, don't you?
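For reference, with the profile installed, the invocation looks roughly like this (assuming the profile was installed under the name `cluster`, as in the docs):

```sh
atlas run all -w /path/to/workingdir -c config.yaml --profile cluster
```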

slambrechts commented 3 years ago

@SilasK yes I do, and I would really like to use it that way, but I'm afraid we are not allowed to (I'm also speaking to the sysadmin). We have a high-memory cluster with 16 nodes (738 GB & 36 threads each) and a cluster with 128 nodes (250 GB & 96 threads each) (Ghent University HPC in Belgium). The maximum runtime on both is 72 hours. So that should be sufficient to use atlas in single-machine mode, I guess?
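A minimal sketch of such a single-machine submission, assuming a SLURM scheduler (adapt the directives to your site's scheduler; `--jobs` is assumed to be passed through to Snakemake as the core count):

```sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=36
#SBATCH --mem=700G
#SBATCH --time=72:00:00

# Run the whole pipeline in single-machine mode on one high-memory node.
atlas run all -w /path/to/workingdir -c config.yaml --jobs 36
```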

SilasK commented 3 years ago

You only have those two clusters? Or are there others with less memory?

slambrechts commented 3 years ago

@SilasK there are 3 other clusters, ranging from 88 to 177 GB of memory per node

SilasK commented 3 years ago

Ok. In my case I have many more medium-memory nodes, and I'm limited to 12 h for most of them. The advantage of using the cluster wrapper is that you can use more than one node: the medium-memory nodes for the normal jobs and the high-memory nodes for the assembly. Other users can still run jobs between the atlas steps.
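That routing is what the profile's cluster configuration expresses; a sketch in the generic Snakemake cluster-config style, with hypothetical queue and rule names (the real rule names and file layout depend on your Atlas version and profile):

```yaml
__default__:    # most jobs run on the medium-memory nodes
  queue: medium
  time: "12:00:00"
run_megahit:    # hypothetical name for the assembly rule; send it to high memory
  queue: highmem
  time: "72:00:00"
```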

In any case, I suggest testing different assembly and normalization variants on a subset of samples, e.g. 3. It seems it would be best if I added the normalization step back to the atlas pipeline. But even then, it would be good to know which parameters to use for the normalization.
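A sketch of setting up such a subset test, assuming `atlas init` accepts a directory of fastq files and the usual `-w`/`--db-dir` options (all paths hypothetical):

```sh
# Set up a separate working directory with 3 representative samples
# and run them through QC and assembly to compare parameter choices.
mkdir -p subset_test/reads
cp reads/soil01_R?.fastq.gz reads/soil02_R?.fastq.gz reads/soil03_R?.fastq.gz subset_test/reads/
atlas init -w subset_test --db-dir /path/to/atlas_databases subset_test/reads
atlas run assembly -w subset_test -c subset_test/config.yaml
```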