metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

Question about co-binning behavior #688

Closed. mladen5000 closed this issue 9 months ago

mladen5000 commented 1 year ago

Congratulations on 2.18 - I have a question regarding co-binning.

  1. Do other binners have co-abundance binning implemented, or just vamb?
  2. Can we still use DASTool (now to aggregate over groups)?
SilasK commented 1 year ago

Thank you for your question. I think the answers are all in the documentation linked from the release page. Please have a look. I am happy to clarify if necessary.

Maybe I didn't write that SemiBin also uses co-binning.

mladen5000 commented 1 year ago

I read the documentation; however, I wasn't sure about the following line.

Although it’s not recommended, it’s feasible to use DASTool and feed it inputs from metabat and other co-abundance-based binners.

I realize it's under the section for single-sample binning, but I also saw that vamb is now the default binner (instead of DASTool or SemiBin), so I wasn't sure if this was partially implemented, or vamb-specific, or whether I could run DASTool (with metabat performing single-sample bins, for example). But I see now that if I try to use all methods under DASTool, I get a warning that metabat will perform cross-binning.
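For context, outside of atlas the aggregation step boils down to one DAS_Tool call per sample over the contig-to-bin tables of the individual binners. A minimal sketch, with hypothetical paths and table names:

```python
# Minimal sketch: aggregate bins from several binners for one sample with
# DAS_Tool. All paths and table names below are hypothetical placeholders.
import subprocess

sample = "Sample1"
tables = {  # contig-to-bin tables exported from each binner
    "metabat": f"{sample}/metabat_contigs2bins.tsv",
    "SemiBin": f"{sample}/semibin_contigs2bins.tsv",
    "vamb": f"{sample}/vamb_contigs2bins.tsv",
}

subprocess.run(
    [
        "DAS_Tool",
        "-i", ",".join(tables.values()),   # comma-separated contig2bin tables
        "-l", ",".join(tables.keys()),     # matching labels, same order
        "-c", f"{sample}/contigs.fasta",   # the assembly the bins came from
        "-o", f"{sample}/DASTool/output",  # output prefix
        "-t", "8",                         # threads
    ],
    check=True,
)
```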

As a side question, what are your thoughts on co-assembly for optimizing bin recovery? I was considering generating fastq files that are concatenations of all samples within the same group and manually adding these to the samples.tsv.

SilasK commented 1 year ago

OK, you managed to do the co-abundance binning with all the binners? There was a warning for metabat? I should probably turn that into an info message.

Personally, I am not very happy with "ensemble" approaches, e.g. trying all binners and making a mixture of them. I would rather try different binners and then choose the best one.

But you seem to have trouble finding enough bins, don't you? How many samples do you have?

Co-assembly

I observe that the field has moved from co-assembly toward recommending single-sample assembly, in order to reduce strain heterogeneity. There are cases where you expect the same strains to be shared, e.g. mouse gut or longitudinal samples. In such cases co-assembly would make sense.

However, it would be quite a drastic change to implement, and I am not sure if it is worth it. Also, metaspades doesn't really support it, if I am not mistaken.

If your goal is to recover low-abundance species, it might make sense to take the reads that don't map to the assembly and co-assemble them. What do you think?
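A rough sketch of that idea, assuming pysam and a BAM of reads mapped back to the assembly (the path is a placeholder): fully unmapped pairs are written out so they can be pooled across samples.

```python
# Rough sketch: collect read pairs where neither mate mapped to the assembly,
# so they can be pooled across samples and co-assembled. The BAM path is a
# placeholder; re-pair/sort the output if the BAM is not name-sorted.
import pysam

def write_unmapped_pairs(bam_path: str, r1_path: str, r2_path: str) -> None:
    with pysam.AlignmentFile(bam_path, "rb") as bam, \
            open(r1_path, "w") as r1, open(r2_path, "w") as r2:
        for read in bam.fetch(until_eof=True):  # until_eof works without an index
            if read.is_secondary or read.is_supplementary:
                continue
            if read.is_unmapped and read.mate_is_unmapped:
                qual = pysam.qualities_to_qualitystring(read.query_qualities)
                out = r1 if read.is_read1 else r2
                out.write(f"@{read.query_name}\n{read.query_sequence}\n+\n{qual}\n")

write_unmapped_pairs("Sample1/alignment.bam", "unmapped_R1.fastq", "unmapped_R2.fastq")
```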

mladen5000 commented 1 year ago

Binning

I did co-binning with DASTool as the primary binner, drawing from maxbin + SemiBin + vamb. When metabat was selected, I got a warning that it would perform cross-binning, which would require over 3000 mappings. I have about 140 samples but can recover about 20 bins at most. This is likely due to the limited sequencing depth (mouse gut microbiome samples, under 500 Mb each). Luckily, the genomes that are recovered are the same ones of primary interest found by traditional taxonomic abundance methods (MetaPhlAn/Kraken2).

Co-assembly

I agree it would be tricky to implement, and now that co-binning is functional it doesn't seem totally necessary. What I was considering is generating concatenated fastq files for each group and entering those manually into the sample list. This way I would have a Group1_R1.fastq.gz and Group1_R2.fastq.gz, which in my situation would aggregate reads from several different mice under the same conditions. It presents some risk of sample mixing, but might provide the needed resolution.
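A minimal sketch of that concatenation, with hypothetical sample names and paths; gzip members concatenate byte-for-byte, so no decompression is needed:

```python
# Minimal sketch of per-group read pooling: gzip members concatenate
# byte-for-byte, so the merged file stays a valid fastq.gz. Group
# membership and file names are hypothetical.
import shutil

groups = {"Group1": ["mouse_a", "mouse_b", "mouse_c"]}

for group, samples in groups.items():
    for mate in ("R1", "R2"):
        with open(f"{group}_{mate}.fastq.gz", "wb") as merged:
            for sample in samples:
                with open(f"{sample}_{mate}.fastq.gz", "rb") as part:
                    shutil.copyfileobj(part, merged)
# Group1_R1.fastq.gz / Group1_R2.fastq.gz can then be entered as a
# new row in samples.tsv like any other sample.
```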

As for unmapped reads (for some reason I only have about half the reads mapping back to the assembly), I think this is also feasible; however, it might present a higher risk of chimeric genomes than the within-group approach.

Regarding strain heterogeneity, I figured that the 95% threshold clustering with dRep would simply group strains into a single species-level genome bin regardless of prior steps.
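As a toy illustration of that intuition (not dRep itself, and with made-up ANI values): any genomes linked above the 95% threshold collapse into one cluster, so strains of the same species end up behind a single representative.

```python
# Toy illustration (not dRep) of why 95% ANI clustering collapses strains:
# single-linkage clustering over made-up pairwise ANI values.
ani = {
    ("strainA", "strainB"): 0.98,
    ("strainA", "strainC"): 0.97,
    ("strainB", "strainC"): 0.99,
    ("strainA", "other_sp"): 0.80,
    ("strainB", "other_sp"): 0.81,
    ("strainC", "other_sp"): 0.79,
}

def cluster(genomes, threshold=0.95):
    clusters = []
    for g in genomes:
        for c in clusters:
            # single linkage: join if any member is above the threshold
            if any(ani.get((g, m), ani.get((m, g), 0)) >= threshold for m in c):
                c.append(g)
                break
        else:
            clusters.append([g])
    return clusters

print(cluster(["strainA", "strainB", "strainC", "other_sp"]))
# -> [['strainA', 'strainB', 'strainC'], ['other_sp']]
```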

mladen5000 commented 1 year ago

For SemiBin, I see that each sample is run against the set of all BAM files within a single group. But when running vamb, it will only run once per group, since it looks at the combined set of contigs.

From what I understand, SemiBin can operate in the same way, reducing the number of times it is run and providing it with more data.
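If I read the SemiBin docs correctly, its multi-sample mode takes the concatenated contigs of a group (headers prefixed with the sample name) plus all BAMs in one run. A hedged sketch with hypothetical paths; flag spellings should be double-checked against the installed version:

```python
# Hedged sketch of a once-per-group SemiBin run in multi-sample mode:
# contigs of all samples in the group are concatenated with a sample
# prefix in the header (e.g. Sample1:contig_1) and all BAMs are passed
# together. Paths are hypothetical; verify flags with `SemiBin multi_easy_bin -h`.
import glob
import subprocess

group = "Group1"
subprocess.run(
    [
        "SemiBin", "multi_easy_bin",
        "--input-fasta", f"{group}/combined_contigs.fasta",
        "--input-bam", *sorted(glob.glob(f"{group}/bams/*.bam")),
        "--output", f"{group}/semibin_multi",
        "--separator", ":",  # separates sample name from contig name in headers
    ],
    check=True,
)
```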

SilasK commented 1 year ago

For SemiBin, I tried to follow the same approach as for VAMB, albeit using the internal commands to optimize the distribution of jobs with snakemake. But maybe there is a way to make this more efficient by parsing the abundance once per group. Is there?

SilasK commented 1 year ago

I understand that you are not happy with 20 bins. Or are you confusing the total number of bins with the total number of dereplicated MAGs (species at 95%)? Check the reports/bin_report to see the total number of bins. I would like to know if vamb is much worse than all binners together.

If you do not have enough input data, this might explain why you don't get good assemblies. Your approach of merging different samples makes total sense for recovering the genomes. For quantification, you can then always go back to the individual samples.

You might also want to try to use the genomes from my CMMG.

Finally, I am happy to fix bugs and have discussions about how to use atlas. Additionally, I'd like to offer the option of a focused consultation session, which could greatly accelerate progress. I believe hiring me for 1-2 hours of consulting would be more effective than trying out everything yourself.

This consultation would enable us to delve into specific topics and challenges, resulting in targeted solutions for your project. In particular, I can assist you in exploring the sub-species of your species of interest. I suggest discussing this possibility with your supervisor. Looking forward to your thoughts on this proposal.

github-actions[bot] commented 10 months ago

There has been no activity for some time. I hope your issue has been solved in the meantime. This issue will automatically close soon if no further activity occurs.

Thank you for your contributions.