Open LeeBergstrand opened 7 months ago
@LeeBergstrand For now, I'd suggest leaving genome binning out of the MVP. I think the pipeline should already be compatible with metagenomes up until the end of the circularization step (i.e., before annotation), although I'd need to double check this just to make sure. This level of metagenome compatibility might be enough for the MVP -- users can use rotary to assemble metagenomes with properly closed circular contigs, and then they can handle genome binning themselves. Once the MVP is out, we could consider a meta-mode for rotary as an extension. How does this sound?
P.S. The current config file already has a way to turn meta mode on or off for Flye, so that aspect is already addressed. Meta mode is sometimes helpful for genome assemblies (e.g., if you're not sure if the culture is pure... I wonder if it might also help with assembling differentially abundant plasmids).
@jmtsuji This sounds good to me. To me, it's a low priority at this time.
Here are some things to think about down the road:
@LeeBergstrand Good points. My guess is that existing genome binners (e.g., MetaBAT2) should work fine with Illumina, Nanopore, or hybrid data. MetaBAT2 just uses coverage info of the contigs (obtained from BAM files) and the contig sequences themselves to guide genome binning, in my understanding. So long as read mapping is accurate and the contigs are error-free, I think genome binning from a mix of different read types should be OK. It would be worthwhile to check this carefully later on, though.
@jmtsuji This is becoming more and more of an issue for me. We are finding out that more and more of the genomes we are processing are actually co-cultures even though they are originally thought to be single strain.
@LeeBergstrand Thanks for picking up this thread again. Yeah, it sounds like adding some basic genome binning could be helpful even for "pure culture" genome work.
We would probably just need some basic binning rules for rotary -- for example, map the reads to the assembled contigs (within the same sample), then just run 1 genome binner and split out the contigs. Then, the annotation module could be run on each bin separately. This might be pretty simple to implement. (Later on, we could always consider adding more genome binners and aggregating their results to improve binning accuracy, but I am not sure if this would improve things much given that the cultures should generally have a pretty simple microbial community.)
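As a rough sketch of the "split out the contigs" step: assuming the binner's output has already been reduced to a contig-to-bin mapping, the grouping itself is only a few lines. The function names, the toy FASTA, and the "unbinned" label are all hypothetical and not part of rotary.

```python
from collections import defaultdict


def parse_fasta(text):
    """Parse FASTA text into a dict of contig ID -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The contig ID is the first whitespace-delimited token of the header.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def split_contigs_by_bin(contigs, contig_to_bin, unbinned_label="unbinned"):
    """Group contigs by bin; contigs without an assignment go to `unbinned_label`."""
    bins = defaultdict(dict)
    for contig_id, sequence in contigs.items():
        bin_id = contig_to_bin.get(contig_id, unbinned_label)
        bins[bin_id][contig_id] = sequence
    return dict(bins)


# Toy data; in practice the mapping would come from the genome binner's output.
assembly = parse_fasta(
    ">contig_1 circular=true\nACGT\nACGT\n>contig_2\nGGCC\n>contig_3\nTTAA\n"
)
assignments = {"contig_1": "bin_1", "contig_2": "bin_2"}
bins = split_contigs_by_bin(assembly, assignments)
```

Each entry of `bins` could then be written to its own FASTA file and handed to the annotation module separately. Keeping unassigned contigs in their own group means nothing is silently dropped.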
One potential issue we would need to address is how to handle binning of true isolates. The last time I tested binning tools carefully (a few years ago), they generally errored out if they could not produce at least 2 bins. We should see if this is still the case. If so, then we would need some strategy (e.g., based on CheckM2 scores) to figure out if the raw contigs are likely from a single isolate and then skip binning if that is the case.
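A minimal sketch of that skip-binning check, assuming the only input is a contamination percentage of the kind CheckM2 reports for the unbinned assembly. The 5% cutoff and the function names are placeholders for illustration, not tested recommendations.

```python
def looks_like_single_isolate(contamination, max_contamination=5.0):
    """Return True if the raw contigs look like one genome.

    `contamination` is a percentage for the whole, unbinned assembly.
    The default cutoff is a hypothetical placeholder.
    """
    return contamination <= max_contamination


def choose_binning_step(contamination, max_contamination=5.0):
    """Decide whether the pipeline should run the binner or skip it."""
    if looks_like_single_isolate(contamination, max_contamination):
        return "skip_binning"
    return "run_binning"


# A clean isolate versus a suspected co-culture.
decision_isolate = choose_binning_step(0.4)
decision_coculture = choose_binning_step(45.0)
```

The idea is that a co-culture evaluated as a single genome should show inflated contamination, so the binner is only run when there is some evidence of more than one genome. That also sidesteps the case where the binner errors out on a true isolate.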
Also, we could consider changing the default Flye mode to --meta in the config file. My guess is that this might make some assemblies of true isolates worse in a few edge cases, but if the input data quality is good, it would have limited impact on isolate assemblies. Based on a quick look at the methods of the metaFlye paper, I assume the way that repeats in the assembly graph are identified in metaFlye should still work for isolates, but it might be more prone to errors than the algorithm used in the original Flye. I don't have any real evidence, though. I have seen some discussion on X that some folks prefer to use metaFlye by default. The alternative would be to try to predict if a dataset is pure or not before assembly and then choose the Flye mode based on that, but this approach might be too complicated.
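To make the config switch concrete, here is a sketch of how a meta-mode setting could map onto the Flye command line. The config keys (`flye_meta_mode`, `flye_read_type`) and the function name are assumptions for illustration and are not rotary's actual config keys; the Flye options themselves (`--nano-hq`, `--out-dir`, `--threads`, `--meta`) are real.

```python
def build_flye_command(reads, out_dir, config):
    """Build a Flye command as an argument list from a config dict.

    The config keys used here are hypothetical, not rotary's real keys.
    """
    command = [
        "flye",
        config.get("flye_read_type", "--nano-hq"),
        reads,
        "--out-dir",
        out_dir,
        "--threads",
        str(config.get("threads", 4)),
    ]
    # Only add --meta when the config asks for metagenome mode.
    if config.get("flye_meta_mode", False):
        command.append("--meta")
    return command


isolate_cmd = build_flye_command("reads.fastq.gz", "assembly", {"threads": 8})
meta_cmd = build_flye_command(
    "reads.fastq.gz", "assembly", {"threads": 8, "flye_meta_mode": True}
)
```

Changing the default would then amount to flipping the fallback value of the meta-mode key, while users with known pure cultures could still turn it off.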
@LeeBergstrand Any thoughts?
@jmtsuji, Questions:
Where would the optimal place to put binning be?
Right now, a vital issue is that Rotary needs to understand the concept of sub-samples (bins). We use the following design pattern throughout Rotary:
rule annotation:
    input:
        summaries=expand("{sample}/{sample}_annotation_summary.zip", sample=SAMPLE_NAMES),
In this pattern, we frequently use the SAMPLE_NAMES variable. However, this will not work when there are bins.
Fixing this is going to require significant refactoring.
I suggest waiting until we refactor things into pipeline-independent modules before pursuing binning. That way, you can call the annotation module on the bins or the single genomes.
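To illustrate the sub-sample problem in plain Python (outside Snakemake), here is a sketch of how the list of annotation targets would have to change once some samples have bins. The per-bin path pattern, the `bins_by_sample` mapping, and the function name are hypothetical.

```python
SAMPLE_PATTERN = "{sample}/{sample}_annotation_summary.zip"
# Hypothetical per-bin layout; the real directory structure is undecided.
BIN_PATTERN = "{sample}/{bin}/{sample}_{bin}_annotation_summary.zip"


def annotation_targets(sample_names, bins_by_sample=None):
    """List expected annotation outputs, one per sample or one per bin."""
    bins_by_sample = bins_by_sample or {}
    targets = []
    for sample in sample_names:
        bin_ids = bins_by_sample.get(sample)
        if not bin_ids:
            # No bins: same target the current SAMPLE_NAMES pattern produces.
            targets.append(SAMPLE_PATTERN.format(sample=sample))
        else:
            for bin_id in bin_ids:
                targets.append(BIN_PATTERN.format(sample=sample, bin=bin_id))
    return targets


SAMPLE_NAMES = ["isolate_A", "coculture_B"]
targets = annotation_targets(SAMPLE_NAMES, {"coculture_B": ["bin_1", "bin_2"]})
```

The catch in a real workflow is that the bin list is only known after the binner has run, so the targets cannot be computed up front from SAMPLE_NAMES alone. In Snakemake that would likely mean something like a checkpoint, which is part of why this needs refactoring.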
@jmtsuji, Questions:
- How would polishing affect binning? Do you want to bin before or after polishing?
- How would a mixed metagenome affect our circularization code? Would you like to bin before circularization?
Where would the optimal place to put binning be?
Another option is for Rotary to have a meta-mode, but with the work done in two steps. You run Rotary in normal mode, and we give you a list of genomes that are contaminated according to CheckM. Then you take those samples and manually do a second run with them in meta-mode. The meta-mode flag in the config would turn Flye's meta mode and binning on or off together.
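A sketch of that two-step flow, assuming the first pass yields a per-sample contamination percentage. The config keys (`meta_mode`, `samples`), the 5% cutoff, and the function names are all hypothetical.

```python
def samples_for_meta_rerun(contamination_by_sample, max_contamination=5.0):
    """Return the samples whose first-pass contamination exceeds the cutoff."""
    return sorted(
        sample
        for sample, contamination in contamination_by_sample.items()
        if contamination > max_contamination
    )


def second_run_config(base_config, flagged_samples):
    """Copy the config, restrict it to flagged samples, and enable meta-mode."""
    config = dict(base_config)
    config["samples"] = list(flagged_samples)
    # One flag that would switch on both Flye's meta mode and binning.
    config["meta_mode"] = True
    return config


# Toy first-pass results: sample -> contamination (%).
first_pass = {"isolate_A": 0.3, "isolate_B": 38.2, "isolate_C": 12.0}
flagged = samples_for_meta_rerun(first_pass)
rerun = second_run_config(
    {"meta_mode": False, "samples": sorted(first_pass)}, flagged
)
```

This keeps the normal run free of bin wildcards and confines the extra complexity to the second, opt-in run.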
It really depends on where binning happens. The later in the pipeline binning occurs, the easier it will be to add the bin wildcards. There are also some modularization tools that might help here too.
We have previously discussed adding metagenomics compatibility by running Flye in meta mode and doing genome binning.