merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[DISCUSSION] When to refine a metagenomic bin? #1908

Closed · Laura-Alex closed this 2 years ago

Laura-Alex commented 2 years ago

First, I'd like to thank you for previous help in getting anvi'o working.

However, now that I can use anvi-refine to improve bins, I realized that I am a bit stuck on the deeper issue of when exactly I should be doing that. Is there a risk of losing accessory genes? Where is the right balance between correctness and gene richness? I apologize if this question might be really basic, I am quite new to the field.

My original approach: use CheckM to assess completion/redundancy (it uses more marker genes for my phylum of interest than anvi'o does), then use anvi'o to refine all bins with >5% redundancy down to ideally <5%, or at least <10%, and at minimum take a quick look at bins with 4-5% redundancy for outlier branches with different coverage.
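
Concretely, that triage looks roughly like the sketch below (a sketch only; the paths, collection name, and bin name are placeholders from my own setup):

```bash
# Assess completion / redundancy of all bin FASTAs with CheckM
# (BINS/ holds one FASTA per bin; -x sets the file extension)
checkm lineage_wf -x fa BINS/ CHECKM_OUT/

# For any bin above ~5% redundancy, open it interactively and remove
# outlier branches until redundancy is ideally <5% (or at least <10%)
anvi-refine -p PROFILE.db \
            -c CONTIGS.db \
            -C my_collection \
            -b Bin_07
```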

My current doubts: I noticed that, occasionally, the divergent branches with unequal coverage (which the anvi'o tutorials recommend removing as likely mis-binned) don't actually have much of an impact on the completeness/contamination values. A few BLAST searches suggest that at least some of the genes they contain might reasonably belong to the organism in question (although I can only really tell this for the phylum I have more experience with), but these are also rare genes, which might be found in only a subset of species within an order.

So I get the feeling that I'm losing these auxiliary genes, which could conceivably be present in only part of the population (explaining lower coverage) or on plasmids (explaining higher coverage). On the one hand, even assuming the taxonomic assignment is correct, it's unlikely all those genes would be present in the same organism, so a bin containing all of them might not reflect nature very well. On the other hand, would a bin without any of these extra genes be any better?

I am interested to find out how other people approach this!

tdelmont commented 2 years ago

Dear Laura,

It is great that you were able to use anvi'o to refine MAGs. Your questions make a lot of sense. Hypervariable regions and other genomic traits not shared equally by all members of a population are both highly interesting and a limitation when it comes to genome-resolved metagenomics using short reads.

In my view, one major yet overlooked issue with MAGs is the extent of silent contaminations: contigs that have no impact on completion and especially redundancy (they carry no single-copy core genes), yet do not belong to the focal environmental genome / MAG. Anvi'o allows identification of some of these issues that CheckM, for instance, would fail to recover, thanks in part to the mapping results and the capabilities of the interface. Examples of silent contaminations in public databases are numerous and include, for instance, contigs from eukaryotic viruses in marine bacterial MAGs that have made their way into the BLAST databases of NCBI...

This curation effort might depend on the researcher's interests. Would one rather increase the chances of keeping auxiliary genes while increasing contamination risks, or remove both auxiliary genes and putative contaminations? If one were to favor the former, then one major issue is that, down the line, any original functional insight could be dismissed as a likely contaminant... I personally would not want to be trapped in that rather uncomfortable situation. This is why various users and developers recommend removing all the outliers (including those that do not impact completion or redundancy statistics) using differential coverage alone, then sequence composition alone, during the phase of manual curation with anvi'o.

In addition, I realize this might be viewed as a substantial amount of work by most, but because of these silent contaminations it is best, in my view, to visualize all MAGs regardless of how good they look through the lens of single-copy core genes... You might see some surprises among these so-called "high-quality MAGs". The extent of such problems is very much project dependent, though.

Best regards

Tom Delmont

meren commented 2 years ago

Dear @Laura-Alex,

Thanks for starting this discussion. And thank you, @tdelmont, for your response. I also would like to add my 2 cents to the thread.

I apologize if this question might be really basic, I am quite new to the field.

Only those who are new to the field can ask such important questions. No need to apologize at all. We should apologize for not having very clear answers to them already.

Contamination is an important risk (since it can impact the placement of a genome in a tree, the accurate determination of its core/accessory genes in a pangenome, our assessment of its functional and/or metabolic potential, etc.), but refining a genome bin properly is a difficult skill to gain, as it requires both labor and expertise. Not doing it is bad, and doing it too conservatively is also bad, since both have biological consequences. Perhaps that's why most of us choose to outsource the problem and the resolution of this complexity to automatic binning algorithms, with no further attention to their suggestions. Yet there are many questions for which this is probably the worst of all options.

Sole reliance on single-copy core genes is a big risk. We do it because it is doable (hopefully more algorithms will emerge, but I would rather hold my breath for better long-read sequencing technologies). Tom's point on "silent contamination" is a very, very important one. For instance, in this published genome the green part and the green + orange part have the same contamination and redundancy based on SCGs (from this):

[figure: anvi'o display of the published genome, with the green selection and the green + orange selection highlighted]

So that is a silent contamination which, in this case, makes this plaque-specific population look like a cosmopolitan one that also occurs in tongue metagenomes in low abundance (since the silent contamination recruits a ton of reads from tongue metagenomes). So anyone who wishes to study the genetic determinants of such tropism would exclude this guy (or include it in their study to understand what makes cosmopolitan populations cosmopolitan in the human oral cavity, etc.).

Perhaps an even scarier form of contamination is "complementary contamination", where mixtures of single-copy core genes from distinct taxa come together in such a complementary way that the final genome looks like a good one based on C/R estimates, but in the end results in 'novel' branches in phylogenomic analyses :) When you generate thousands and thousands of genomes from single assemblies and use SCGs alone to filter out the bad ones, you can be sure that your database of genomes will include many examples of the worst of the worst bins.

These problems are going to get resolved over time, especially with long-read sequencing and improvements in technologies that yield SAGs. But if you are working with short reads coming from complex metagenomes and your intention is to do genome-resolved metagenomics, the answer to the question "when should one do MAG refinement" is "it depends on how important the quality of your MAGs is to your downstream research". It is one thing to work on 1,000+ genomes in a study where no single genome has any particular importance, and it is another thing to go after a hypothesis through genome-resolved metagenomics. For the latter, I think a round of refinement, or at least a careful look at all your contigs in the context of all your metagenomes, is a must. For the former, refinement is not that important (or possible), but I'd say "keep your unchecked MAGs out of public databases, please".

Is there a risk of losing accessory genes? Where is the right balance between correctness and gene richness?

The right balance is to not have any contamination. If your MAGs are too fragmented, and the metagenomes available to you or the contextual information regarding the contigs (their hits on NCBI nr) are not sufficient to determine whether something is a contamination, then it is better to be conservative. You may very well be losing accessory genes, but if you can't confidently determine their origins, it may be better to lose them in some cases.

Viruses, hypervariable genomic islands, contigs containing rRNA operons, ICEs, plasmids, and other mobile genetic elements will each bring their own challenges when it comes to interpreting metagenomic coverages, but there are ways to make intelligible decisions. I think one develops an understanding over time, and the anvi'o Slack channel includes many veterans in genome refinement 😂

You can always use the program anvi-split to split a critical bin from your project and save it as its own little package, put the resulting contigs-db, profile-db, and auxiliary-db somewhere online, and ask for people's opinions on how to refine it. They can open it on their own computers, take a look at it interactively, and send you collections.
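
Something along these lines (just a sketch; the database paths, collection name, bin name, and output directory are placeholders):

```bash
# Package a single bin, together with its own contigs-db, profile-db,
# and auxiliary-db, into a self-contained directory that others can
# open with anvi-interactive and refine on their own computers
anvi-split -p PROFILE.db \
           -c CONTIGS.db \
           -C my_collection \
           -b Bin_07 \
           -o SPLIT_BINS/
```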

But refinement gives more to those who do it. When I think about the new insights I have gained into the nature of metagenomic data, the diversity of genomic architectures, and microbial pangenomes and their ecology, I cannot overemphasize the importance of genome refinement.

My 2 cents.

Meren.

Laura-Alex commented 2 years ago

Thank you so much, @tdelmont, @meren, for taking the time to share your insights.

It's a very good point to check all MAGs of interest regardless of values! I think my project is of a scale where that's still doable.

(I also find it hilarious when software surprises me. I had briefly considered whether something like complementary contamination could happen, but dismissed it as too ridiculous...my mistake!)

The right balance is to not have any contamination.

I guess my difficulty is understanding exactly what is and isn't contamination. The goal is (probably) not reaching 0% redundancy, because, from my understanding, some copy-number variation of genes exists within genuine genomes. In that case, when do I stop removing branches?

For example, in the work you showed, it's straightforward enough that the orange branch represents contamination. But what about the tiny branches that I circled in red? They have differing coverages and/or GC content. Should they be removed, even if they don't affect completion? But if those are removed, other branches might appear as the outliers in turn... Is it fine to just check uncertain sequences with BLAST?

[figure: the same display, with small outlier branches circled in red]

I'm concerned that if I only keep a 'core' genome for the MAGs, then pangenomic workflows including both MAGs and genomes might be biased in a way that is not meaningful (e.g. showing that the strains with sequenced genomes have more metabolic flexibility).

You're right that this is something that would be helped with more experience. Thank you for directing me towards the slack channel!

meren commented 2 years ago

I guess my difficulty is understanding exactly what is and isn't contamination. The goal is (probably) not reaching 0% redundancy, because, from my understanding, some copy-number variation of genes exists within genuine genomes.

Yes, reaching 0% redundancy should never be the target. I can think of four reasons for that on the spot based on my observations:

In that case, when do I stop removing branches?

Yes, when do we stop removing branches?

Completion/redundancy estimates should never be the sole driver for removing branches during a refinement effort. If the solution were that simple, we could write a program that removes contigs from a MAG until its redundancy reaches 0% (as silly as that sounds as a solution, bioinformaticians did consider it as an option, which tells you who not to trust if you are a life scientist :p).

Differential coverage and tetranucleotide frequency (together and individually) are very powerful predictors of garbage when they are used 'effectively'. Their effective use requires an understanding of dendrograms (the default way anvi'o reports associations between contigs) and what they mean. Of course one can easily remove clusters of contigs that are attached to the others by deep branches, since they indicate completely distinctive behavior as far as differential coverage goes (like the orange set in the figure above), but there will always be outliers in the remaining branches (since what is an outlier and which branches are 'deep' is actually a function of scale in dendrograms). Indeed, if you were to remove the orange set from your display and re-plot the dendrogram, clearly the next outlier would be your first red circle, which would look as distinct as the orange one did in the previous display. If the solution were to remove every outlier, then after a few iterations we would end up with a single contig from every MAG, since there will always be natural variation in both coverage and sequence composition.

The solution is to start thinking creatively and relying on multiple orthogonal resources to make final calls. These orthogonal approaches include searching contigs against sequence databases, looking at gene functions and synteny, marking outliers in coverage and redrawing the tree based on sequence composition alone (to see if the 'coverage' outliers continue to be outliers based on sequence composition, or fall together with their comrades of similar evolutionary paths), including even more metagenomes to see if the risky contigs ever occur in metagenomes where the rest of the population is not detected, and so on. That is where the true labor and expertise come into this process, through which one can confidently keep hypervariable genomic islands, plasmids, or viruses of a population together with the MAG.
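
For instance, sanity-checking a single suspicious contig against NCBI can be as simple as the sketch below (the FASTA, contig name, and output file are placeholders, and a local `nt` database would be much faster than `-remote`):

```bash
# Pull the suspect contig out of the bin's FASTA
# (e.g., from an anvi-summarize output directory)
samtools faidx Bin_07-contigs.fa c_000000001234 > suspect_contig.fa

# Ask NCBI what it looks like: best hits scattered across unrelated taxa
# (or hits to phages / eukaryotic viruses) are a red flag, while hits that
# agree with the rest of the MAG support keeping the contig
blastn -query suspect_contig.fa -db nt -remote \
       -outfmt "6 qseqid sseqid pident length evalue stitle" \
       -max_target_seqs 10 > suspect_contig_hits.tsv
```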

But this is indeed a very difficult process to bootstrap. If you do not yet have the expertise to ask these questions and interpret the answers you find, how will you make the final decisions accurately? Well, you will make mistakes, sometimes by including too much, other times by removing too much, and so on. But in the vast majority of cases you will still be much better off than a dumb automatic binning algorithm, because you will get to say which MAGs you were unsure about, so you can always go back to them if they turn out to be the crown jewels of your study.


The understandable fear of making mistakes when we are sitting in the driver's seat often pushes us away from manual work in data-intensive biology and makes it appealing to outsource everything to computers... which leaves the steering wheel to a blindfolded algorithm written by someone who never worked with our data. I'm not trivializing the difficulty of making the right decisions during refinement, but abandoning the power to make them altogether sounds like the opposite of what we need in metagenomics today.

Laura-Alex commented 2 years ago

Dear @meren,

Those are some very insightful points (down to the fear of making mistakes). This has been a very valuable discussion for me, and I can only hope it will help others who read it as well : )

Thank you!