[BUG] Different results when running samples together vs separate

Short description of the problem

I end up with different results per sample when I run co-assembly on all my samples at once vs going through the anvio pipeline for subsets of my files.

anvi'o version

Anvi'o .......................................: marie (v8) Python .......................................: 3.10.14

Profile database .............................: 38 Contigs database .............................: 21 Pan database .................................: 16 Genome data storage ..........................: 7 Auxiliary data storage .......................: 2 Structure database ...........................: 2 Metabolic modules database ...................: 4 tRNA-seq database ............................: 2

anvi-self-test --version

System info

Anvi'o is installed both locally and through our institution's research computing cluster (RCC). The RCC system is Linux/Unix I believe, and my computer is a Mac with an M3 silicon chip.

Detailed description of the issue

I have 2 sample groups that I'm looking at (n=10 for 'Group 1' and n=14 for 'Group 2'), and I decided to run anvio both with the normal approach (co-assembly of all samples) and by separating them into their respective groups. When I co-assemble all the samples at once and then run downstream analyses on my contigs.fa file, such as clustering the contigs into MAGs, I get one group of MAGs. However, new taxa show up when I run the groups separately that weren't present during the combined analysis.

For example, when running everything combined, I found Turicibacter sanguinis and Bilophila wadsworthia to be present in samples from both groups. However, when I only run 'Group 1' reads through the anvi'o pipeline, 2 species of Turicibacter and 0 Bilophila species show up. And when I only run 'Group 2' reads through the anvi'o pipeline, no Turicibacter is present and Bilophila (not wadsworthia) shows up.

The only difference between the ways anvi'o is being run is that the combined analysis includes all 24 samples (48 paired end read files), whereas the Group-specific analyses have 10 and 14 samples. Otherwise, I'm using the exact same code and workflow for all steps, so shouldn't there not be as much of a discrepancy between co-assembly and separated assembly? Am I just completely missing something, or why might this be the case?

We would need more information about the exact set of anvi'o commands you are running in each case to help you figure this out. But generally, I can think of both a biological explanation and a technical explanation for the results you are seeing.

The key question is, how are you estimating taxonomy in your metagenomic assemblies? Are you running anvi-estimate-scg-taxonomy in metagenome mode? If so, a possible technical explanation is that different single-copy core genes are being used for the taxonomic estimation. In metagenome mode, anvi-estimate-scg-taxonomy uses the most frequent SCG found in the assembly, and matches it to the SCGs of genomes in the Genome Taxonomy Database in order to figure out the closest matching taxa. The different co-assemblies might have different distributions of SCGs, and the SCG selected for taxonomy estimation could therefore be matching to slightly different taxa in GTDB. But of course this explanation only applies if this is how you estimated taxonomy of your MAGs.

If you did something different for estimating the taxonomy, it's still not that surprising to see variation (especially at the level of species) because different co-assemblies can put together different versions of population genomes (which, you might remember, are composite genomes made by combining all the sequence variations at the sub-population level. so different input samples => different distribution of sequence variants => different 'consensus' population genome). And since most taxonomy estimation methods rely on matching sequences to a reference database, the closest match in the reference database can differ depending on the 'version' of the population genome you got out of each co-assembly, especially if your population is novel. For instance, this could explain why you get Bilophila wadsworthia in one co-assembly but a different Bilophila in another.

The potential biological explanation that might also apply here has to do with the importance of coverage for assembling/binning microbial populations. If you don't have enough coverage of a given genome, it's hard to assemble it completely. For instance, if Bilophila populations are in very low abundance in your Group 1 samples, perhaps the Group 1 co-assembly simply didn't have enough reads to assemble it into contigs of significant length. But Bilophila is present in the Group 2 samples in enough abundance that the Group 2 assembly could put a significant chunk of its genome together. And in the co-assembly, Bilophila populations (if any) from Group 1 samples would be combined with those from Group 2 to produce a slightly different consensus MAG for Bilophila than you got from the Group 2 samples alone. Your assembled contigs then become your references for read-recruitment, and the differential coverage across each set of input samples will drive the binning process (in addition to the sequence composition signal), which could also lead to some variation across different co-assemblies.

So your taxonomy results depend both on the actual biological distribution of microbes across your samples (biology) and the variation/loss of information and each step of your workflow (technical factors). If you send more info about how you did this analysis, we might be able to provide more specific insight.

(thanks to @FlorianTrigodet who double checked my intuition here)

Thanks for the very detailed answer. I totally understand what you're saying in regards to variation within the dataset. In regards to your biological explanation, there are samples from both groups that have higher than average abundance and coverage values for Bilophila and Turicibacter (when looking at the co-assembled data), so it's still surprising to me that their abundance would be still be too low to assemble into contigs.

If it helps, the steps I'm taking to run my analysis include the following:

Quality control of the raw reads using illumina-utils
Co-assembly with Megahit (either with all 24 samples or divided into the 2 groups)
Mapping using bowtie2 and samtools, and then converting the files to BAMs using anvi-init-bam
Generation of a contigs database
Decorating the contigs database using anvi-run-kegg-kofams, anvi-run-hmms, and anvi-run-ncbi-cogs
Kaiju taxonomy to further decorate the contigs database
Profiling the BAM files
Binning using Metabat2 (I originally used CONCOCT but it generated horribly redundant MAGs, including several with >1000% redundancy)
anvi-run-scg-taxonomy and anvi-estimate-scg-taxonomy
Finally, I run anvi-summarize to determine which bins need refining and to look at the overall data.

Hey @lexikazen , in that case, I would guess that the technical explanation (use of different SCGs for anvi-estimate-taxonomy in metagenome mode) is what is driving the variability here :)

merenlab / anvio