[BUG] 'NumGenomesEstimator' sometimes double-counts some genomes as both Bacteria and Archaea

ivagljiva commented 4 months ago

Detailed description of the issue

We have this handy class NumGenomesEstimator in hmmops.py which estimates the number of bacteria, archaea, protista (and the total of those) in a given contigs db by using the single-copy core gene sets for each of these domains and taking the mode of the number of hits to all SCGs in a given set.

It usually works well, but I recently noticed that there is some overlap between the Bacteria_71 and Archaea_76 SCG sets, such that some archaeal genomes get hits from the Bacteria_71 SCGs (in addition to the archaeal ones) and some bacterial genomes get hits from the Archaea_76 SCGs. The consequence is that sometimes, when Bacteria and Archaea are mixed in the metagenomic community, some of the genomes get double-counted as both Bacteria and Archaea, thereby inflating the total number of populations estimated to be in the metagenome.

I noticed this when running tests on a large set of simulated metagenomes generated from GTDB, in which 20 random genomes were concatenated into one fasta file per sample. There were many samples with just one or two genomes double-counted, but there was one exceptionally bad case where the number of expected genomes was almost double what it should have been. Here I show the table of estimated populations for these synthetic communities with issues (all of them were supposed to have a total of 20):

name	Bacteria	Archaea	Total	true # Bacteria	true # Archaea
SYNTH_META_00000000000000000002	20	1	21.0	19	1
SYNTH_META_00000000000000000004	19	2	21.0	18	2
SYNTH_META_00000000000000000006	19	3	22.0	17	3
SYNTH_META_00000000000000000120	19	19	38.0	18	2

When I spot-checked the worst case, SYNTH_META_00000000000000000120, I saw that several gene calls were annotated with SCGs from two different domains:

$ sqlite3 02_CONTIGS/SYNTH_META_00000000000000000120-contigs.db "select * from hmm_hits where gene_callers_id=5334"
630|Bacteria_71|3a1924dcedd2273aadccd5d785106dbb0386a303fb742c423f4dae33|5334|Ribosomal_S2|PF00318.20|1.8e-78
1928|Archaea_76|3a1924dcedd2273aadccd5d785106dbb0386a303fb742c423f4dae33|5334|Ribosomal_S2|PF00318.20|1.9e-78

Note that this isn't a systematic issue: sometimes there isn't enough overlap in the SCG sequences for double-counting to occur.

The solution

@FlorianTrigodet proposed a solution to this, which is the following:

before computing the mode for each SCG domain set, remove any SCG hit on a gene which also has a hit from a different SCG set. Basically, anytime we see gene calls with SCG annotations from multiple domains, we drop those hits and don't include them in the mode estimates.

I will implement this now and test it using my synthetic metagenomes.

anvi'o version

Reported for v8, but potentially applies to earlier versions as well

ivagljiva commented 4 months ago

Lessons from testing

I've implemented the basic solution described above and have been testing it on SYNTH_META_00000000000000000120, my synthetic metagenome with 18 Bacteria and 2 Archaea. Unfortunately, it doesn't work as expected in this case.

Currently the solution is to get rid of all hits to genes with hits from multiple HMM sources. In this sample, 455 gene calls have hits to multiple SCG domains, which results in the removal of 925 hits (out of 1954 total, or 47%), leaving 1029 hits to be used for estimating the number of populations. Most of the genes that were removed from consideration had hits to both 'Bacteria_71' and 'Archaea_76', but I also counted ~15 gene calls that had hits to all three domains ('Bacteria_71', 'Protista_83', 'Archaea_76'), and 1 gene call that had hits to both 'Protista_83' and 'Archaea_76'.

Unfortunately, so many genes have multiple domain hits in this case that most of the bacterial and archaeal SCG counts are now 0, which means we predict 0 total genomes from each domain:

{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 
 'Bacteria_71': {'num_genomes': 0, 'domain': 'bacteria'}, 
 'Archaea_76': {'num_genomes': 0, 'domain': 'archaea'}}

I also tested this on the other 3 metagenomes in the table above, and got similar results.

For the sample ending in 002: 493 genes affected, 1008 hits out of 2239 were removed, and 0 total genomes predicted
For the sample ending in 004: 500 genes affected, 1021 hits out of 2311 removed, and 0 genomes predicted
For the sample ending in 006: 482 genes affected, 985 hits out of 2181 removed, and 0 genomes predicted

I'm willing to bet that this strategy also ruins the calculation for the metagenomes that had good estimates before. In short, this removal is too drastic, and we need a more nuanced solution.

meren commented 4 months ago

I thought I understood the solution suggested, but reading these results I'm a bit confused :)

Instead of looking at hits, I thought we were going to look at models. I.e., if a given HMM is used in the SCG collections of both Bacteria and Archaea, then we do not include that model in our calculations of the mode of SCG frequencies. That should solve it without a big issue? :)

ivagljiva commented 4 months ago

Ah, yes, that is a better way to do it :)

I changed the implementation of get_gene_hit_counts_per_hmm_source() so that it accepts a flag variable called dont_include_models_with_multiple_domain_hits. If that flag variable is True, the function doesn't count hits to any models for which genes could get hits from two different SCG collections. That is: 1) we go through all genes and make a list of the HMM models that each gene has hits to 2) if a given gene has a hit to models in more than one HMM source (ie, SCG collection), we take note of which models those are 3) later when we go through the HMM hits list, we don't count hits to those models; ie, those models are never placed into the gene_hit_counts dictionary that is returned to the get_num_genomes_from_SCG_sources_dict() function.

When I tested it on the four synthetic metagenomes, I got the following output (the 'Hello there' messages will be self.run.warning() statements when seen from the command line:

>>> from anvio.hmmops import NumGenomesEstimator
>>> NumGenomesEstimator('SYNTH_META_00000000000000000002-contigs.db').estimates_dict
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source() function. Just so you know, someone asked for SCG HMMs that belong to multiple sources *not* to be counted, and this will result in 60 models to be removed from our counts, more specifically: 4 from Protista_83, 27 from Bacteria_71, 29 from Archaea_76. You can run this program with the `--debug` flag if you want to see a list of the models that we will ignore from each HMM source.
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 'Bacteria_71': {'num_genomes': 19, 'domain': 'bacteria'}, 'Archaea_76': {'num_genomes': 1, 'domain': 'archaea'}}
>>> NumGenomesEstimator('SYNTH_META_00000000000000000004-contigs.db').estimates_dict
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source() function. Just so you know, someone asked for SCG HMMs that belong to multiple sources *not* to be counted, and this will result in 62 models to be removed from our counts, more specifically: 4 from Protista_83, 29 from Bacteria_71, 29 from Archaea_76. You can run this program with the `--debug` flag if you want to see a list of the models that we will ignore from each HMM source.
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 'Bacteria_71': {'num_genomes': 17, 'domain': 'bacteria'}, 'Archaea_76': {'num_genomes': 2, 'domain': 'archaea'}}
>>> NumGenomesEstimator('SYNTH_META_00000000000000000006-contigs.db').estimates_dict
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source() function. Just so you know, someone asked for SCG HMMs that belong to multiple sources *not* to be counted, and this will result in 60 models to be removed from our counts, more specifically: 4 from Protista_83, 28 from Bacteria_71, 28 from Archaea_76. You can run this program with the `--debug` flag if you want to see a list of the models that we will ignore from each HMM source.
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 'Bacteria_71': {'num_genomes': 16, 'domain': 'bacteria'}, 'Archaea_76': {'num_genomes': 3, 'domain': 'archaea'}}
>>> NumGenomesEstimator('SYNTH_META_00000000000000000120-contigs.db').estimates_dict
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source() function. Just so you know, someone asked for SCG HMMs that belong to multiple sources *not* to be counted, and this will result in 59 models to be removed from our counts, more specifically: 4 from Protista_83, 27 from Bacteria_71, 28 from Archaea_76. You can run this program with the `--debug` flag if you want to see a list of the models that we will ignore from each HMM source.
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 'Bacteria_71': {'num_genomes': 16, 'domain': 'bacteria'}, 'Archaea_76': {'num_genomes': 2, 'domain': 'archaea'}}

To summarize the results:

name	Bacteria	Archaea	Total	true # Bacteria	true # Archaea
SYNTH_META_00000000000000000002	19	1	20.0	19	1
SYNTH_META_00000000000000000004	17	2	19.0	18	2
SYNTH_META_00000000000000000006	16	3	19.0	17	3
SYNTH_META_00000000000000000120	16	2	18.0	18	2

So we have now solved the overestimation problem, and one of these sample has a correct estimate now while the other 3 underestimate the number of bacteria by 1 (which is not as bad of an issue).

ivagljiva commented 4 months ago

@meren, I have a question regarding how we should update anvi-summarize and anvi-display-contigs-stats. The data for anvi-display-contigs-stats is coming from a call to summarizer.ContigSummarizer.get_summary_dict_for_assembly() in summarizer.py, which in turn calls the updated hmmops.py functions.

Currently, any call to hmmops.get_num_genomes_from_SCG_sources_dict() uses the new flag variable, so the estimates reflect the new way of doing things:

But the SCG histograms at the top of the page are generated via a direct call to hmmops.get_gene_hit_counts_per_hmm_source() without the flag variable, so they still show all hits (even to those models that we are ignoring):

I could change the latter function call to use the new flag so that the histograms in anvi-display-contigs-stats accurately reflect our num genome estimates, BUT because this data is generated via anvi-summarize, this would force anvi-summarize to report no SCG hits for the affected models. I don't think this is a good idea. Can I leave the histograms as-is?

meren commented 4 months ago

Let's leave the histogram as is. Much less headache :)

But I think would've been awesome if we had a short paragraph that describes this workflow somewhere on the docs and a link to that from the anvi-display-contigs-stats page.

ivagljiva commented 4 months ago

Done :) There is now a section in the anvi-display-contigs-stats help page that looks like this:

It is linked from both the interactive page (replacing the link to the old blog post) and from the contigs-db section about using the NumGenomesEstimator class.

Please take a look, and let me know if there is anything to be changed :)

Also, I noticed while updating the docs that the predictions for the Infant Gut dataset have changed. We now predict 10 bacterial genomes in that data, rather than 9. It is because of overlap from the following models:

WARNING
===============================================
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source()
function. Just so you know, someone asked for SCG HMMs that belong to multiple
sources *not* to be counted, and this will result in 70 models to be removed
from our counts, more specifically: 29 from Bacteria_71, 31 from Archaea_76, 10
from Protista_83. You can run this program with the `--debug` flag if you want
to see a list of the models that we will ignore from each HMM source.

* Models to be ignored for source Bacteria_71: Ribosomal_S8, Ribosomal_L13,
  Ribosomal_S9, Ribosomal_L3, Ribosomal_S19, Ribosomal_L14, Ribosomal_S15,
  Ribosomal_S2, Ribosomal_L4, Ribosomal_L16, SecY, Ribosom_S12_S23,
  Ribosomal_S13, Ribosomal_L22, Adenylsucc_synt, Ribosomal_S7, RNA_pol_Rpb6,
  Ribosomal_L2, Ribosomal_S11, Ribosomal_L29, Ribosomal_L23, tRNA-synt_1d,
  Ribosomal_S17, Ribosomal_L6, eIF-1a, Ribosomal_L1, Ham1p_like, Ribosomal_L27A,
  RNA_pol_L
* Models to be ignored for source Archaea_76: Ribosomal_S8, Ribosomal_L13,
  Ribosomal_S9, Ribosomal_L3, Ribosomal_S19, Ribosomal_L14, RNA_pol_L_2,
  Ribosomal_S15, Ribosomal_S24e, ATP-synt_F, Ribosomal_S2, Ribosomal_S8e,
  Ribosomal_L4, ATP-synt_D, Ribosomal_L16, SecY, Ribosom_S12_S23, Ribosomal_S13,
  Ribosomal_L22, Diphthamide_syn, Adenylsucc_synt, Ribosomal_S7, RNA_pol_Rpb6,
  Ribosomal_S11, Ribosomal_L29, Ribosomal_L23, tRNA-synt_1d, Ribosomal_S17,
  Ribosomal_L6, Ribosomal_L1, Ham1p_like
* Models to be ignored for source Protista_83: EPrGT00050000005732,
  EPrGT00050000006117, EPrGT00050000005182, EPrGT00050000005852,
  EPrGT00050000005482, EPrGT00050000005111, EPrGT00050000006007,
  EPrGT00050000006107, EPrGT00300000062292, EPrGT00050000006045

If this is cause for concern, let me know! We could refine the implementation a bit more. One alternative idea that @FlorianTrigodet discussed with me is that instead of throwing away the models, we could keep the best hit (ie, best e-value) to each gene with multi-domain hits.

meren commented 4 months ago

This is great :) Thank you very much, Iva. I don't think it is a concern -- this is not a method that will be 100% accurate for many reasons, but it will get very close to the truth (as it is the case now).

ivagljiva commented 4 months ago

Yay! I will open a PR, then :)

merenlab / anvio