This PR addresses issue #2231, in which genes that match SCGs for multiple taxonomic domains sometimes lead to over-estimating the number of genomes belonging to those domains. I've implemented the solution suggested by @meren and @FlorianTrigodet, in which we don't consider any SCG models with hits to genes that also have hits to other SCG models from a different domain.
Copying here the summary of changes that I already described in that issue:
I changed the implementation of get_gene_hit_counts_per_hmm_source() so that it accepts a flag variable called dont_include_models_with_multiple_domain_hits. If that flag variable is True, the function doesn't count hits to any models for which genes could get hits from two different SCG collections. That is:
we go through all genes and make a list of the HMM models that each gene has hits to
if a given gene has a hit to models in more than one HMM source (ie, SCG collection), we take note of which models those are
later when we go through the HMM hits list, we don't count hits to those models; ie, those models are never placed into the gene_hit_counts dictionary that is returned to the get_num_genomes_from_SCG_sources_dict() function.
This PR also includes updated documentation, which is linked from the interactive display of anvi-display-contigs-stats.
See the linked issue for test results and other relevant discussions :)
This PR addresses issue #2231, in which genes that match SCGs for multiple taxonomic domains sometimes lead to over-estimating the number of genomes belonging to those domains. I've implemented the solution suggested by @meren and @FlorianTrigodet, in which we don't consider any SCG models with hits to genes that also have hits to other SCG models from a different domain.
Copying here the summary of changes that I already described in that issue:
This PR also includes updated documentation, which is linked from the interactive display of
anvi-display-contigs-stats
.See the linked issue for test results and other relevant discussions :)