merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
432 stars 145 forks source link

A fix for estimating number of genomes using single-copy core gene hits #2235

Closed ivagljiva closed 7 months ago

ivagljiva commented 7 months ago

This PR addresses issue #2231, in which genes that match SCGs for multiple taxonomic domains sometimes lead to over-estimating the number of genomes belonging to those domains. I've implemented the solution suggested by @meren and @FlorianTrigodet, in which we don't consider any SCG models with hits to genes that also have hits to other SCG models from a different domain.

Copying here the summary of changes that I already described in that issue:

I changed the implementation of get_gene_hit_counts_per_hmm_source() so that it accepts a flag variable called dont_include_models_with_multiple_domain_hits. If that flag variable is True, the function doesn't count hits to any models for which genes could get hits from two different SCG collections. That is:

  1. we go through all genes and make a list of the HMM models that each gene has hits to
  2. if a given gene has a hit to models in more than one HMM source (ie, SCG collection), we take note of which models those are
  3. later when we go through the HMM hits list, we don't count hits to those models; ie, those models are never placed into the gene_hit_counts dictionary that is returned to the get_num_genomes_from_SCG_sources_dict() function.

This PR also includes updated documentation, which is linked from the interactive display of anvi-display-contigs-stats.

See the linked issue for test results and other relevant discussions :)