Open kafker opened 3 months ago
Hi @kafker , your understanding of get_enriched_groups
matches mine, and appears to be correct. Here is the output of this function on your test case in Python:
import numpy as np
import anvio
from anvio import utils
props = [1/8,3/13,5/168,329/435,67/857,11/64,0/13,0/14,3/8]
# making sure the fractions match up to the table
props
# [0.125, 0.23076923076923078, 0.02976190476190476, 0.7563218390804598, 0.07817969661610269, 0.171875, 0.0, 0.0, 0.375]
reps = [8,13,168,435,857,64,13,14,8]
# check the overall proportion
overall_portion = np.sum(np.multiply(props, reps)) / np.sum(reps)
overall_portion
# 0.26518987341772154
utils.get_enriched_groups(props,reps)
# array([False, False, False, True, False, False, False, False, True])
As you can see, it returns True
for group 4 (fraction 0.76) and group 9 (fraction 0.38), both of which have fractions greater than the overall proportion of 0.26.
So that then begs the question, why are you not seeing this result on your input data? There are a couple of ways I can think of to cause this:
get_enriched_groups
Option 1 would be a bug, and we would need further info from you to help debug it.
Option 2 could happen in a couple of ways depending on which enrichment program you are using. I think for anvi-compute-functional-enrichment-across-genomes
we take the best hit per gene. So if a gene with K02386 is also annotated with a different KO that has a better e-value, the K02386 annotation would not be counted. I've described this before on Discord here.
For anvi-compute-functional-enrichment-in-pan
, when we compute the proportions (which is done in summarizer.py
) we use the most frequent annotation in a given gene cluster (as computed by this function). So if a gene cluster includes K02386 but it is not the most frequent annotation in the cluster, it would not be counted.
Finally, Option 3 is the most likely based on what you've described:
However, the associated_groups for K02386 are all groups except group7 and group8
This could happen if you are running anvi-compute-functional-enrichment-across-genomes
, because this script internally doesn't use get_enriched_groups
. Rather, it identifies any group that has at least one genome with the function present:
associated_groups = [g for g in group_names if self.functions_across_groups_presence_absence[key_hash][g]]
(from here)
@meren , I think this is a bug. Why aren't we using get_enriched_groups
in the AggregateFunctions
class?
@kafker, Are you running anvi-compute-functional-enrichment-across-genomes
? If so, I suspect option 3 is causing the output you've observed.
Hi @ivagljiva
Thank you very much for your answer.
I did the enrichment analysis with "anvi-compute-functional-enrichment-across-genomes" because we are working with more than 1500 MAGs and at the moment we are not interested in GC.
I think it is option 3
. The use of presence/absence criterion to determine the associated_groups
doesn't sound right to me. With this strategy only functions that are absent in at least one group are considered for the enrichment analysis. So if I have two groups A and B, with 500 genomes each and a function is present in all group A genomes and only in 1 group B genome then this function is not considered for the enrichment analysis. If that is the case then let the user decide the threshold below which a function can be considered absent.
Thank you! A
I agree that it doesn't seem like the right approach, @kafker. I think that it would be more reasonable to change anvi-compute-functional-enrichment-across-genomes
to use get_enriched_groups
so that it has consistent behavior with the other two enrichment scripts.
Thank you both very much. I'll take a look at this hopefully soon and fix this for anvi-compute-functional-enrichment-across-genomes
.
Short description of the problem
Dear Devs,
As far as I understand the 'associated groups' are decided before the enrichment analysis by the function
get_enriched_groups
. This function calculates in which groups a specific function has a proportion that is higher than the overall proportion across all groups.In
get_enriched_groups
the overall proportions are calculated as follow:overall_portion = np.sum(np.multiply(props, reps)) / np.sum(reps)
It is the number of genomes with that specific function divided by the total number of genomes. Here is an example:
So the overall_proportion for K02386 should be
419/1580=0.26
However, the
associated_groups
for K02386 are all groups exceptgroup7
andgroup8
, instead of justgroup4
where0.75 > 0.26
.Is my understanding of
get_enriched_groups
correct or I am missing something?Thank you A
anvi'o version
anvio-dev