merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
432 stars 145 forks source link

[BUG] is get_enriched_groups(props, reps) working as intended? #2280

Open kafker opened 3 months ago

kafker commented 3 months ago

Short description of the problem

Dear Devs,

As far as I understand the 'associated groups' are decided before the enrichment analysis by the function get_enriched_groups. This function calculates in which groups a specific function has a proportion that is higher than the overall proportion across all groups.

In get_enriched_groups the overall proportions are calculated as follow:

overall_portion = np.sum(np.multiply(props, reps)) / np.sum(reps)

It is the number of genomes with that specific function divided by the total number of genomes. Here is an example:

group props genomes_in_group genomes_with_K02386
group1 0.125 8 1
group2 0.2307692307692308 13 3
group3 0.029761905 168 5
group4 0.7563218390804598 435 329
group5 0.078179697 857 67
group6 0.171875 64 11
group7 0 13 0
group8 0 14 0
group9 0.375 8 3

So the overall_proportion for K02386 should be 419/1580=0.26

However, the associated_groups for K02386 are all groups except group7 and group8, instead of just group4 where 0.75 > 0.26.

Is my understanding of get_enriched_groups correct or I am missing something?

Thank you A

anvi'o version

anvio-dev

Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.13

Profile database .............................: 40
Contigs database .............................: 23
Pan database .................................: 17
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2
ivagljiva commented 3 months ago

Hi @kafker , your understanding of get_enriched_groups matches mine, and appears to be correct. Here is the output of this function on your test case in Python:

import numpy as np
import anvio
from anvio import utils

props = [1/8,3/13,5/168,329/435,67/857,11/64,0/13,0/14,3/8]
# making sure the fractions match up to the table
props
# [0.125, 0.23076923076923078, 0.02976190476190476, 0.7563218390804598, 0.07817969661610269, 0.171875, 0.0, 0.0, 0.375]

reps = [8,13,168,435,857,64,13,14,8]

# check the overall proportion
overall_portion = np.sum(np.multiply(props, reps)) / np.sum(reps)
overall_portion
# 0.26518987341772154

utils.get_enriched_groups(props,reps)
# array([False, False, False,  True, False, False, False, False,  True])

As you can see, it returns True for group 4 (fraction 0.76) and group 9 (fraction 0.38), both of which have fractions greater than the overall proportion of 0.26.

So that then begs the question, why are you not seeing this result on your input data? There are a couple of ways I can think of to cause this:

  1. we are calculating the input proportions incorrectly
  2. we are calculating the input proportions correctly, but the proportions do not exactly match to the number of genomes with an annotation for K02386
  3. we are not using get_enriched_groups

Option 1 would be a bug, and we would need further info from you to help debug it.

Option 2 could happen in a couple of ways depending on which enrichment program you are using. I think for anvi-compute-functional-enrichment-across-genomes we take the best hit per gene. So if a gene with K02386 is also annotated with a different KO that has a better e-value, the K02386 annotation would not be counted. I've described this before on Discord here.

For anvi-compute-functional-enrichment-in-pan, when we compute the proportions (which is done in summarizer.py) we use the most frequent annotation in a given gene cluster (as computed by this function). So if a gene cluster includes K02386 but it is not the most frequent annotation in the cluster, it would not be counted.

Finally, Option 3 is the most likely based on what you've described:

However, the associated_groups for K02386 are all groups except group7 and group8

This could happen if you are running anvi-compute-functional-enrichment-across-genomes, because this script internally doesn't use get_enriched_groups. Rather, it identifies any group that has at least one genome with the function present:

associated_groups = [g for g in group_names if self.functions_across_groups_presence_absence[key_hash][g]]

(from here)

@meren , I think this is a bug. Why aren't we using get_enriched_groups in the AggregateFunctions class?

@kafker, Are you running anvi-compute-functional-enrichment-across-genomes? If so, I suspect option 3 is causing the output you've observed.

kafker commented 3 months ago

Hi @ivagljiva

Thank you very much for your answer.

I did the enrichment analysis with "anvi-compute-functional-enrichment-across-genomes" because we are working with more than 1500 MAGs and at the moment we are not interested in GC.

I think it is option 3. The use of presence/absence criterion to determine the associated_groups doesn't sound right to me. With this strategy only functions that are absent in at least one group are considered for the enrichment analysis. So if I have two groups A and B, with 500 genomes each and a function is present in all group A genomes and only in 1 group B genome then this function is not considered for the enrichment analysis. If that is the case then let the user decide the threshold below which a function can be considered absent.

Thank you! A

ivagljiva commented 3 months ago

I agree that it doesn't seem like the right approach, @kafker. I think that it would be more reasonable to change anvi-compute-functional-enrichment-across-genomes to use get_enriched_groups so that it has consistent behavior with the other two enrichment scripts.

meren commented 3 months ago

Thank you both very much. I'll take a look at this hopefully soon and fix this for anvi-compute-functional-enrichment-across-genomes.