merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Feature request: anvi-get-enriched-functions-per-pan-group at the bin level #1432

Closed IsabelFE closed 3 years ago

IsabelFE commented 4 years ago

I asked in Slack if there was a way to use anvi-get-enriched-functions-per-pan-group to do functional enrichment on bins in a collection instead of on groups of genomes in a pangenome. Meren thought it was a good idea and suggested I open an issue here.

Maybe anvi-get-enriched-functions-per-collection-bin??

This can be used, for example, to do a Core vs Accessory gene clusters functional enrichment analysis.

Thanks!

aberaslop commented 4 years ago

Hi! This is exactly what I was looking for in the tutorials. It would certainly be a great addition! Has there been any update? And if not yet, did you find a way to perform such an analysis?

Thank you!

meren commented 4 years ago

@ivagljiva, while you are working on this tool, please take a look at this one as well :)

IsabelFE commented 4 years ago

@aberaslop I couldn't find a tool to do this. I ended up just using the COG annotation data exported from anvi'o and looking at the % of each COG category in the Core vs. the Accessory. Since a GC can have individual genes with different COG category assignments, I did the analysis at the gene level, counting each gene's contribution as 1/x, where x is the number of gene copies in a given GC. Then, for all genes in a COG category (for example, G), I summed the corrected gene contributions in the Core and in the Accessory and considered that category "enriched" in the Accessory if the Accessory/Core ratio was above a threshold. I hope this makes sense.
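For readers who want to try this, here is a minimal sketch of the calculation described above. It is not an anvi'o tool. The input layout (one `(gene_cluster_id, bin_name, cog_category)` tuple per gene), the function names, and the default threshold of 1.5 are all hypothetical, and the ratio is assumed to be taken on within-bin fractions rather than raw sums.

```python
from collections import Counter, defaultdict


def weighted_category_totals(genes):
    """Sum 1/x contributions per (bin, COG category).

    genes: iterable of (gene_cluster_id, bin_name, cog_category) tuples,
           one tuple per gene (hypothetical layout, not an anvi'o format).
    """
    genes = list(genes)
    # x = number of gene copies in each gene cluster
    copies = Counter(gc for gc, _, _ in genes)
    totals = defaultdict(lambda: defaultdict(float))
    for gc, bin_name, category in genes:
        # Each gene contributes 1/x, so every gene cluster sums to 1
        totals[bin_name][category] += 1.0 / copies[gc]
    return totals


def category_fractions(totals_for_bin):
    """Convert weighted totals into fractions of the bin's total."""
    total = sum(totals_for_bin.values())
    return {cat: weight / total for cat, weight in totals_for_bin.items()}


def enriched_in_accessory(genes, threshold=1.5):
    """Return {category: ratio} where Accessory/Core exceeds the threshold.

    The threshold default is an arbitrary placeholder.
    """
    totals = weighted_category_totals(genes)
    core = category_fractions(totals["Core"])
    accessory = category_fractions(totals["Accessory"])
    enriched = {}
    for category, acc_fraction in accessory.items():
        core_fraction = core.get(category, 0.0)
        if core_fraction == 0:
            continue  # ratio undefined when the category is absent from Core
        ratio = acc_fraction / core_fraction
        if ratio > threshold:
            enriched[category] = ratio
    return enriched
```

Categories that appear only in the Accessory are skipped here because the ratio is undefined; they would need to be reported separately.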

aberaslop commented 4 years ago

Thank you so much, @IsabelFE, @meren and @ivagljiva for your answer and for having a look at this suggestion! It is super helpful.

@IsabelFE, did you use the output table from anvi-summarize? I think I understand why you counted each gene contribution as 1/x. You would do that if there are several genes within a GC with different COG category assignments. But what would you do if one gene is assigned to two COG categories? Would you count that gene's contribution (1/x) for each of the COG categories? Wouldn't that cause the sum of all %COG categories to be more than 100%?

Thank you so much for all your help!

IsabelFE commented 4 years ago

If one gene was assigned to more than one COG category, I considered it an ambiguous assignment. I grouped all of those into one category on my plot and called it ambiguous (they were around 10% of the total). It is not ideal, I know... Between genes with unassigned COGs, ambiguous COGs and uninformative COGs (categories S and R), I ended up with only 60% informative COG assignments.
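The grouping described above could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and each gene's annotation is assumed to have already been parsed into a list of COG category letters (how multiple categories are delimited depends on the export).

```python
from collections import Counter

# COG categories treated as uninformative in the comment above:
# S = function unknown, R = general function prediction only
UNINFORMATIVE = {"S", "R"}


def classify(categories):
    """Label one gene by its list of COG category letters."""
    if not categories:
        return "unassigned"
    if len(set(categories)) > 1:
        return "ambiguous"  # more than one distinct category
    category = categories[0]
    return "uninformative" if category in UNINFORMATIVE else category


def informative_fraction(per_gene_categories):
    """Fraction of genes with a single, informative COG category."""
    labels = Counter(classify(c) for c in per_gene_categories)
    total = sum(labels.values())
    if total == 0:
        return 0.0
    excluded = labels["unassigned"] + labels["ambiguous"] + labels["uninformative"]
    return (total - excluded) / total
```

Keeping "ambiguous" as its own label means every gene is counted exactly once, so the percentages across all labels still sum to 100%.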

aberaslop commented 4 years ago

Hi @IsabelFE, thank you so much. I agree, it is not ideal, but it is a way to start looking into the data. Thanks again for your help!

IsabelFE commented 4 years ago

I am glad that was helpful. If you find a better approach, please let me know!

aberaslop commented 4 years ago

I certainly will!

ivagljiva commented 3 years ago

@IsabelFE and @aberaslop, I have started to look into this, and realized I would benefit from some clarification from you guys. :)

When you say you want to do functional enrichment on "bins in a collection", do you mean that you would bin some gene clusters in your pangenome (into, for example, Core GCs and Accessory GCs) and you want to look for functions that are enriched in each bin of gene clusters?

I am sorry if this seems like an obvious question. In truth, when I first saw "bins in a collection" my mind jumped to metagenomic bins, not to pangenome bins, and I got very confused when reading through the rest of your comments. 😅 I think I get it now, but I would appreciate your input just to make sure :)

IsabelFE commented 3 years ago

Hi @ivagljiva, I was thinking of Core GCs vs. Accessory GCs. But I have been thinking about this more and I am not sure it makes sense. The functional enrichment per pan group that is implemented now goes through each gene cluster and decides whether that specific cluster is enriched in one group of genomes vs. another. The issue with Core vs. Accessory is that any given gene cluster will be in either one bin or the other. What I really wanted to do is to calculate whether, in general, some COG categories are enriched in Core vs. Accessory.

meren commented 3 years ago

Hey @IsabelFE, @ivagljiva and I discussed this and came to the conclusion that the frequency of functions would be too low (too many distinct functions at low frequency in either group) for a proper statistical analysis. The summary output should provide everything necessary to tally the counts of functions from that spreadsheet.

Best,

IsabelFE commented 3 years ago

Thanks @meren, that is what I did. Best

ivagljiva commented 3 years ago

Thanks for your input, @IsabelFE! I am going to close this issue since we decided not to implement this. :) But if there is anything else we can fix or improve, please let us know!