Closed IsabelFE closed 3 years ago
Hi! This is exactly what I was looking for in the tutorials. It would certainly be a great addition! Has there been any update? And if not yet, did you find a way to perform such an analysis?
Thank you!
@ivagljiva, while you are working on this tool, please take a look at this one as well :)
@aberaslop I couldn't find a tool to do this. I ended up just using the COG annotation data exported from anvi'o and looking at the % of each COG category in the Core vs. the Accessory. Since a GC can contain individual genes with different COG category assignments, I did the analysis at the gene level, counting each gene's contribution as 1/x, where x is the number of gene copies in its GC. Then, for all genes in a COG category (for example, G), I summed the corrected gene contributions in the Core and in the Accessory and considered that category "enriched" in the Accessory if the Accessory/Core ratio was above a threshold. I hope this makes sense.
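For anyone who wants to try the same thing, here is a minimal sketch of the 1/x weighting described above. It is not an anvi'o tool and does not read anvi'o output directly: the input format (one `(gene_cluster_id, bin_name, cog_category)` tuple per gene), the function names, the bin labels `"Core"` and `"Accessory"`, and the default threshold are all assumptions made for illustration. It also assumes the ratio is taken between each category's share of its bin, and that a category missing from the Core counts as enriched.

```python
from collections import Counter, defaultdict


def weighted_category_fractions(genes):
    """Return {bin_name: {cog_category: fraction}} using 1/x gene weights.

    `genes` is an iterable of (gene_cluster_id, bin_name, cog_category)
    tuples, one per gene. This layout is hypothetical, not an anvi'o format.
    """
    genes = list(genes)
    # x = number of gene copies in each gene cluster
    copies = Counter(gc for gc, _, _ in genes)

    totals = defaultdict(lambda: defaultdict(float))
    for gc, bin_name, category in genes:
        # every gene contributes 1/x, so each gene cluster sums to 1
        totals[bin_name][category] += 1.0 / copies[gc]

    fractions = {}
    for bin_name, per_category in totals.items():
        bin_sum = sum(per_category.values())
        fractions[bin_name] = {c: w / bin_sum for c, w in per_category.items()}
    return fractions


def enriched_in_accessory(fractions, threshold=2.0):
    """List categories whose Accessory/Core ratio exceeds `threshold`.

    The threshold value is arbitrary here; pick one that suits your data.
    """
    core = fractions.get("Core", {})
    accessory = fractions.get("Accessory", {})
    enriched = []
    for category, acc_frac in accessory.items():
        core_frac = core.get(category, 0.0)
        # assumption: a category absent from the Core counts as enriched
        if core_frac == 0.0 or acc_frac / core_frac > threshold:
            enriched.append(category)
    return sorted(enriched)
```

To use it, build the list of tuples from whatever table you exported, pass it to `weighted_category_fractions`, and then hand the result to `enriched_in_accessory`. This is only a descriptive ratio, not a statistical test.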
Thank you so much, @IsabelFE, @meren and @ivagljiva for your answer and for having a look at this suggestion! It is super helpful.
@IsabelFE, did you use the output table from anvi-summarize? I think I understand why you counted each gene contribution as 1/x. You would do that if there are several genes within a GC with different COG category assignments. But what would you do if one gene is assigned to two COG categories? Would you count that gene contribution (1/x) for each of the COG categories? Wouldn't that cause the sum of all %COG categories to be more than 100%?
Thank you so much for all your help!
If one gene was assigned to more than one COG category, I considered it an ambiguous assignment. I grouped all of those into one category on my plot and called it ambiguous (they were around 10% of the total). It is not ideal, I know... Between genes with no assigned COG, ambiguous COGs and uninformative COGs (categories S and R), I ended up with only 60% informative COG assignments.
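A small sketch of that grouping, in case it helps. The function name and the input (a list of single-letter COG categories for one gene) are assumptions for illustration; how you split the category field from your exported table depends on your own data. Treating repeated identical letters as a single category is also an assumption.

```python
UNINFORMATIVE = {"S", "R"}  # categories treated as uninformative in this thread


def classify_cog_assignment(categories):
    """Map one gene's COG category letters to a single plotting label.

    `categories` is a list of single-letter categories for one gene
    (hypothetical input; empty list or empty strings mean no annotation).
    """
    # drop empty entries and collapse repeated letters (assumption)
    unique = sorted({c for c in categories if c})
    if not unique:
        return "unassigned"
    if len(unique) > 1:
        return "ambiguous"
    if unique[0] in UNINFORMATIVE:
        return "uninformative"
    return unique[0]
```

Because every gene ends up with exactly one label, the per-category percentages still sum to 100%.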
Hi @IsabelFE, thank you so much. I agree, it is not ideal, but it is a way to start looking into the data. Thanks again for your help!
I am glad that was helpful. If you find a better approach, please let me know!
I certainly will!
@IsabelFE and @aberaslop, I have started to look into this, and realized I would benefit from some clarification from you guys. :)
When you say you want to do functional enrichment on "bins in a collection", do you mean that you would bin some gene clusters in your pangenome (into, for example, Core GCs and Accessory GCs) and you want to look for functions that are enriched in each bin of gene clusters?
I am sorry if this seems like an obvious question. In truth, when I first saw "bins in a collection" my mind jumped to metagenomic bins, not to pangenome bins, and I got very confused when reading through the rest of your comments. 😅 I think I get it now, but I would appreciate your input just to make sure :)
Hi @ivagljiva, I was thinking of Core GCs vs. Accessory GCs. But I have been thinking about this more and I am not sure it makes sense. The functional enrichment per pan group that is implemented now goes cluster by cluster and decides whether that specific cluster is enriched in one group of genomes vs. another. The issue with Core vs. Accessory is that a gene cluster will be in either one bin or the other. What I really wanted to do is to calculate whether, in general, some COG categories are enriched in Core vs. Accessory.
Hey @IsabelFE, @ivagljiva and I discussed this and came to the conclusion that the frequency of functions would be too low (too many distinct functions at low frequency in either group) for a proper statistical analysis. The summary output should give you everything necessary to summarize the counts of functions using that spreadsheet.
Best,
Thanks @meren, that is what I did. Best
Thanks for your input, @IsabelFE! I am going to close this issue since we decided not to implement this. :) But if there is anything else we can fix or improve, please let us know!
I asked in Slack if there was a way to use anvi-get-enriched-functions-per-pan-group to do functional enrichment on bins in a collection instead of on groups of genomes in a pangenome. Meren thought it was a good idea and suggested I open an issue here.
Maybe anvi-get-enriched-functions-per-collection-bin??
This can be used, for example, to do a Core vs Accessory gene clusters functional enrichment analysis.
Thanks!