jokergoo / simplifyEnrichment

Simplify functional enrichment results
https://jokergoo.github.io/simplifyEnrichment

Word cloud: show most over-represented words, not the most frequent ones #58

Closed: krassowski closed this issue 2 years ago

krassowski commented 3 years ago

Some terms are just common among:

It would be very useful if the word cloud could show the n most over-represented terms (as an optional replacement for the current n most common terms). The user would only need to provide a list of pathways to use as background.

Implementation-wise, I imagine keeping count_word() unchanged but adding an extra step in anno_word_cloud(), conditional on the user providing a background or setting a switch argument. If this sounds good, I will be happy to work on it.
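To make the proposal concrete, here is a minimal sketch of such an optional extra step. It is written in Python purely for illustration (the package itself is in R), and the function name `top_words`, its arguments, and the add-one smoothing are assumptions, not part of simplifyEnrichment. Without a background it ranks by raw frequency; with one, it ranks by the ratio of relative frequencies as a simple stand-in for over-representation.

```python
from collections import Counter


def top_words(counts, n=10, background=None):
    """Return the n top words (hypothetical helper, not the package API).

    counts / background: mappings from word to occurrence count.
    """
    counts = Counter(counts)

    # No background supplied: keep the existing behaviour (most frequent words).
    if background is None:
        return [word for word, _ in counts.most_common(n)]

    background = Counter(background)
    total = sum(counts.values())
    bg_total = sum(background.values())

    def ratio(word):
        fg = counts[word] / total
        # Add-one smoothing (an assumption) so that words missing from the
        # background do not cause a division by zero.
        bg = (background[word] + 1) / (bg_total + len(background))
        return fg / bg

    # Rank by how much more frequent a word is here than in the background.
    return sorted(counts, key=ratio, reverse=True)[:n]
```

With made-up counts such as `{"gene": 50, "protein": 30, "apoptosis": 20}` and a background in which "gene" and "protein" are common but "apoptosis" is rare, the ranking without a background starts with "gene", while the ranking with the background starts with "apoptosis".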

jokergoo commented 3 years ago

I came across this idea today as well. When I analyzed word frequency for gene descriptions, the top words were basically not useful:

 word                 freq
 gene                 30664
 protein              22489
 transcript           11935
 mirna                11481
 family                8837
 encoded               8260
 encodes               7848
 proteins              7586
 variants              6983
 involved              5223
 member                4983

I think we can use the word frequency from all GO terms/definitions/gene descriptions as background, and perhaps use a binomial test to find the "over-represented words".
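As a rough illustration of the binomial-test idea, the sketch below treats the n words in the selected terms as n draws, each hitting a given word with probability p equal to that word's share of the background, and computes the one-sided p-value P(X >= k) for the observed count k. It is in Python only for illustration (the package is in R); the function names, the pseudo-count for words missing from the background, and the 0.05 cutoff are all assumptions, and no multiple-testing correction is applied.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    terms = (
        math.exp(
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        for i in range(k, n + 1)
    )
    return min(1.0, math.fsum(terms))


def over_represented(counts, background, alpha=0.05):
    """Return (word, p_value) pairs with p_value < alpha, smallest first.

    Hypothetical helper: counts and background map words to occurrence counts.
    """
    n = sum(counts.values())
    bg_total = sum(background.values())
    hits = []
    for word, k in counts.items():
        # Assumption: a word absent from the background gets a pseudo-count of 1,
        # so its expected probability is small but not zero.
        p = max(background.get(word, 0), 1) / bg_total
        p_value = binom_upper_tail(k, n, p)
        if p_value < alpha:
            hits.append((word, p_value))
    return sorted(hits, key=lambda pair: pair[1])
```

On toy data where "gene" and "protein" occur at about their background rate and "apoptosis" occurs far above it, only "apoptosis" passes the cutoff. The exact tail sum loops over up to n terms, which is fine for a sketch; a real implementation would more likely call an existing routine such as R's `binom.test()` or `pbinom()`.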