krassowski closed this 2 years ago
Today I also came across this idea. When I analyzed word frequencies in gene descriptions, the most frequent words were basically uninformative:
word freq
gene 30664
protein 22489
transcript 11935
mirna 11481
family 8837
encoded 8260
encodes 7848
proteins 7586
variants 6983
involved 5223
member 4983
I think we can use the word frequencies from all GO terms/definitions/gene descriptions as the background, and maybe use a binomial test to find the "over-represented words".
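To make the binomial-test idea concrete, here is a minimal sketch in base R. It is only an illustration under stated assumptions: overrepresented_words() is a hypothetical helper (not an existing function of the package), the inputs are assumed to be named vectors of word counts, and the add-one smoothing of the background proportion is my own choice so that words missing from the background still get a non-zero expected rate.

```r
# Hypothetical helper, not part of the package: rank words by how strongly
# they are over-represented relative to a background word-count table.
# fg, bg: named numeric vectors of word counts (foreground / background).
overrepresented_words <- function(fg, bg, n = 10) {
  words <- names(fg)

  # Background count for each foreground word (0 if absent from background)
  bg_count <- unname(bg[words])
  bg_count[is.na(bg_count)] <- 0

  # Expected proportion under the background; add-one smoothing (assumption)
  # keeps the probability strictly between 0 and 1
  p0 <- (bg_count + 1) / (sum(bg) + 2)

  # One-sided binomial test per word: is it more frequent than expected?
  fg_total <- sum(fg)
  p_value <- vapply(seq_along(words), function(i) {
    binom.test(fg[[i]], fg_total, p = p0[i], alternative = "greater")$p.value
  }, numeric(1))

  res <- data.frame(
    word = words,
    count = unname(fg),
    expected = p0 * fg_total,
    p_value = p_value,
    stringsAsFactors = FALSE
  )
  # Most over-represented words first
  res <- res[order(res$p_value), ]
  head(res, n)
}

# Toy counts, invented for illustration only
fg <- c(gene = 30, kinase = 12, apoptosis = 8)
bg <- c(gene = 3000, protein = 2000, kinase = 40, apoptosis = 10)
res <- overrepresented_words(fg, bg)
print(res)
```

With these toy counts, "gene" is about as frequent in the foreground as in the background and ends up at the bottom, while the rarer words rise to the top. With many words, the p-values would probably need a multiple-testing correction (e.g. p.adjust()) if they are used as a cutoff rather than just for ranking.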
Some terms are just common among:
It would be very useful if the word cloud could show the n most over-represented terms (as an optional replacement for the current n most common terms). The user would just need to provide a list of pathways to use as background.
Implementation-wise, I imagine keeping count_word() unchanged but adding an extra step (conditional on the user providing a background or setting a switch argument) in anno_word_cloud(). If this sounds good, I will be happy to work on it.