jokergoo / simplifyEnrichment

Simplify functional enrichment results
https://jokergoo.github.io/simplifyEnrichment

Preserve whitespace for selected phrases and/or allow to include n-grams #52

Closed krassowski closed 3 years ago

krassowski commented 3 years ago

Some phrases make much more sense when analysed as a single term. For example, one of my word clouds shows "cycle", but I know that this combines "cell cycle" with other cycles. I would like to single out "cell cycle" as a term that should not be split on whitespace. It is common practice when generating word clouds for research to specify a list of such n-grams to preserve.
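To illustrate what I mean, here is a minimal sketch in base R, assuming the terms are split on single spaces. `protect_phrases` is a hypothetical helper (it does not exist in simplifyEnrichment or tm): it joins each listed phrase with a placeholder before splitting, and the placeholder is turned back into a space afterwards.

```r
# Hypothetical helper, for illustration only: replace the spaces inside each
# protected phrase with a placeholder so that splitting on " " keeps it whole.
protect_phrases <- function(text, phrases, sep = "_") {
  for (p in phrases) {
    joined <- gsub(" ", sep, p, fixed = TRUE)
    text <- gsub(p, joined, text, fixed = TRUE)
  }
  text
}

terms <- c("regulation of cell cycle", "cell cycle arrest", "urea cycle")

protected <- protect_phrases(terms, phrases = "cell cycle")
tokens <- unlist(strsplit(protected, " ", fixed = TRUE))
tokens <- gsub("_", " ", tokens, fixed = TRUE)  # restore the original spacing

table(tokens)
```

With this input, "cell cycle" is counted as its own term and the bare "cycle" only comes from "urea cycle". The matching is a plain substring match, so a real implementation would probably want word boundaries and a placeholder that cannot occur in the input.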

On a related note, it could be useful to allow including all n-grams up to a specified length n. The FAQ section of the tm package shows that this is possible by providing a custom tokenizer:

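  library(tm)    # also attaches NLP, which provides ngrams() and words()
  data("crude")  # example corpus shipped with tm

  # Tokenizer that returns every pair of adjacent words as a single term
  BigramTokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

  tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))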
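The FAQ example only produces bigrams. A rough sketch of the "all n-grams up to n" variant could look like the following; it is written in base R on a plain word vector so it runs without tm, and `ngrams_up_to` is a hypothetical name, not an existing function.

```r
# Hypothetical helper: all contiguous n-grams of length 1..max_n from a
# character vector of words, each n-gram pasted into one string.
ngrams_up_to <- function(w, max_n = 2) {
  out <- character(0)
  for (k in seq_len(min(max_n, length(w)))) {
    starts <- seq_len(length(w) - k + 1)
    grams <- vapply(
      starts,
      function(i) paste(w[i:(i + k - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

res <- ngrams_up_to(c("regulation", "of", "cell", "cycle"), max_n = 2)
print(res)
```

Following the FAQ pattern, this could presumably be used as a tm tokenizer via something like `function(x) ngrams_up_to(words(x), 2)`, but I have not tested that combination.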

A simple solution would be to expose the control list as an argument that users can customize, which would let them supply a custom tokenizer.
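As a sketch of the interface I have in mind (all names here are hypothetical, and the counting is done in base R rather than through tm so that the example is self-contained): the function keeps its own defaults, and any entry the user passes in `control` overrides them.

```r
# Hypothetical wrapper, for illustration only; not the package's actual code.
count_terms <- function(text, control = list()) {
  defaults <- list(
    tokenize = function(x) unlist(strsplit(x, "[[:space:]]+")),
    tolower = TRUE
  )
  control <- modifyList(defaults, control)  # user-supplied entries win

  if (control$tolower) text <- tolower(text)
  tokens <- unlist(lapply(text, control$tokenize), use.names = FALSE)
  sort(table(tokens), decreasing = TRUE)
}

# A user-supplied tokenizer that returns adjacent word pairs
bigrams <- function(x) {
  w <- unlist(strsplit(x, "[[:space:]]+"))
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}

res <- count_terms(
  c("Cell cycle arrest", "regulation of cell cycle"),
  control = list(tokenize = bigrams)
)
print(res)
```

In the package itself, the merged list would presumably just be passed on as the `control` argument of `TermDocumentMatrix()`, so existing behaviour stays the same when the argument is left at its default.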