Some phrases make much more sense when analysed as a single term. For example, one of my word clouds shows "cycle", but I know that this is a combination of "cell cycle" and other cycles; I would like to single out "cell cycle" as a term that should not be split on spaces. It is common practice when generating word clouds for research to specify a list of such n-grams that should be preserved.
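To illustrate what I mean by preserving a list of phrases, here is a minimal base-R sketch. `protect_phrases()` is a hypothetical helper (not part of this package or of tm) that joins each listed phrase with an underscore before tokenisation, so that a whitespace tokenizer keeps it as one token. It assumes exact, case-sensitive matches and that later cleaning steps do not strip underscores.

```r
# Hypothetical helper: replace the spaces inside each protected phrase with
# underscores so that splitting on whitespace keeps the phrase together.
protect_phrases <- function(text, phrases) {
  for (p in phrases) {
    joined <- gsub(" ", "_", p, fixed = TRUE)
    text <- gsub(p, joined, text, fixed = TRUE)
  }
  text
}

protected <- protect_phrases("the cell cycle and other cycle types", "cell cycle")
strsplit(protected, " ", fixed = TRUE)[[1]]
```

With this, "cell_cycle" would be counted separately from the remaining occurrences of "cycle".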
On a related note, it could be useful to allow including all n-grams up to a specified length n. The FAQ section of the tm package describes how to do this by providing a custom tokenizer.
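For reference, a sketch along the lines of the tm FAQ example is below. `BigramTokenizer` follows the FAQ's pattern; `UpToNgramTokenizer` and its `max_n` argument are my own hypothetical generalisation, and the sample texts are made up. Note that, as far as I know, custom tokenizers are only honoured for a `VCorpus`, not a `SimpleCorpus`.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Bigram tokenizer in the style of the tm FAQ.
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

# Hypothetical generalisation: all n-grams of length 1 up to max_n.
UpToNgramTokenizer <- function(x, max_n = 2L) {
  w <- words(x)
  sizes <- seq_len(min(max_n, length(w)))  # never ask for n-grams longer than the text
  grams <- lapply(sizes, function(k) {
    vapply(ngrams(w, k), paste, character(1), collapse = " ")
  })
  as.character(unlist(grams, use.names = FALSE))
}

# Made-up sample texts, only for illustration.
corpus <- VCorpus(VectorSource(c("cell cycle arrest", "the cell cycle is regulated")))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
Terms(tdm)
```

A different `max_n` can be supplied through a small wrapper, e.g. `control = list(tokenize = function(x) UpToNgramTokenizer(x, 3L))`.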
A simple solution would be to expose the `control` list as an argument that users can customize (thus providing a custom tokenizer).
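A rough sketch of what exposing the argument might look like is below. `build_term_matrix()` is a hypothetical stand-in for whatever function in this package currently builds the term-document matrix; I am assuming it calls `TermDocumentMatrix()` from tm internally. The default `control = list()` keeps the current behaviour for existing users.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Hypothetical stand-in for the package's internal text-to-matrix step;
# `control` is forwarded to tm unchanged.
build_term_matrix <- function(texts, control = list()) {
  corpus <- VCorpus(VectorSource(texts))
  TermDocumentMatrix(corpus, control = control)
}

# A user could then plug in a bigram tokenizer without changes to the package.
bigrams <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

tdm <- build_term_matrix(
  c("cell cycle arrest", "the cell cycle is regulated"),
  control = list(tokenize = bigrams)
)
Terms(tdm)
```

Passing the list through untouched would also let users set other tm options (stop words, word length bounds, and so on) without further changes on the package side.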