All two-letter gene/protein/metabolite names are disregarded

By default, TermDocumentMatrix drops all one- and two-letter words (the default is wordLengths = c(3, Inf)). I think the one-letter words are not hugely important, but quite a few process-name abbreviations, genes, proteins and metabolites with two-letter names appear in pathway names. It would be nice to have a way to preserve those. The issue arises in the following lines:

https://github.com/jokergoo/simplifyEnrichment/blob/3dbbaf52e68ba6e81bb3c2126dd15770f8d718d8/R/word_count.R#L40-L41

and can be reproduced with:

library(tm)

docs = Corpus(VectorSource(c('ER', 'PE', '333')))  # estrogen receptor, phosphatidylethanolamine, and three threes
tdm = TermDocumentMatrix(docs)
v = sort(slam::row_sums(tdm), decreasing = TRUE)
print(v)

333 
  1

and with:

print(tdm)

<<TermDocumentMatrix (terms: 1, documents: 3)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 3
Weighting          : term frequency (tf)

I find the tm documentation very sparse, but this answer on SO says we can fix that by passing control=list(wordLengths=c(1,Inf)), which indeed works:

tdm = TermDocumentMatrix(docs, control=list(wordLengths=c(1,Inf)))
v = sort(slam::row_sums(tdm), decreasing = TRUE)
print(v)

 er  pe 333 
  1   1   1
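
Since one-letter words matter less, a lower bound of 2 might be a reasonable middle ground. Below is a minimal sketch of how the minimum term length could be exposed as an argument. The helper `count_terms` and its `min_word_length` parameter are hypothetical names made up for illustration, not part of simplifyEnrichment; only `wordLengths` is an actual tm control option.

```r
library(tm)

# Hypothetical helper (not part of simplifyEnrichment): count term
# frequencies while letting the caller choose the minimum term length.
count_terms <- function(text, min_word_length = 2) {
  docs <- Corpus(VectorSource(text))
  # wordLengths = c(lower, upper) keeps terms whose length is within the bounds;
  # tm's default lower bound is 3, which drops two-letter names.
  tdm <- TermDocumentMatrix(
    docs,
    control = list(wordLengths = c(min_word_length, Inf))
  )
  sort(slam::row_sums(tdm), decreasing = TRUE)
}

# Two-letter abbreviations are kept, single letters are still dropped.
v <- count_terms(c("ER stress", "PE synthesis", "a b"))
print(names(v))
```

With the default of 2, terms such as "er" and "pe" should survive while single letters are still filtered out; passing `min_word_length = 1` would match the behaviour shown above.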

Sorry for the deluge of issues!
