TermDocumentMatrix removes all one- and two-letter words by default. One-letter words are probably not hugely important, but quite a few process name abbreviations, genes, proteins and metabolites with two-letter names are used in pathway names. It would be nice to be able to preserve those. The issue can be reproduced with:
library(tm)
# estrogen receptor, phosphatidylethanolamine, and a three-character control term
docs = Corpus(VectorSource(c('ER', 'PE', '333')))
tdm = TermDocumentMatrix(docs)
# total count of each term across all documents
v = sort(slam::row_sums(tdm), decreasing = TRUE)
print(v)
333
1
and with:
print(tdm)
<<TermDocumentMatrix (terms: 1, documents: 3)>>
Non-/sparse entries: 1/2
Sparsity : 67%
Maximal term length: 3
Weighting : term frequency (tf)
I find the tm documentation very sparse, but this answer on SO tells us that we can fix that by passing control = list(wordLengths = c(1, Inf)) to TermDocumentMatrix, which indeed works.
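A minimal sketch of that fix applied to the example above, assuming the tm and slam packages are installed. The wordLengths option is described under ?termFreq in tm, where the default lower bound is 3 characters. Note that terms are lowercased by default, so the two-letter names should come back as "er" and "pe".

```r
library(tm)

# Same three single-term documents as in the reproduction above.
docs <- Corpus(VectorSource(c("ER", "PE", "333")))

# Lower the minimum word length from the default of 3 to 1,
# so that one- and two-letter terms are kept.
tdm <- TermDocumentMatrix(docs, control = list(wordLengths = c(1, Inf)))

# Total count of each term across all documents.
v <- sort(slam::row_sums(tdm), decreasing = TRUE)
print(v)
```

Using c(2, Inf) instead would keep two-letter names while still dropping single letters, which may be closer to what is needed for pathway names.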
The issue arises in the following lines: https://github.com/jokergoo/simplifyEnrichment/blob/3dbbaf52e68ba6e81bb3c2126dd15770f8d718d8/R/word_count.R#L40-L41

Sorry for the deluge of issues!