jokergoo / simplifyEnrichment

Simplify functional enrichment results
https://jokergoo.github.io/simplifyEnrichment

All two-letter gene/protein/metabolite names are disregarded #53

Closed krassowski closed 3 years ago

krassowski commented 3 years ago

By default, TermDocumentMatrix removes all one- and two-letter words. I think one-letter words are not hugely important, but quite a few process-name abbreviations, genes, proteins and metabolites with two-letter names appear in pathway names. It would be nice to have a way to preserve those. The issue arises in the following lines:

https://github.com/jokergoo/simplifyEnrichment/blob/3dbbaf52e68ba6e81bb3c2126dd15770f8d718d8/R/word_count.R#L40-L41

and can be reproduced with:

library(tm)
docs = Corpus(VectorSource(c('ER', 'PE', '333')))  # estrogen receptor, phosphatidylethanolamine, and a plain number
tdm = TermDocumentMatrix(docs)
v = sort(slam::row_sums(tdm), decreasing = TRUE)
print(v)
333 
  1 

and with:

print(tdm)
<<TermDocumentMatrix (terms: 1, documents: 3)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 3
Weighting          : term frequency (tf)

I find the tm documentation very sparse, but this answer on Stack Overflow shows that passing control=list(wordLengths=c(1,Inf)) fixes this, and it does indeed work:

tdm = TermDocumentMatrix(docs, control=list(wordLengths=c(1,Inf)))
v = sort(slam::row_sums(tdm), decreasing = TRUE)
print(v)
 er  pe 333 
  1   1   1 
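
Since one-letter words matter less here, a lower bound of 2 might be a reasonable middle ground. This is only a minimal sketch, assuming the tm package is installed; the extra 'a' document is made up purely to show a one-letter token still being dropped:

```r
library(tm)

# Same toy corpus as above, plus a one-letter token ("a") added for illustration.
docs <- Corpus(VectorSource(c("ER", "PE", "333", "a")))

# wordLengths = c(2, Inf) keeps terms of two or more characters, so
# two-letter names survive while one-letter tokens are still removed.
tdm <- TermDocumentMatrix(docs, control = list(wordLengths = c(2, Inf)))

# Terms are lower-cased by default (tolower = TRUE).
kept <- Terms(tdm)
print(kept)
```

Whether 2 is the right cut-off for real enrichment results is a judgment call; ordinary two-letter words such as "of" or "to" would also get through unless they are filtered separately.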

Sorry for the deluge of issues!

jokergoo commented 3 years ago

I think we could also pre-analyze all GO/pathway terms and gene descriptions to build a whitelist of two-letter words.
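
One rough way such a whitelist could be built is sketched below in base R. Everything in it is an assumption for illustration: `two_letter_whitelist` is a hypothetical helper (not part of simplifyEnrichment), the term names are made up, and the "written in upper case" heuristic is just one possible way to separate abbreviations like ER from ordinary words like "to" or "of":

```r
# Hypothetical helper: collect two-letter tokens that look like abbreviations.
two_letter_whitelist <- function(term_names) {
  # Split each name on anything that is not a letter or a digit.
  tokens <- unlist(strsplit(term_names, "[^A-Za-z0-9]+"))
  # Assumed heuristic: exactly two characters, starting with an upper-case
  # letter and followed by an upper-case letter or a digit.
  is_candidate <- grepl("^[A-Z][A-Z0-9]$", tokens)
  # Lower-case the result so it matches tm's default lower-cased terms.
  sort(unique(tolower(tokens[is_candidate])))
}

# Made-up example term names, not taken from any real annotation database.
term_names <- c(
  "ER to Golgi vesicle-mediated transport",
  "PE biosynthetic process",
  "response to ER stress",
  "regulation of DNA repair"
)

whitelist <- two_letter_whitelist(term_names)
print(whitelist)
```

The resulting vector could then be used to decide which two-letter tokens to keep during word counting, while other short words continue to be dropped.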