Closed: TheOne000 closed this issue 6 years ago
Try this stackoverflow explanation (https://stackoverflow.com/questions/33770287/documenttermmatrix-needs-to-have-a-term-frequency-weighting-error) with a workaround. I have never done it myself.
Apparently, LDA requires TF, not TF-IDF, because it is measuring distributions.
I wouldn't recommend using LDA this way. I suppose you could do some data wrangling to get it into a usable format for LDA, but the authors of LDA clearly want TF.
What exactly are you trying to accomplish?
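(To see the incompatibility directly, here is a quick sketch, assuming the tm and topicmodels packages are installed; per the linked question, topicmodels' LDA() rejects a DTM that does not carry a term-frequency weighting:)

```r
library(tm)
library(topicmodels)

data("crude")

# Build a TF-IDF weighted DTM, then hand it to LDA().
dtmTfIdf <- DocumentTermMatrix(crude,
                               control = list(weighting = weightTfIdf))

# LDA() stops with an error because it expects raw term counts,
# not real-valued TF-IDF weights.
res <- tryCatch(LDA(dtmTfIdf, k = 2),
                error = function(e) conditionMessage(e))
res # error message: the DTM "needs to have a term frequency weighting"
```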
Hi Ted
My plan is to use TF-IDF as a tool to take some terms out of the
corpus after the analytical pre-processing. As you know, high-frequency words that are not in a stop word list do not always contribute meaningful information to the document.
'Term frequency' shows only how frequently a term appears in a
document, whereas TF-IDF also weights the terms by 'rarity' across the corpus. I would like to clean the corpus in this fashion before applying LDA to it.
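(As a tiny base-R illustration of that weighting, on a hypothetical three-document corpus, using the log2 IDF that tm's weightTfIdf also uses:)

```r
# Minimal TF-IDF illustration in base R (hypothetical toy corpus).
docs <- list(
  c("oil", "prices", "rise"),
  c("oil", "output", "falls"),
  c("markets", "rise")
)
terms <- sort(unique(unlist(docs)))

# Term frequency: raw counts per document (rows = docs, cols = terms).
tf <- t(sapply(docs, function(d) table(factor(d, levels = terms))))

# Inverse document frequency: log2(N / number of docs containing the term).
df  <- colSums(tf > 0)
idf <- log2(length(docs) / df)

# TF-IDF: terms that appear in many documents (e.g. "oil") are
# down-weighted relative to rarer ones (e.g. "prices").
tfidf <- sweep(tf, 2, idf, `*`)
```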
Thank you for your elaboration.
Sapphasak
Was giving this some thought, and I think you could build a TF-IDF weighted TDM, then apply a heuristic to identify the low-quality terms.
library(tm)

data("crude")

# Build a DTM with (non-normalized) TF-IDF weighting, stop words removed.
dtm <- DocumentTermMatrix(crude,
                          control = list(weighting = function(x)
                                           weightTfIdf(x, normalize = FALSE),
                                         stopwords = TRUE))

# Sum each term's TF-IDF weight across documents and rank from low to high.
dtmM     <- as.matrix(dtm)
tfScores <- colSums(dtmM)
tfScores <- data.frame(term = names(tfScores), tfScoring = tfScores)
tfScores <- tfScores[order(tfScores$tfScoring), ]

# Then perform a subset based on deciling, or another heuristic, for example:
drops <- subset(tfScores$term, tfScores$tfScoring <= 5) # or change to 0 etc.
drops <- as.character(drops)
drops
drops is a vector of terms that can be concatenated to the stop word list. The example above has no corpus-cleaning functions applied, so you would have to do that beforehand. Then you would have a TF-IDF-filtered, plain TF matrix for LDA.
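(For completeness, the hand-off to LDA could look roughly like this. A sketch, assuming the tm and topicmodels packages; the drops values and k = 2 are arbitrary placeholders, not part of the original answer:)

```r
library(tm)
library(topicmodels)

data("crude")

# Hypothetical low-value terms identified by the TF-IDF heuristic above.
drops <- c("reuter", "said")

# Concatenate the drops to the standard English stop word list, then
# rebuild the DTM with plain term-frequency weighting, which LDA expects.
customStops <- c(stopwords("english"), drops)
dtmTf <- DocumentTermMatrix(crude,
                            control = list(stopwords = customStops))

# Fit LDA on the TF matrix; the filtered terms never reach the model.
fit <- LDA(dtmTf, k = 2, control = list(seed = 123))
terms(fit, 5)
```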
Dear Ted
Question: Can we input a TF-IDF document-term matrix into Latent Dirichlet Allocation (LDA)? If yes, how?
It does not work in my case; the LDA function requires a 'term frequency' document-term matrix.
Thank you. (I made the question as concise as possible, so if you need more details, I can add them.)