kwartler / text_mining

This repo contains data from Ted Kwartler's "Text Mining in Practice With R" book.

Problem with glove function. Page 176 #4

Closed ingetic closed 6 years ago

ingetic commented 6 years ago

Problem on page 176:

fit.glove <- glove(tcm = tcm, word_vectors_size = 50, x_max = 10, learning_rate = 0.2, num_iters = 15)

RStudio says:

Error in .subset2(public_bind_env, "initialize")(...) : unused argument (grain_size = 100000) In addition: Warning message: 'glove' is deprecated. Use 'GloVe' instead. See help("Deprecated")

kwartler commented 6 years ago

The text2vec package went through some changes. I have updated and commented the code below.

# similar getting data but this gets from github directly
library(data.table)
library(text2vec)
library(tm)
library(RCurl)

options(stringsAsFactors = F)

text <- read.csv(text = getURL("https://raw.githubusercontent.com/kwartler/text_mining/master/bos_airbnb_1k.csv"))

Same data cleanup and organization as the book:

airbnb<-data.table(review_id=text$review_id,comments=text$comments,review_scores_rating=text$review_scores_rating)
airbnb$comments<- removeWords(airbnb$comments,c(stopwords('en'),'Boston'))
airbnb$comments<- removePunctuation(airbnb$comments)
airbnb$comments<- stripWhitespace(airbnb$comments)
airbnb$comments<- removeNumbers(airbnb$comments)
airbnb$comments<- tolower(airbnb$comments)

The package hasn't changed the setup functions.

tokens <- strsplit(airbnb$comments, split = " ", fixed = T)
vocab <- create_vocabulary(itoken(tokens),ngram = c(1, 1))
vocab<- prune_vocabulary(vocab, term_count_min = 5)

After constructing the iterator, the grow_dtm = F and skip_grams_window = 5 arguments have moved to create_tcm as function parameters:

iter <- itoken(tokens)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(iter, vectorizer,grow_dtm = F, skip_grams_window = 5)

As the error you reported states

'glove' is deprecated.
Use 'GloVe' instead.
See help("Deprecated")

So the call needs to be rewritten as below. Keep in mind the number-of-iterations parameter is now called n_iter and has moved to the next code chunk (the fitting step) instead of the GloVe construction.

fit.glove <- GloVe$new(word_vectors_size = 50,
                        vocabulary = vocab, x_max = 10, learning_rate = .2)

Now you construct the word vectors.

# Word Vecs
word.vectors <- fit.glove$fit_transform(tcm, n_iter = 15)

Unlike the book, I am using a different data set here, but the code has the same syntax. I am looking at terms for a "good rental". You should add and subtract terms to build the embedding for your particular data set.

good.rental <- word.vectors['value', , drop = FALSE] -
  word.vectors['terrible', , drop = FALSE] + 
  word.vectors['comfortably', , drop = FALSE] + 
  word.vectors['good', , drop = FALSE] +
  word.vectors['enjoyable', , drop = FALSE]+
  word.vectors['perfectly', , drop = FALSE]

Keep in mind you can call rownames(word.vectors) to see all the word vectors in the model to choose from.
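As a toy base-R illustration (a small random matrix standing in for the fitted model, not real GloVe output), checking rownames() before indexing avoids subscript errors for terms that were pruned from the vocabulary or never occurred:

```r
# Toy stand-in for the word-vector matrix: rows are terms, columns are dimensions
wv <- matrix(rnorm(9), nrow = 3,
             dimnames = list(c("good", "clean", "host"), NULL))

rownames(wv)                  # lists every term available for vector arithmetic
"good"  %in% rownames(wv)     # TRUE  -> safe to index wv["good", , drop = FALSE]
"hotel" %in% rownames(wv)     # FALSE -> indexing this term would throw an error
```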

The package now computes cosine similarities differently, as shown below. Be sure to check the package documentation for the available methods and norms, which will impact your results.

cos.dist <- dist2(good.rental, y = word.vectors, method = c("cosine"), norm = 'none')

Finally, calling head() on the result will show you the top words:

head(sort(cos.dist[1,], decreasing = T), 10)

In this example it shows:

> head(sort(cos.dist[1,], decreasing = T), 10)
showed      walk   welcome   arrived questions    street     sleep       get    garage       got 
7.490211  6.124875  5.360422  5.070921  4.943469  4.601429  4.553640  4.541574  4.474744  4.470029
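As a side note on what the "cosine" method measures, here is a self-contained base-R sketch of the textbook cosine-similarity formula (this is for intuition only; text2vec's own computation also depends on the norm argument you pass):

```r
# Cosine similarity: dot product of two vectors divided by the product of their norms
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine_sim(c(1, 0), c(1, 0))   # 1: same direction
cosine_sim(c(1, 0), c(0, 1))   # 0: orthogonal
cosine_sim(c(1, 2), c(2, 4))   # 1: scaling does not change the angle
```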