The text2vec package has gone through some changes, so I have updated and commented the code below.
# Same data as before, but read directly from GitHub
library(data.table)
library(text2vec)
library(tm)
library(RCurl)
options(stringsAsFactors = F)
text<-read.csv(text=getURL("https://raw.githubusercontent.com/kwartler/text_mining/master/bos_airbnb_1k.csv"))
Same data cleanup and organization:
airbnb<-data.table(review_id=text$review_id,comments=text$comments,review_scores_rating=text$review_scores_rating)
airbnb$comments<- removeWords(airbnb$comments,c(stopwords('en'),'Boston'))
airbnb$comments<- removePunctuation(airbnb$comments)
airbnb$comments<- stripWhitespace(airbnb$comments)
airbnb$comments<- removeNumbers(airbnb$comments)
airbnb$comments<- tolower(airbnb$comments)
The package has not changed the setup functions.
tokens <- strsplit(airbnb$comments, split = " ", fixed = T)
vocab <- create_vocabulary(itoken(tokens),ngram = c(1, 1))
vocab<- prune_vocabulary(vocab, term_count_min = 5)
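To make the pruning step concrete, here is a minimal base R sketch of what a minimum term count filter does. It does not call text2vec at all; `toy_tokens` and `min_count` are made-up names for illustration, and the real `prune_vocabulary()` has more options than this.

```r
# Toy token lists standing in for the output of strsplit() (hypothetical data)
toy_tokens <- list(
  c("great", "location", "clean"),
  c("great", "host", "clean"),
  c("great", "view")
)

# Count how often each term appears across all documents
term_counts <- table(unlist(toy_tokens))

# Keep only terms seen at least min_count times (the idea behind term_count_min)
min_count <- 2
kept_terms <- names(term_counts[term_counts >= min_count])
print(kept_terms)
```

Rare terms are dropped because there are too few co-occurrences to learn a useful vector for them.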
After constructing the iterator, note that grow_dtm = F and skip_grams_window = 5 were moved and are now function parameters of create_tcm.
iter <- itoken(tokens)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(iter, vectorizer,grow_dtm = F, skip_grams_window = 5)
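If it helps to see what a term co-occurrence matrix with a skip-gram window is counting, here is a rough base R sketch on a toy sequence. It is only an illustration under simplifying assumptions: `toks`, `window` and `toy_tcm` are hypothetical names, the counts are unweighted and directional, and the real `create_tcm()` differs (for example, I believe it down-weights more distant pairs by default and returns a sparse matrix).

```r
# One toy token sequence (hypothetical data)
toks <- c("great", "clean", "room", "great", "host")
window <- 2  # how many following tokens count as context

terms <- sort(unique(toks))
toy_tcm <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))

# For each token, count the tokens that follow it within the window
for (i in seq_along(toks)) {
  last <- min(i + window, length(toks))
  if (i < last) {
    for (j in (i + 1):last) {
      toy_tcm[toks[i], toks[j]] <- toy_tcm[toks[i], toks[j]] + 1
    }
  }
}
print(toy_tcm)
```

A larger window captures looser, more topical associations; a smaller one stays closer to immediate neighbors.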
As the error you reported states:
'glove' is deprecated.
Use 'GloVe' instead.
See help("Deprecated")
So the call needs to be rewritten as below. Keep in mind that the number-of-iterations parameter is now called n_iter and is passed in the next code chunk, when the model is fit, rather than when the GloVe object is constructed.
fit.glove <- GloVe$new(word_vectors_size = 50,
vocabulary = vocab, x_max = 10, learning_rate = .2)
Now you construct the word vectors.
# Word Vecs
word.vectors <- fit.glove$fit_transform(tcm, n_iter = 15)
I am using a different data set from the book, but the code has similar syntax. Here I am looking at terms for a "good rental". You should add and subtract terms to get the embedding that suits your particular data set.
good.rental <- word.vectors['value', , drop = FALSE] -
word.vectors['terrible', , drop = FALSE] +
word.vectors['comfortably', , drop = FALSE] +
word.vectors['good', , drop = FALSE] +
word.vectors['enjoyable', , drop = FALSE]+
word.vectors['perfectly', , drop = FALSE]
Keep in mind you can call rownames(word.vectors) to see all the terms in the model that you can choose from.
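Indexing a term that is not in the model fails with a "subscript out of bounds" error, so it can be worth filtering your terms against the row names first. Below is a small base R sketch of that idea on a made-up matrix; `toy_vectors`, `add_terms` and `sub_terms` are hypothetical names and the numbers are invented, not output from the model above.

```r
# Toy 3-dimensional embedding matrix; rows are named by term (made-up values)
toy_vectors <- rbind(
  good     = c(1.0, 0.5, 0.0),
  terrible = c(-1.0, 0.2, 0.1),
  value    = c(0.3, 0.9, 0.4)
)

# Terms to add and subtract; "enjoyable" is deliberately absent from the toy model
add_terms <- c("good", "value", "enjoyable")
sub_terms <- c("terrible")

# Drop any term that is not a row name, to avoid a "subscript out of bounds" error
add_terms <- intersect(add_terms, rownames(toy_vectors))
sub_terms <- intersect(sub_terms, rownames(toy_vectors))

# Sum the added rows, subtract the summed removed rows, keep a 1-row matrix
query <- colSums(toy_vectors[add_terms, , drop = FALSE]) -
  colSums(toy_vectors[sub_terms, , drop = FALSE])
query <- matrix(query, nrow = 1)
print(query)
```

The result has the same shape as a single row of the embedding matrix, so it can be compared against every term in the model.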
The package now computes cosine similarities differently, as shown below. Be sure to check the package documentation for the available methods and normalization options, since they will affect your results.
cos.dist<-dist2(good.rental, y = word.vectors, method = c("cosine"),norm = 'none')
Finally, calling head() on the sorted result will show you the top words:
head(sort(cos.dist[1,], decreasing = T), 10)
In this example it shows:
> head(sort(cos.dist[1,], decreasing = T), 10)
   showed      walk   welcome   arrived questions    street     sleep       get    garage       got
 7.490211  6.124875  5.360422  5.070921  4.943469  4.601429  4.553640  4.541574  4.474744  4.470029
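If you want to sanity-check what a cosine comparison is doing, here is a base R sketch on made-up vectors. It is not the package's code: `l2_normalize` is a hypothetical helper and the values are invented. As far as I understand, dist2() returns distances rather than similarities, and norm = 'none' skips normalization, so it is worth confirming in the documentation whether larger or smaller values mean "closer" for your settings.

```r
# Toy embedding matrix (made-up values); rows are terms
toy_vectors <- rbind(
  walk   = c(1, 0, 0),
  street = c(0.8, 0.6, 0),
  sleep  = c(0, 0, 1)
)
query <- matrix(c(2, 0, 0), nrow = 1)

# Scale every row to unit length so the dot product equals cosine similarity
l2_normalize <- function(m) m / sqrt(rowSums(m^2))

cos_sim <- l2_normalize(query) %*% t(l2_normalize(toy_vectors))

# Similarities lie in [-1, 1]; larger means more similar
ranked <- sort(cos_sim[1, ], decreasing = TRUE)
print(ranked)
```

With normalized vectors the scores are bounded, which makes them easier to interpret than raw, unnormalized values.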
Problem on page 176
fit.glove <- glove(tcm = tcm, word_vectors_size = 50, x_max = 10, learning_rate = 0.2, num_iters = 15)
RStudio says:
Error in .subset2(public_bind_env, "initialize")(...) : unused argument (grain_size = 100000)
In addition: Warning message:
'glove' is deprecated.
Use 'GloVe' instead.
See help("Deprecated")