I think this is more of a text2vec question. You can create the vocabulary from your own list of words instead of inheriting it from the data:
```r
v = create_vocabulary(YOUR_LIST_OF_WORDS_CHARACTER_VECTOR)
vv = vocab_vectorizer(v)
dtm = create_dtm(it, vv)
```
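For illustration, here is a minimal runnable version of the same idea (the word list and documents are made up):

```r
library(text2vec)

# hypothetical lexicon: the vocabulary comes from this list, not from the data
lexicon = c("good", "bad", "uncertain")

docs = c("the outlook is good", "results were bad and uncertain")
it = itoken(docs, tokenizer = word_tokenizer)

v = create_vocabulary(lexicon)   # character-vector method of create_vocabulary
vv = vocab_vectorizer(v)
dtm = create_dtm(it, vv)
dim(dtm)  # 2 documents x 3 lexicon terms; non-lexicon words are dropped
```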
@dselivanov OK, that is interesting, thanks!
If I understand correctly, `create_dtm` will map every document to the `vv` vocabulary, which means it will only consider the words that map to the original lexicon in `v`.
May I ask a follow-up question? I am not sure how to use `tokenizers` along with `text2vec`. In particular, there seems to be some redundancy between `tokenizers::tokenize_ngrams(james, n = 5, n_min = 2)` and `text2vec::create_vocabulary(it_train, ngram = c(1L, 2L))`.
One should only use the n-gram option once, right? Which one is more efficient? I would say using `create_vocabulary`, but you know better...
Thanks again!
`create_dtm` will map every document to the lexicon `v` using the vectorizer `vv` (a vectorizer is a function which maps words to indices). So yes, only words which are in `v` will be considered; other words will be omitted.
Regarding the second question: yes, there is such redundancy. This is because `tokenizers` works with more than just text2vec. If you need n-grams in text2vec, then use `tokenizers` only to tokenize the text into words (unigrams) and specify the `ngram` argument in text2vec. This is more efficient, and the information about n-grams will be stored in the vocabulary. So when you use the vocabulary vectorizer for DTM creation, it will automatically know which degree of n-grams you need.
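A short sketch of that division of labour, with made-up sample documents: `tokenizers` produces the unigrams and text2vec builds the n-grams.

```r
library(text2vec)
library(tokenizers)

docs = c("new york is a city", "london is a city")

# tokenizers only splits the text into unigrams...
it = itoken(docs, tokenizer = tokenize_words)

# ...and text2vec generates unigrams + bigrams from them;
# the ngram setting is stored in the vocabulary itself
v = create_vocabulary(it, ngram = c(1L, 2L))
vv = vocab_vectorizer(v)

# the vectorizer therefore produces unigram and bigram columns automatically
dtm = create_dtm(it, vv)
```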
I would recommend going through all the vignettes for `tokenizers` and `text2vec`.
Thanks @dselivanov @lmullen for your time. I did go through all the vignettes, but sometimes it is just hard to see the big picture.
Keep up the great work!
Hello again,
I am wondering if `tokenizers` can use user-provided lexicons to tokenize a document. Something similar to http://tidytextmining.com/sentiment.html, where one can use the `afinn`, `bing`, or `nrc` lexicons, or the well-known `loughran` lexicon (https://www3.nd.edu/~mcdonald/Word_Lists.html), to keep only meaningful words in a document (BEFORE getting the `dtm` matrix). Is it possible to do so in `text2vec`/`tokenizers`?
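A minimal sketch of one way this effect could be achieved, building on the word-list approach from the top of the thread: filter tokens through the lexicon inside a custom tokenizer, so the filtering happens before the DTM is built. The lexicon and documents below are made-up stand-ins:

```r
library(text2vec)

# hypothetical stand-in for a real lexicon such as Loughran-McDonald
lexicon = c("uncertain", "litigation", "adverse")

docs = c("the filing mentions litigation risk",
         "outlook remains uncertain, even adverse")

# custom tokenizer: split into words, then keep only lexicon words
lex_tokenizer = function(x) {
  lapply(tokenizers::tokenize_words(x), function(w) w[w %in% lexicon])
}

it = itoken(docs, tokenizer = lex_tokenizer)
v = create_vocabulary(it)
dtm = create_dtm(it, vocab_vectorizer(v))
# dtm columns now cover only lexicon words that occur in the documents
```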