ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

can we use alternative lexicons? #47

Closed · randomgambit closed this 7 years ago

randomgambit commented 7 years ago

Hello again,

I am wondering if tokenizers can use user-provided lexicons to tokenize a document.

Something similar to http://tidytextmining.com/sentiment.html, where one can use the AFINN, Bing, or NRC lexicons, or the well-known Loughran-McDonald lexicon (https://www3.nd.edu/~mcdonald/Word_Lists.html), to keep only meaningful words in a document (BEFORE building the document-term matrix).

Is it possible to do so in text2vec / tokenizers?

dselivanov commented 7 years ago

I think this is more of a text2vec question. You can create a vocabulary from your own list of words instead of inferring it from the data:

# build the vocabulary from your own character vector of terms
v = create_vocabulary(YOUR_LIST_OF_WORDS_CHARACTER_VECTOR)
# create a vectorizer, which maps words to column indices
vv = vocab_vectorizer(v)
# 'it' is an itoken iterator over your documents
dtm = create_dtm(it, vv)
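
For context, a minimal end-to-end sketch of that idea (the documents, the lexicon, and the iterator setup below are illustrative, not from the thread):

library(text2vec)

# toy documents and a hypothetical user-provided lexicon
docs = c("the market crash caused severe losses",
         "profits grew despite losses elsewhere")
lexicon = c("crash", "losses", "profits")

it = itoken(docs, tokenizer = word_tokenizer)
v = create_vocabulary(lexicon)   # vocabulary fixed to the lexicon terms
vv = vocab_vectorizer(v)
dtm = create_dtm(it, vv)         # columns correspond to lexicon terms only
dim(dtm)                         # 2 documents x 3 lexicon terms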
randomgambit commented 7 years ago

@dselivanov OK that is interesting, thanks!

If I understand correctly, create_dtm will map every document to the vv vocabulary, which means it will consider only the words that appear in the original lexicon v.

May I ask a follow-up question? I am not sure how to use tokenizers along with text2vec. In particular, there seems to be some redundancy between tokenizers::tokenize_ngrams(james, n = 5, n_min = 2) and text2vec::create_vocabulary(it_train, ngram = c(1L, 2L)).

One should specify the ngram option only once, right? Which one is more efficient? I would guess create_vocabulary, but you know better...

Thanks again~

dselivanov commented 7 years ago

create_dtm will map every document to the lexicon v using the vectorizer vv (a vectorizer is a function which maps words to indices). So yes, only words which are in v will be considered; all other words will be omitted.

Regarding the second question - yes, there is such redundancy. This is because tokenizers is designed to work well not only with text2vec. If you need ngrams in text2vec, then use tokenizers just to tokenize the text into words (unigrams) and specify the ngram argument in text2vec. This will be more efficient, and the information about the ngrams will be stored in the vocabulary. So when you use the vocabulary vectorizer for DTM creation, it will automatically understand which degree of ngrams you need. I would recommend you go through all the vignettes for tokenizers and text2vec.
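
A minimal sketch of that division of labor (the documents are made up for illustration; tokenizers produces the unigrams, text2vec builds the ngrams):

library(tokenizers)
library(text2vec)

docs = c("the quick brown fox", "the lazy dog sleeps")

# tokenizers: plain word (unigram) tokenization only
tokens = tokenize_words(docs)

# text2vec: ngram construction happens in the vocabulary
it = itoken(tokens)
v = create_vocabulary(it, ngram = c(1L, 2L))  # unigrams and bigrams
vv = vocab_vectorizer(v)
dtm = create_dtm(it, vv)  # vectorizer already knows the ngram degree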

randomgambit commented 7 years ago

Thanks @dselivanov and @lmullen for your time. I did go through all the vignettes, but sometimes it is just hard to see the big picture.

Keep up the great work!