Closed · hope-data-science · closed 4 years ago

I want to get n-grams using the tokenize_ngrams() function; however, I want the n-grams to be separated by punctuation. For example, "Hello my darling, how are you today?" should never output the 2-gram "darling how". Any ideas on adding this feature to tokenize_ngrams()?
quanteda does this by allowing a "pad" (a ghost non-token) to remain in place of removed items; pads are never formed into n-grams. We use this a lot to stop collocations from being detected across gaps created by removing not only punctuation characters but also stopwords. Too many existing approaches remove material and then treat tokens as adjacent, after removal, that were never adjacent before!
library("quanteda")
## Package version: 1.5.2
txt <- "Hello my darling, how are you today?"
tokens(txt) %>%
  tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
  tokens_ngrams(n = 2)
## tokens from 1 document.
## text1 :
## [1] "Hello_my" "my_darling" "how_are" "are_you" "you_today"
Unlike with skip-grams, when using n-grams we might want to extract meaningful phrases using tokenize_ngrams. Is there a way to split the text with a dictionary and get a more precise split? For example, I want to tokenize "My study area is global warming." with the dictionary c("study area", "global warming") and end up with "My" "study area" "is" "global warming".
I think we might be on the issue tracker for the wrong package, but here you go:
library("quanteda")
## Package version: 1.5.2
txt <- "My study area is global warming."
dict <- phrase(c("study area", "global warming"))
tokens(txt) %>%
  tokens_compound(pattern = dict, concatenator = " ") %>%
  tokens(remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "My" "study area" "is" "global warming"
This takes your phrase dictionary and combines the sequences post-tokenization into a single, compounded "token".
I am the one to blame, since this was my initial problem to solve. Excellent solution; I cannot thank you enough for the contribution you have made.
No worries! We love seeing people adapt the tokens functions in quanteda to solve these sorts of problems. I suggest studying the tokens_lookup(), tokens_compound(), and phrase() functions.
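For a sense of how tokens_lookup() works, here is a minimal, untested sketch; the dictionary keys and values below are made up purely for illustration:
library("quanteda")
# Illustrative dictionary: keys are labels, values are (possibly multi-word)
# patterns; multi-word values are matched as phrases when looked up.
dict_lookup <- dictionary(list(
  environment = c("global warming", "climate change"),
  research = c("study area")
))
tokens("My study area is global warming.") %>%
  tokens_lookup(dictionary = dict_lookup, exclusive = FALSE)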
Working on it. I think completing this task requires a general dictionary plus a domain-specific, user-defined dictionary. Any suggestions for a general dictionary? And does performance become an issue as the dictionary size increases?
None yet that I know of, but you can detect them using textstat_collocations().
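As a rough, untested sketch of what that detection could look like, using the inaugural-address corpus that ships with quanteda (the size and min_count thresholds are arbitrary):
library("quanteda")
# Detect candidate multi-word expressions from a corpus; in version 1.5.2,
# textstat_collocations() is part of quanteda itself.
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks, size = 2, min_count = 10)
head(colls)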
Thank goodness I did not have to write this function myself. I think you could consider wrapping the answered code into a general tokenization function; it could be a flagship function for English tokenizers in the R community. Maybe it could be included in the tokenizers package.
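A rough sketch of what such a wrapper might look like; the name tokenize_with_phrases() is invented here, and this is only one possible design:
library("quanteda")
# Hypothetical wrapper (not part of quanteda or tokenizers): tokenize text
# while keeping user-supplied phrases together as single tokens, returning
# a plain list of character vectors.
tokenize_with_phrases <- function(x, phrases, concatenator = " ") {
  tokens(x) %>%
    tokens_compound(pattern = phrase(phrases), concatenator = concatenator) %>%
    tokens(remove_punct = TRUE) %>%
    as.list()
}
tokenize_with_phrases("My study area is global warming.",
                      c("study area", "global warming"))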
One more question:
library("quanteda")
## Package version: 1.5.2
txt <- "My study area is global warming."
dict <- phrase(c("My study","study area", "global warming"))
tokens(txt) %>%
tokens_compound(pattern = dict, concatenator = " ") %>%
tokens(remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "My study area" "is" "global warming"
With a large dictionary, this case comes up often, and it is not desirable to have all the words concatenated. Is there a way to assign weights and decide the priority of concatenation?
The functions only work if the objects are quanteda tokens objects, but any list of tokens from tokenizers can be coerced into quanteda tokens using as.tokens().
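For example, here is a minimal, untested sketch of that coercion, assuming the tokenizers package is installed:
library("quanteda")
library("tokenizers")
# tokenize_words() returns a list of character vectors, which as.tokens()
# converts into a quanteda tokens object (note that it lowercases by default).
tok_list <- tokenize_words("My study area is global warming.")
toks <- as.tokens(tok_list)
tokens_compound(toks, pattern = phrase("global warming"), concatenator = " ")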
There is no way to assign priorities for compounding, but you can get all of them using join = FALSE:
tokens(txt) %>%
  tokens_compound(pattern = dict, concatenator = " ", join = FALSE) %>%
  tokens(remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "My study" "study area" "is" "global warming"
@hope-data-science Glad you found a solution that works for you. Thanks, @kbenoit.