ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Output ngram might consider punctuation separation? #74

Closed: hope-data-science closed this issue 4 years ago

hope-data-science commented 4 years ago

I want to get n-grams using the tokenize_ngrams() function; however, I want them to be split on punctuation. For example, "Hello my darling, how are you today?" should never produce the 2-gram "darling how". Any ideas for adding this feature to tokenize_ngrams()?

kbenoit commented 4 years ago

quanteda does this by allowing a "pad" (a ghost non-token) to remain in place of removed items, and pads are never formed into ngrams. We use this a lot to stop collocations from being detected when they are separated not only by punctuation characters but also by stopwords. Too many existing approaches remove material and then treat tokens as adjacent, after removal, that were never adjacent in the original text!

library("quanteda")
## Package version: 1.5.2

txt <- "Hello my darling, how are you today?"

tokens(txt) %>%
  tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
  tokens_ngrams(n = 2)
## tokens from 1 document.
## text1 :
## [1] "Hello_my"   "my_darling" "how_are"    "are_you"    "you_today"
hope-data-science commented 4 years ago

Unlike with skip-grams, when using n-grams we might want to extract meaningful phrases using tokenize_ngrams. Is there a way to split the text with a dictionary to get a more precise split? For example, I want to tokenize "My study area is global warming." with the dictionary c("study area", "global warming") and finally get "My" "study area" "is" "global warming".

kbenoit commented 4 years ago

I think we might be on the issue tracker for the wrong package, but here you go:

library("quanteda")
## Package version: 1.5.2

txt <- "My study area is global warming."
dict <- phrase(c("study area", "global warming"))

tokens(txt) %>%
  tokens_compound(pattern = dict, concatenator = " ") %>%
  tokens(remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "My"             "study area"     "is"             "global warming"

This takes your phrase dictionary and combines the matched sequences, post-tokenization, into single compounded "tokens".
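
If you need the result back as a plain list of character vectors (the shape that tokenizers returns), the tokens object can be converted with as.list(); a minimal sketch, reusing txt and dict from above:

as.list(
  tokens(txt) %>%
    tokens_compound(pattern = dict, concatenator = " ") %>%
    tokens(remove_punct = TRUE)
)
## should return a named list, e.g. list(text1 = c("My", "study area", "is", "global warming"))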

hope-data-science commented 4 years ago

I am the one to blame, but this was my original problem to solve. Excellent solution; I cannot thank you enough for the contribution you have made.

kbenoit commented 4 years ago

No worries! We love seeing people adapt the tokens functions in quanteda to solve these sorts of problems. I suggest studying the tokens_lookup(), tokens_compound() and phrase() functions.
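
For reference, a minimal sketch of tokens_lookup() with a quanteda dictionary() (the key and values here are made up for illustration):

library("quanteda")

# Multi-word dictionary values are matched as phrases; with
# exclusive = FALSE, unmatched tokens are kept alongside the keys.
lookup_dict <- dictionary(list(climate = c("global warming", "climate change")))
tokens("My study area is global warming.", remove_punct = TRUE) %>%
  tokens_lookup(dictionary = lookup_dict, exclusive = FALSE)
## should give, roughly: "My" "study" "area" "is" "CLIMATE"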

hope-data-science commented 4 years ago

Working on it. I think completing this task requires a general dictionary plus a domain-specific, user-defined dictionary. Any ideas for a suggested general dictionary? And as the dictionary size increases, does performance become an issue?

kbenoit commented 4 years ago

None yet that I know of, but you can detect them using textstat_collocations().
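
A quick sketch of how textstat_collocations() could be used to mine candidate phrases for such a dictionary (the example texts are made up):

library("quanteda")

txts <- c("my study area is global warming",
          "global warming is central to my study area")

# Score two-word candidate phrases; the top-scoring pairs can then be
# passed to phrase() / tokens_compound() as a data-driven dictionary.
textstat_collocations(tokens(txts), size = 2, min_count = 2)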

hope-data-science commented 4 years ago

Thank goodness I did not write this function myself. You might consider wrapping the code from your answer into a general tokenization function; it could be a flagship function for English tokenizers in the R community. Maybe it could be included in the tokenizers package.

hope-data-science commented 4 years ago

One more question:

library("quanteda")
## Package version: 1.5.2

txt <- "My study area is global warming."
dict <- phrase(c("My study","study area", "global warming"))

tokens(txt) %>%
  tokens_compound(pattern = dict, concatenator = " ") %>%
  tokens(remove_punct = TRUE)

## tokens from 1 document.
## text1 :
## [1] "My study area"  "is"             "global warming"

With a large dictionary this case comes up often. However, it is not always desirable to have all the words concatenated. Is there a way to assign weights and decide the priority of concatenation?

kbenoit commented 4 years ago

The functions only work if the objects are quanteda tokens objects, but any list of tokens from tokenizers can be coerced into quanteda tokens using as.tokens().
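
For example, a minimal sketch of that coercion (reusing the example text from above; tokenize_words() is from tokenizers):

library("tokenizers")
library("quanteda")

txt <- "My study area is global warming."

# Tokenize with tokenizers, then coerce to a quanteda tokens object so
# that tokens_compound() and friends can be applied.
toks <- as.tokens(tokenize_words(txt, lowercase = FALSE))
tokens_compound(toks,
                pattern = phrase(c("study area", "global warming")),
                concatenator = " ")
## should give, roughly: "My" "study area" "is" "global warming"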

There is no way to assign priorities for compounding, but you can get all of them using join = FALSE:

tokens(txt) %>%
  tokens_compound(pattern = dict, concatenator = " ", join = FALSE) %>%
  tokens(remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "My study"       "study area"     "is"             "global warming"
lmullen commented 4 years ago

@hope-data-science Glad you found a solution that works for you. Thanks, @kbenoit.