ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

tokenize_tweets doesn't separate emojis with no spaces between them #68

Closed EmilHvitfeldt closed 3 years ago

EmilHvitfeldt commented 6 years ago

As the title says, tokenize_tweets() doesn't separate emojis that have no space between them. Furthermore, anything following an emoji (without a space) is grouped together with the emoji.

tokenizers::tokenize_tweets(c("test string \U0001f4e6\U0001f47e",
                              "\U0001f4e6Don't match",
                              "\U0001f4e6\U0001f47e",
                              "\U0001f4e6#hashtag",
                              "\U0001f4e6@User"))

[[1]]
[1] "test"                 "string"               "\U0001f4e6\U0001f47e"

[[2]]
[1] "\U0001f4e6dont" "match"         

[[3]]
[1] "\U0001f4e6\U0001f47e"

[[4]]
[1] "\U0001f4e6hashtag"

[[5]]
[1] "\U0001f4e6user"

Is this something that can be handled, or is it simply a limitation of the algorithm? :)

kbenoit commented 6 years ago

Interestingly, this is affected by the strip_punct argument:

txt <- c("test string \U0001f4e6\U0001f47e",
                              "\U0001f4e6Don't match",
                              "\U0001f4e6\U0001f47e",
                              "\U0001f4e6#hashtag",
                              "\U0001f4e6@User")

tokenize_tweets(txt, strip_punct = TRUE)[[3]]
## [1] "\U0001f4e6\U0001f47e"
tokenize_tweets(txt, strip_punct = FALSE)[[3]]
## [1] "\U0001f4e6" "\U0001f47e"

Similar behaviour is observed in tokenize_words():

tokenize_words(txt, strip_punct = TRUE)[[3]]
## character(0)
tokenize_words(txt, strip_punct = FALSE)[[3]]
## [1] "\U0001f4e6" "\U0001f47e"

EmilHvitfeldt commented 6 years ago

Interesting find! However, this alone wouldn't be enough, as some emojis span multiple code points to allow modifiers such as skin-tone modifiers:

txt <- "\U0001f64c\U0001f3ff\U0001f4e6"

tokenize_words(txt, strip_punct = FALSE)[[1]]
[1] "\U0001f64c" "\U0001f3ff" "\U0001f4e6"

should in reality be

[1] "\U0001f64c\U0001f3ff" "\U0001f4e6"

as \U0001f3ff is a "dark skin tone" modifier applied to \U0001f64c (🙌).

Another example is the flag emojis, which combine two regional indicator symbols (letters) to form a flag.

\U0001f1ee\U0001f1f9 should be kept together since it's a flag 🇮🇹.
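
For what it's worth, stringi's boundary analysis already knows about grapheme clusters. A minimal sketch, assuming a reasonably recent ICU in which modifier sequences and regional-indicator pairs count as single extended grapheme clusters, splitting on character boundaries instead of code points:

library(stringi)

txt <- "\U0001f64c\U0001f3ff\U0001f4e6 \U0001f1ee\U0001f1f9"

# Split on grapheme cluster ("character") boundaries rather than code
# points, so the skin-tone modifier stays attached to its base emoji
# and the two regional indicators stay together as one flag.
stri_split_boundaries(txt, type = "character")[[1]]
## Expected (ICU-version dependent):
## [1] "\U0001f64c\U0001f3ff" "\U0001f4e6" " "
## [4] "\U0001f1ee\U0001f1f9"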

lmullen commented 6 years ago

Tokenization into words in this package is heavily dependent on the stringi package. If @kbenoit is interested in putting together a PR for the tokenize_tweets() function, I will of course be glad to accept it. But for the rest of the tokenizers, this issue with treating emoji as words may be better addressed in stringi.
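
For context, the same behaviour is visible at the stringi layer itself. A minimal sketch, assuming tokenize_words() delegates to stri_split_boundaries() with type = "word" and that strip_punct maps onto the skip_word_none break-iterator option:

library(stringi)

# With skip_word_none = TRUE, segments containing no word characters
# (punctuation, symbols, and emoji alike) are dropped, which would
# explain the character(0) result above.
stri_split_boundaries("\U0001f4e6\U0001f47e", type = "word",
                      skip_word_none = TRUE)[[1]]
## character(0)

# With skip_word_none = FALSE the emoji survive, but each code point
# is still a separate segment.
stri_split_boundaries("\U0001f4e6\U0001f47e", type = "word",
                      skip_word_none = FALSE)[[1]]
## [1] "\U0001f4e6" "\U0001f47e"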

kbenoit commented 6 years ago

I’ll be happy to look into this - in a few weeks when exam season is done here!