Interestingly, this is affected by the strip_punct argument:
txt <- c("test string \U0001f4e6\U0001f47e",
"\U0001f4e6Don't match",
"\U0001f4e6\U0001f47e",
"\U0001f4e6#hashtag",
"\U0001f4e6@User")
tokenize_tweets(txt, strip_punct = TRUE)[[3]]
## [1] "\U0001f4e6\U0001f47e"
tokenize_tweets(txt, strip_punct = FALSE)[[3]]
## [1] "\U0001f4e6" "\U0001f47e"
Similar behaviour is observed in tokenize_words():
tokenize_words(txt, strip_punct = TRUE)[[3]]
## character(0)
tokenize_words(txt, strip_punct = FALSE)[[3]]
## [1] "\U0001f4e6" "\U0001f47e"
Interesting find! However, this alone wouldn't be enough, as some emoji consist of multiple code points, for example to allow modifiers such as skin-tone modifiers:
txt <- "\U0001f64c\U0001f3ff\U0001f4e6"
tokenize_words(txt, strip_punct = FALSE)[[1]]
[1] "\U0001f64c" "\U0001f3ff" "\U0001f4e6"
should in reality be
## [1] "\U0001f64c\U0001f3ff" "\U0001f4e6"
as \U0001f3ff is a "dark skin tone" modifier applied to \U0001f64c (🙌).
Another example is the flag emojis, which combine two regional indicator symbols (letters) to make the flags work. \U0001f1ee\U0001f1f9 should be kept together since it's a flag 🇮🇹.
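Both cases are what ICU's grapheme-cluster segmentation is meant to handle, and stringi exposes it directly. A minimal sketch (assuming a reasonably recent stringi/ICU; how completely the emoji cluster rules are applied depends on the bundled ICU version):
library(stringi)

multi <- c("\U0001f64c\U0001f3ff\U0001f4e6",  # raised hands + skin-tone modifier, then package
           "\U0001f1ee\U0001f1f9")            # Italian flag: two regional indicators

# Grapheme-cluster ("character") boundaries keep modifier sequences and
# regional-indicator pairs together, unlike word boundaries
stri_split_boundaries(multi, type = "character")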
Tokenization into words in this package is heavily dependent on the stringi package. If @kbenoit is interested in putting together a PR for the tokenize_tweets() function, I will of course be glad to accept it. But for the rest of the tokenizers, this issue with treating emoji as words may be better addressed in stringi.
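For context, the word-boundary behaviour can be reproduced with stringi directly; this is a sketch of the underlying ICU segmentation rather than necessarily the exact call tokenize_words() makes:
library(stringi)

# When segments containing no word characters are skipped, the emoji are
# dropped entirely (they are symbols, not word characters)
stri_split_boundaries("\U0001f4e6\U0001f47e", type = "word",
                      skip_word_none = TRUE)

# When they are not skipped, each emoji comes back as its own segment,
# matching the strip_punct = FALSE output above
stri_split_boundaries("\U0001f4e6\U0001f47e", type = "word",
                      skip_word_none = FALSE)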
I’ll be happy to look into this - in a few weeks when exam season is done here!
As the title says, tokenize_tweets() does not separate emojis that don't have a space between them. Furthermore, anything following an emoji (without a space) is grouped together with the emoji.
Is it something that can be handled, or is it simply a limitation of the algorithm? :)
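A minimal reproduction of the report (output omitted; per the description, the emoji stays attached to the token that follows it):
library(tokenizers)

# No space between the emoji and the text that follows it
tokenize_tweets("\U0001f4e6Don't match")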