quanteda / quanteda.textstats

Textual statistics for quanteda
GNU General Public License v3.0

How to match Emoji variant selector? #64

Open koheiw opened 10 months ago

koheiw commented 10 months ago

I noticed that \\p{Emoji_Presentation} does not match the emoji 💪️ in a test because of the variant selector "\ufe0f".

require(quanteda.textstats)
#> Loading required package: quanteda.textstats
require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.

txt <- "£ € 👏 Rock on❗ 💪️🎸"
toks <- tokens(txt)
toks
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "£"    "€"    "👏"   "Rock" "on"   "❗"   "💪️"   "🎸"
tokens_select(toks, "^\\p{Emoji_Presentation}+$", valuetype = "regex")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "👏" "❗" "🎸"
tokens_select(toks, "\\p{Emoji_Presentation}", valuetype = "regex")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "👏" "❗" "💪️" "🎸"
stringi::stri_extract_all_regex(txt, "\\p{Emoji_Presentation}")
#> [[1]]
#> [1] "👏" "❗" "💪" "🎸"
stringi::stri_escape_unicode(types(toks))
#> [1] "\\u00a3"            "\\u20ac"            "\\U0001f44f"       
#> [4] "Rock"               "on"                 "\\u2757"           
#> [7] "\\U0001f4aa\\ufe0f" "\\U0001f3b8"

Created on 2023-10-18 with reprex v2.0.2
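
One possible workaround, sketched with stringi only: the anchored pattern in the reprex fails because the token is U+1F4AA followed by U+FE0F, and U+FE0F itself is not an Emoji_Presentation character. Allowing an optional U+FE0F after each emoji, or stripping it before matching, might be enough. The names `strict`, `tolerant`, and `stripped` are made up for this illustration, and whether the same pattern behaves identically inside tokens_select() is an assumption that has not been checked here.

```r
library(stringi)

# Test strings written as escapes so the variation selector is visible:
# flexed biceps + U+FE0F, guitar, and a plain word.
x <- c("\U0001f4aa\ufe0f", "\U0001f3b8", "Rock")

# The anchored pattern from the reprex, and a hypothetical variant that
# tolerates an optional U+FE0F after each emoji character.
strict   <- "^\\p{Emoji_Presentation}+$"
tolerant <- "^(?:\\p{Emoji_Presentation}\\x{FE0F}?)+$"

strict_hit   <- stri_detect_regex(x, strict)
tolerant_hit <- stri_detect_regex(x, tolerant)

# Alternative: remove U+FE0F first, then reuse the strict pattern.
stripped     <- stri_replace_all_fixed(x, "\ufe0f", "")
stripped_hit <- stri_detect_regex(stripped, strict)

print(strict_hit)
print(tolerant_hit)
print(stripped_hit)
```

Stripping the selector changes the token itself, so the tolerant pattern is probably the safer choice when the original text has to be preserved.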