Open koheiw opened 10 months ago
I noticed that \\p{Emoji_Presentation} is not matching the emoji 💪️ in a test because of the variation selector "\ufe0f".
    require(quanteda.textstats)
    #> Loading required package: quanteda.textstats
    require(quanteda)
    #> Loading required package: quanteda
    #> Package version: 4.0.0
    #> Unicode version: 14.0
    #> ICU version: 70.1
    #> Parallel computing: 4 of 4 threads used.
    #> See https://quanteda.io for tutorials and examples.

    txt <- "£ € 👏 Rock on❗ 💪️🎸"
    toks <- tokens(txt)
    toks
    #> Tokens consisting of 1 document.
    #> text1 :
    #> [1] "£"    "€"    "👏"   "Rock" "on"   "❗"   "💪️"   "🎸"

    tokens_select(toks, "^\\p{Emoji_Presentation}+$", valuetype = "regex")
    #> Tokens consisting of 1 document.
    #> text1 :
    #> [1] "👏" "❗" "🎸"

    tokens_select(toks, "\\p{Emoji_Presentation}", valuetype = "regex")
    #> Tokens consisting of 1 document.
    #> text1 :
    #> [1] "👏" "❗" "💪️" "🎸"

    stringi::stri_extract_all_regex(txt, "\\p{Emoji_Presentation}")
    #> [[1]]
    #> [1] "👏" "❗" "💪" "🎸"

    stringi::stri_escape_unicode(types(toks))
    #> [1] "\\u00a3"            "\\u20ac"            "\\U0001f44f"
    #> [4] "Rock"               "on"                 "\\u2757"
    #> [7] "\\U0001f4aa\\ufe0f" "\\U0001f3b8"
Created on 2023-10-18 with reprex v2.0.2
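One possible workaround, sketched here as an assumption rather than a confirmed fix, is to let each emoji be followed by an optional U+FE0F in the anchored pattern. The example uses only stringi, rebuilds the token types from the escaped output above, and the names `strict` and `relaxed` are illustrative.

```r
library(stringi)

# Token types from the report, built from escapes so the example is encoding-safe
types <- stri_unescape_unicode(c("\\u00a3", "\\U0001f44f", "Rock", "\\u2757",
                                 "\\U0001f4aa\\ufe0f", "\\U0001f3b8"))

# Anchored pattern from the report: the whole token must be Emoji_Presentation
strict <- stri_detect_regex(types, "^\\p{Emoji_Presentation}+$")

# Assumed workaround: allow an optional variation selector U+FE0F after each emoji
relaxed <- stri_detect_regex(types, "^(\\p{Emoji_Presentation}\\x{FE0F}?)+$")

# Compare the two patterns token by token
print(data.frame(type = stri_escape_unicode(types), strict, relaxed))
```

With the ICU version shown in the report, `relaxed` should also accept the "\\U0001f4aa\\ufe0f" token that `strict` rejects. The same pattern could be passed to tokens_select() with valuetype = "regex".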