qinwf / re2r

RE2 Regular Expression in R.
https://qinwenfeng.com/re2r_doc

Match failure when LC_COLLATE is not UTF-8 #5

Open gagolews opened 8 years ago

gagolews commented 8 years ago

For example, Windows does not use a UTF-8 locale by default.

gagolews commented 8 years ago

Now the behavior is incorrect:

[gagolews@zeus tmp]$ LC_ALL="pl_PL.iso-8859-2" R

R Under development (unstable) (2016-04-14 r70486) -- "Unsuffered Consequences"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library("stringi")
> x <- stri_conv("a\u0105bc", "UTF-8", "")
> library(re2r) 
> re2_match("\u0105", x)
[1] FALSE
> re2_match(x, "\u0105")
BŁĄD: invalid UTF-8 in regexp: 
> stri_extract_all_regex(x, "\u0105")    # this is OK
[[1]]
[1] "ą"

Consider converting all input strings to UTF-8, preferably with `stringi::stri_enc_toutf8`.
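
A minimal sketch of that suggestion, assuming the conversion is done in a thin wrapper before the strings reach the regex engine. The helper `match_utf8` and its `matcher` argument are hypothetical names for illustration and are not part of re2r; `stri_detect_regex` only stands in for the RE2-based matcher so that the snippet runs without re2r installed.

```r
library(stringi)

# Hypothetical wrapper (not part of re2r): re-encode both arguments as UTF-8
# and only then pass them to a matcher function taking (pattern, string).
match_utf8 <- function(pattern, string, matcher) {
  # stri_enc_toutf8() converts strings declared as native or latin1 and
  # leaves strings already marked as UTF-8 unchanged.
  matcher(stri_enc_toutf8(pattern), stri_enc_toutf8(string))
}

# Native-encoded input, built the same way as in the report above.
x <- stri_conv("a\u0105bc", "UTF-8", "")

# Stand-in matcher (assumption): stringi's own regex engine.
# Note its argument order is (str, pattern), hence the swap.
found <- match_utf8("\u0105", x, function(p, s) stri_detect_regex(s, p))
```

Converting at the boundary keeps the matcher itself locale-agnostic: whatever `LC_ALL` is set to, the engine only ever sees UTF-8 bytes for both the pattern and the input.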