microsoft / microsoft-r-open

Microsoft R Open Source

PCRE built without Unicode property support makes jiebaR fail on MRO 3.3.1 #12

Closed: alexwwang closed this issue 7 years ago

alexwwang commented 7 years ago

When using a locally installed jiebaR in MRO (Microsoft R Open) 3.3.1, I get the errors below:

> test_worker <- worker('tag')
> test_worker <= '这是一个测试句子。'
Error in grep("(*UCP)^[^⺀- 〡-﹏a-zA-Z0-9]*$", result, perl = TRUE, :
invalid regular expression '(*UCP)^[^⺀- 〡-﹏a-zA-Z0-9]*$'
In addition: Warning message:
In grep("(*UCP)^[^⺀- 〡-﹏a-zA-Z0-9]*$", result, perl = TRUE, :
PCRE pattern compilation error
'this version of PCRE is not compiled with Unicode property support'
at '(*UCP)^[^⺀- 〡-﹏a-zA-Z0-9]*$'

Referenced from: jiebaR#43

CRAN (R-admin manual):

PCRE must be built with UTF-8 support (not the default, and checked by @command{configure}) and support for Unicode properties is assumed by some @R{} packages.

https://github.com/wch/r-source/blob/ceeebfaccc10fdf946920cef641d6efbd64bab59/doc/manual/R-admin.texi#L3289-L3302

MRO (vendored PCRE 8.37 configure.ac, where the option defaults to "no"):

# Handle --enable-unicode-properties
AC_ARG_ENABLE(unicode-properties,
              AS_HELP_STRING([--enable-unicode-properties],
                             [enable Unicode properties support (implies --enable-utf)]),
, enable_unicode_properties=no)

https://github.com/Microsoft/microsoft-r-open/blob/6d2bbb4fc9ed0b6a5212a4694de4c14dd48d25f1/vendor/pcre-8.37/configure.ac#L182-L186
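To check whether a given R build is affected without installing jiebaR, something like the following minimal sketch may help. It assumes base R's pcre_config() reports an element named "Unicode properties" (the name is taken from the R documentation as I remember it, so treat it as an assumption). The helper ucp_pattern_compiles() is a hypothetical name introduced only for this illustration.

```r
# Report how the PCRE library used by this R session was configured.
# Assumption: the result is a named logical vector that includes
# an element called "Unicode properties".
cfg <- pcre_config()
print(cfg)

# unname() keeps isTRUE() reliable on older R versions, where a named
# logical would not be identical to a bare TRUE.
has_ucp_flag <- isTRUE(unname(cfg["Unicode properties"]))

# Hypothetical helper: returns TRUE if a (*UCP) pattern compiles, FALSE if
# PCRE rejects it (as in the "invalid regular expression" error above).
ucp_pattern_compiles <- function() {
  tryCatch({
    # The compilation warning is muffled so only the error decides the result.
    suppressWarnings(grepl("(*UCP)^\\w+$", "abc", perl = TRUE))
    TRUE
  }, error = function(e) FALSE)
}

cat("Unicode properties flag:", has_ucp_flag, "\n")
cat("(*UCP) pattern compiles:", ucp_pattern_compiles(), "\n")
```

On a build like the one in this report, both values would be expected to come back FALSE. On a build configured with --enable-unicode-properties, they should be TRUE.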

nathansoz commented 7 years ago

I have fixed these flags in our prerelease build of MRO 3.3.2 and now get the following:

> test_worker <- worker('tag')
> test_worker <= '这是一个测试句子。'
     x      m     vn      n
"这是" "一个" "测试" "句子"

Does that look right?

alexwwang commented 7 years ago

Yes, exactly. Thank you for that. :)

nathansoz commented 7 years ago

Great. We will be releasing 3.3.2 in November, shortly after CRAN releases it.

alexwwang commented 7 years ago

Thanks. :)