quanteda / quanteda.textstats

Textual statistics for quanteda
GNU General Public License v3.0

textstat_summary() error in Ubuntu 16.04 #35

MarianoRico closed this issue 3 years ago

MarianoRico commented 3 years ago

Just try this:

library(quanteda)
library(quanteda.textstats)  # textstat_summary() lives in this package as of quanteda 3.0
textstat_summary(data_corpus_inaugural[10:15])

I get this error:

Error in stri_detect_regex(types_search, pattern, case_insensitive = case_insensitive) : 
  Illegal argument. (U_ILLEGAL_ARGUMENT_ERROR, context=`^\p{emoji_presentation}+$`)

I have used CRAN version 3.0.0 as well as the latest development version. Same error in both.

kbenoit commented 3 years ago

Yes, I'm seeing this too, on Ubuntu 16.04 LTS:

> library("quanteda")
Package version: 3.0.0
Unicode version: 7.0
ICU version: 55.1
Parallel computing: 2 of 2 threads used.
See https://quanteda.io for tutorials and examples.
> library("quanteda.textstats")
> textstat_summary(data_corpus_inaugural[10:15])
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'as.list': error in evaluating the argument 'x' in selecting a method for function 'which': Illegal argument. (U_ILLEGAL_ARGUMENT_ERROR, context=`^\p{emoji_presentation}+$`)

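The problem is that the Unicode (ICU) library on 16.04 is really old (ICU 55.1, Unicode 7.0, as shown above), so its regex engine rejects the `\p{emoji_presentation}` property in the pattern. On macOS 10.15.7, for instance, it's: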

Loading required package: quanteda
Package version: 3.0.0
Unicode version: 10.0
ICU version: 61.1

I'd suggest you update to Ubuntu 20.04. But we can also issue a patch so that it does not look for this emoji pattern on older versions of Unicode / ICU, once I find out when the emoji properties were introduced.
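To check which ICU and Unicode versions a given R session is actually linked against, stringi (the package that raises the error above) can report them directly. A minimal sketch, assuming stringi is installed, which it will be wherever quanteda is:

```r
library(stringi)

# stri_info() returns a list describing the ICU build that stringi uses
info <- stri_info()

# These correspond to the "Unicode version" and "ICU version" lines
# that quanteda prints on load
cat("Unicode version:", info$Unicode.version, "\n")
cat("ICU version:", info$ICU.version, "\n")
```

If the ICU version reported here is as old as the one in the Ubuntu output above, the emoji pattern is likely to fail in the same way.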

MarianoRico commented 3 years ago

Dear Kenneth,

If the problem is the emoticons, what about an argument to select which stats are returned? I propose an argument called skip. By default, skip=NULL would return everything, but we could specify something like skip=c(emoticons, exclamations) to avoid computing stats for those types.

kbenoit commented 3 years ago

This should be fixed on master now, and we will update CRAN today too. We did not implement the skipping of fields, since we want the return value to always have the same shape. But on systems with older ICU versions, it now returns NA for the emoji counts.
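For anyone who needs a similar fallback in their own code, the general idea can be sketched as below. This is not the actual quanteda.textstats patch, just an illustration of the approach: probe whether the ICU regex engine accepts the emoji property, and return NA when it does not. The helper names `supports_emoji_property()` and `count_emoji_types()` are made up for this example.

```r
library(stringi)

# Hypothetical helper: TRUE if the local ICU regex engine accepts
# the \p{emoji_presentation} property, FALSE if it raises an error
supports_emoji_property <- function() {
  tryCatch({
    stri_detect_regex("a", "^\\p{emoji_presentation}+$")
    TRUE
  }, error = function(e) FALSE)
}

# Hypothetical helper: count the types made up entirely of emoji,
# or return NA when the property is unsupported, so the result
# keeps the same shape on every system
count_emoji_types <- function(types) {
  if (!supports_emoji_property()) {
    return(NA_integer_)
  }
  sum(stri_detect_regex(types, "^\\p{emoji_presentation}+$"))
}

count_emoji_types(c("hello", "\U0001F600", "!"))
```

Probing the pattern directly avoids hard-coding a minimum ICU version, which is useful when it is not known exactly when the property was introduced.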