ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

`tokenize_ngrams` handles the colon (:) inconsistently across platforms #32

Closed everdark closed 7 years ago

everdark commented 7 years ago

Hi,

I recently came across an issue that confuses me. On my MacBook I get this result:

> library(tokenizers)
> tokenize_ngrams("name:kyle", n=1)
[[1]]
[1] "name" "kyle"

However, on my Ubuntu machine the result is:

> library(tokenizers)
> tokenize_ngrams("name:kyle", n=1)
[[1]]
[1] "name:kyle"

If there is a space between the colon and the adjacent words, both machines produce two separate tokens.

The sessionInfo() output for the two machines is as follows:

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.3

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] nvimcom_0.9-26    colorout_1.1-2    magrittr_1.5      data.table_1.10.4

loaded via a namespace (and not attached):
[1] tools_3.3.2

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=C                 LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Am I missing something about this difference? Is it a locale issue? Any help is appreciated.

lmullen commented 7 years ago

Tokenization functions all call functions from the stringi package, which makes some determinations about word boundaries based on the locale. You might try using stringi::stri_locale_set() to set the locale on both machines to be the same.
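As a rough way to compare the two machines, the sketch below prints the settings that could plausibly affect word-boundary detection and then pins the stringi default locale. The locale name "en_US" is only an illustrative choice, and including the ICU version in the report is an assumption about what might be worth comparing, not something established in this thread.

```r
library(stringi)

# Collect the settings worth comparing between the two machines.
locale_report <- list(
  r_ctype        = Sys.getlocale("LC_CTYPE"),  # locale R uses for character handling
  stringi_locale = stri_locale_get(),          # default locale stringi hands to ICU
  icu_version    = stri_info()$ICU.version     # ICU version stringi is using
)
print(locale_report)

# Pin the stringi default locale for this session.
# "en_US" is an illustrative choice, not a recommendation from the thread.
stri_locale_set("en_US")
current_locale <- stri_locale_get()
print(current_locale)
```

Running the same script on both machines and diffing the printed report should show whether the locales actually differ.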

lmullen commented 7 years ago

I've looked into this more closely. I can't reproduce the problem on an Ubuntu 16.04 machine, because the locale is the same as on my Mac OS X machine. In any case, I'm sure that the reason for the difference is the locale setting, which stringi picks up. You should use either Sys.setlocale() or stringi::stri_locale_set() to ensure that you are using the same locales on all machines.
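A minimal sketch of that advice follows. It aligns R's own locale and then runs the word-boundary split with an explicitly named locale, so the result does not depend on the session default. It assumes that the tokenizer ultimately relies on stringi's word-boundary splitting; `split_words` is a hypothetical helper, and the locale names "en_US.UTF-8" and "en_US" are assumptions that may need adjusting for the systems involved.

```r
library(stringi)

# Align R's character-type locale. The name "en_US.UTF-8" is an assumption:
# it must be installed on the system, otherwise Sys.setlocale() warns and
# returns an empty string.
Sys.setlocale("LC_CTYPE", "en_US.UTF-8")

# Hypothetical helper: split text at word boundaries using an explicitly
# named locale instead of the session default.
split_words <- function(x, locale = "en_US") {
  stri_split_boundaries(x, type = "word", skip_word_none = TRUE, locale = locale)
}

tokens <- split_words("name:kyle")
print(tokens)
```

If the printed tokens still differ between machines with the locale fixed this way, the locale alone would not explain the discrepancy and other differences between the installations would be worth checking.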