ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers
Other
184 stars 25 forks source link

Specify encoding in C++ code for skip_ngrams #58

Closed patperry closed 6 years ago

patperry commented 6 years ago

Otherwise it defaults to "unknown", which is correct on MacOS and Linux (UTF-8) but incorrect on Windows. The following fails on Windows:

tokenize_ngrams("César Moreira Nuñez", n = 1)

see https://stackoverflow.com/questions/47715807/does-tidytextunnest-tokens-works-with-spanish-characters/47758574#47758574 .

The line output[i] = holding[i]; at https://github.com/ropensci/tokenizers/blob/master/src/skip_ngrams.cpp#L47 seems suspect to me. It should be something like

output[i] = String(holding[i], CE_UTF8);

(But I'm not an RCpp user so that may not work or there may be a better way to declare the encoding.)

juliasilge commented 6 years ago

This affects users on Windows trying to parse text in, for example, Russian, Estonian (juliasilge/tidytext#80), and Polish (juliasilge/tidytext#63).

lmullen commented 6 years ago

Thanks for the bug report.

@juliasilge Has anyone given you a reproducible example? Or has anyone indicated that they are willing to test this out if I can find a fix? I don't have a Windows environment at the moment, and without someone to test it I will have to defer this till the next time I'm making major revisions to tokenizers.

juliasilge commented 6 years ago

The issues on tidytext linked above (Russian, Estonian, Polish) all contain reproducible examples but of course they all work on my local machines, which are OSX and Linux. @andreypanchenko (who submitted the Russian issue yesterday) would you be willing to test out possible fixes in the near future here for the tokenizers repo?

patperry commented 6 years ago

There's a reprex in my original report:

tokenize_ngrams("César Moreira Nuñez", n = 1)
patperry commented 6 years ago

Here are the results. Sorry for the screenshot. I run windows in a VM and I haven't figured out how to copy-paste.

screen shot 2018-03-02 at 1 54 53 pm
patperry commented 6 years ago

You should be able to test the fix on Mac or Linux. The expected behavior is to see the following:

> Encoding(tokenize_ngrams("César Moreira Nuñez", n = 1)[[1]])
[1] "UTF-8" "unknown" "UTF-8"
kbenoit commented 6 years ago

We had this problem in early stages of quanteda::tokens_ngrams() but fixed it in the to/from C++ travel. @koheiw knows better than I, but I think it was in https://github.com/quanteda/quanteda/commit/1b3a03beb76c4328446b553ce1b0d60d1b230dd3#diff-12925d3bcd1fefdaf12f5e7e3a1e41e1.

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tokenizers_0.1.4 spacyr_0.9.6     quanteda_1.0.6  

> tokenizers::tokenize_ngrams("César Moreira Nuñez", n = 1)
[[1]]
[1] "césar"  "moreira" "nuñez" 

> quanteda::tokens("César Moreira Nuñez", ngrams = 1)
tokens from 1 document.
text1 :
[1] "César"   "Moreira" "Nuñez"  

> quanteda::tokens("César Moreira Nuñez", ngrams = 2)
tokens from 1 document.
text1 :
[1] "César_Moreira" "Moreira_Nuñez"
koheiw commented 6 years ago

If you do not set encoding of elements of character vectors in R or in C++, non-ASCII characters does not appear on Windows properly.

> sessionInfo()$platform
[1] "x86_64-w64-mingw32/x64 (64-bit)"
> 
> tokenize_ngrams("César Moreira Nuñez", n = 2)
[[1]]
[1] "césar moreira" "moreira nuñez"

> tokens("César Moreira Nuñez", ngram = 2)
tokens from 1 document.
text1 :
[1] "César_Moreira" "Moreira_Nuñez"

My solution is https://github.com/quanteda/quanteda/blob/fdd82aa9474ba37c2d4845f96457ddc084ea562d/src/recompile.h#L39-L47

lmullen commented 6 years ago

Thank you all for these examples and code. I'll work on this.

patperry commented 6 years ago

As I noted in my original comment, this is probably a 1-line fix. Change output[i] = holding[i]; to something like output[i] = String(holding[i], CE_UTF8); .

lmullen commented 6 years ago

Thank you all, especially @patperry. This should be fixed now.

@juliasilge I'm guessing that most people affected by this are tidytext users. I'm hoping to push a new release (v0.2.0) to CRAN soonish so they can get this bug fix without needing to install the dev version. Feel free to let me know if you need me to do anything re tidytext.

meczupevi commented 6 years ago

@juliasilge @lmullen @patperry Yes, I've just tested and it worked for Turkish (on Windows machine). Thank you so much!