patperry closed this issue 6 years ago.
This affects users on Windows trying to parse text in, for example, Russian, Estonian (juliasilge/tidytext#80), and Polish (juliasilge/tidytext#63).
Thanks for the bug report.
@juliasilge Has anyone given you a reproducible example? Or has anyone indicated that they are willing to test this out if I can find a fix? I don't have a Windows environment at the moment, and without someone to test it I will have to defer this till the next time I'm making major revisions to tokenizers.
The issues on tidytext linked above (Russian, Estonian, Polish) all contain reproducible examples, but of course they all work on my local machines, which run macOS and Linux. @andreypanchenko (who submitted the Russian issue yesterday), would you be willing to test out possible fixes here for the tokenizers repo in the near future?
There's a reprex in my original report:
tokenize_ngrams("César Moreira Nuñez", n = 1)
Here are the results. Sorry for the screenshot; I run Windows in a VM and haven't figured out how to copy and paste out of it.
You should be able to test the fix on Mac or Linux. The expected behavior is to see the following:
> Encoding(tokenize_ngrams("César Moreira Nuñez", n = 1)[[1]])
[1] "UTF-8" "unknown" "UTF-8"
We had this problem in the early stages of quanteda::tokens_ngrams(), but fixed it in the conversion to and from C++. @koheiw knows better than I do, but I think it was in https://github.com/quanteda/quanteda/commit/1b3a03beb76c4328446b553ce1b0d60d1b230dd3#diff-12925d3bcd1fefdaf12f5e7e3a1e41e1.
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.3
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tokenizers_0.1.4 spacyr_0.9.6 quanteda_1.0.6
> tokenizers::tokenize_ngrams("César Moreira Nuñez", n = 1)
[[1]]
[1] "césar" "moreira" "nuñez"
> quanteda::tokens("César Moreira Nuñez", ngrams = 1)
tokens from 1 document.
text1 :
[1] "César" "Moreira" "Nuñez"
> quanteda::tokens("César Moreira Nuñez", ngrams = 2)
tokens from 1 document.
text1 :
[1] "César_Moreira" "Moreira_Nuñez"
If you do not set the encoding of the elements of character vectors, either in R or in C++, non-ASCII characters do not display properly on Windows.
> sessionInfo()$platform
[1] "x86_64-w64-mingw32/x64 (64-bit)"
>
> tokenize_ngrams("César Moreira Nuñez", n = 2)
[[1]]
[1] "césar moreira" "moreira nuñez"
> tokens("César Moreira Nuñez", ngram = 2)
tokens from 1 document.
text1 :
[1] "César_Moreira" "Moreira_Nuñez"
My solution is https://github.com/quanteda/quanteda/blob/fdd82aa9474ba37c2d4845f96457ddc084ea562d/src/recompile.h#L39-L47
Thank you all for these examples and code. I'll work on this.
As I noted in my original comment, this is probably a one-line fix. Change

output[i] = holding[i];

to something like

output[i] = String(holding[i], CE_UTF8);
Thank you all, especially @patperry. This should be fixed now.
@juliasilge I'm guessing that most people affected by this are tidytext users. I'm hoping to push a new release (v0.2.0) to CRAN soonish so they can get this bug fix without needing to install the dev version. Feel free to let me know if you need me to do anything re tidytext.
@juliasilge @lmullen @patperry Yes, I've just tested and it worked for Turkish (on Windows machine). Thank you so much!
Otherwise it defaults to "unknown", which is correct on macOS and Linux (UTF-8) but incorrect on Windows. The following fails on Windows:
see https://stackoverflow.com/questions/47715807/does-tidytextunnest-tokens-works-with-spanish-characters/47758574#47758574 .
The line

output[i] = holding[i];

at https://github.com/ropensci/tokenizers/blob/master/src/skip_ngrams.cpp#L47 seems suspect to me. It should be something like

output[i] = String(holding[i], CE_UTF8);

(But I'm not an Rcpp user, so that may not work, or there may be a better way to declare the encoding.)