Closed juliasilge closed 8 years ago
Please provide a small reproducible example. On May 26, 2016, 5:51 PM, "Julia Silge" notifications@github.com wrote:
Thank you so much for your work on this amazing package! It has been so great to work with.
In working on the tidytext https://github.com/juliasilge/tidytext package with @dgrtwo https://github.com/dgrtwo, I wrote some tests that involved calling tokenize_ngrams and tokenize_skip_ngrams. My tests as I originally wrote them could pass on Travis and locally on OS X, but not on AppVeyor. Someone helpfully reproduced the same behavior locally on their Windows machine. Specifically, it looks like the end of the input text is causing the problem, i.e., if you are tokenizing into n-grams of 2, what do you do with the last word? On Windows, it looks like it is putting that last word in its own token. For tokenize_skip_ngrams with n = 4, k = 2, we end up with 3 extra tokens.
In trying to find out why the behavior is different on Windows vs. other OSs, I ended up looking at your test-ngrams.R and noticed that you only test the head of the output. Have you looked at the tail of the output here and tested it?
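To pin down what "correct" looks like here, a short Python sketch may help (an illustration of the expected behavior only, not the tokenizers implementation): a well-behaved n-gram tokenizer over w words yields w - n + 1 tokens, so the last word ends the final bigram but never appears as a token on its own.

```python
# Sketch of expected n-gram behavior (hypothetical helper, not from the
# tokenizers package): every token has exactly n words, so the last word
# of the input is never emitted as a stray one-word token.
def ngrams(text, n=2):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("hope is the thing with feathers"))
# ['hope is', 'is the', 'the thing', 'thing with', 'with feathers']
```

Under this definition, a trailing one-word token like the one observed on Windows should be impossible, which is what makes the AppVeyor failures surprising.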
Thanks for the report @juliasilge. I suspect that the problem has to do with how stringi tokenizes words on different OSes: I've run into this before.
Here is the output from the tests for tidytext run on Windows: https://gist.github.com/jhollist/cf3956e3ee38311eae5245d19836301c The relevant failures are numbers 2 and 3.
```r
emily <- paste("Hope is the thing with feathers",
               "That perches in the soul",
               "And sings the tune without the words",
               "And never stops at all ",
               "And sweetest in the Gale is heard ",
               "And sore must be the storm ",
               "That could abash the little Bird",
               "That kept so many warm ",
               "I’ve heard it in the chillest land ",
               "And on the strangest Sea ",
               "Yet never in Extremity,",
               "It asked a crumb of me.")

tokenize_ngrams(emily, n = 2)
tokenize_skip_ngrams(emily, n = 5, k = 2)
```
When I run this locally on OS X, it behaves exactly as expected, but based on the test I wrote that failed on AppVeyor, I am guessing that if you run this on Windows, the last 1-3 outputs will not be what we want.
I changed my tests so they pass, obviously :stuck_out_tongue_winking_eye:, but I am just trying to figure out the source of the issue.
Thanks. I'll look into this next week when I am back at the office and have my Windows VM.
Lincoln Mullen, http://lincolnmullen.com Assistant Professor, Department of History & Art History George Mason University
@juliasilge It turns out that nothing makes me procrastinate quite so much as installing Windows. :-) It actually turned out to be fairly painless.
Windows does generally return more tokens than *nix-based platforms do, but the problem is not with single words at the end of the vector. Rather, the problem comes from splitting words that contain internal punctuation such as apostrophes, when those apostrophes are not ASCII.
Using stringi, since that's what tokenizers uses to get the word tokens:
On Mac:

```r
stri_split_boundaries("I’ve heard it", type = "word", skip_word_none = TRUE)[[1]]
# [1] "I’ve"  "heard" "it"

Sys.getlocale("LC_ALL")
# [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
```

On Windows:

```r
stri_split_boundaries("I’ve heard it", type = "word", skip_word_none = TRUE)[[1]]
# [1] "I"     "ve"    "heard" "it"

Sys.getlocale("LC_ALL")
# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
```
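The difference above comes down to the apostrophe character itself. A Python sketch (not stringi, just the same idea) shows that the apostrophe in "I’ve" is the curly U+2019, not the ASCII U+0027, so any splitter that only treats the ASCII apostrophe as word-internal punctuation will break the word in two, exactly as seen on Windows:

```python
# Illustration of the splitting difference (a regex sketch, not the ICU
# boundary analysis stringi actually uses): the curly apostrophe U+2019
# is not matched by an ASCII-only word pattern, so "I’ve" splits apart.
import re

s = "I’ve heard it"
print(hex(ord("’")))                   # 0x2019, RIGHT SINGLE QUOTATION MARK
print(re.findall(r"[\w']+", s))        # ['I', 've', 'heard', 'it']
print(re.findall(r"[\w'\u2019]+", s))  # ['I’ve', 'heard', 'it']
```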
I've tried a number of different options, such as setting the encoding of the string in R and changing the locale both in the R environment and with stri_opts_brkiter(), and I can't get Windows to do the correct thing. My assumption is that Windows is doing the right thing for its locale, using Windows CP-1252 instead of UTF-8. And since Windows CP-1252 has already wasted quite a bit of my life at various times and places, I'm going to close this issue, which I think would have to be fixed in stringi anyway.
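As a side note on the encoding angle, a Python sketch (an illustration only, not the R code path) shows that U+2019 does exist in CP-1252 (as byte 0x92), so the character can survive a pure CP-1252 round trip; the familiar damage appears when UTF-8 bytes get reinterpreted as CP-1252, which turns one curly apostrophe into three mojibake characters:

```python
# CP-1252 vs UTF-8 and the curly apostrophe (Python illustration, not
# what R does internally on Windows).
s = "I’ve"
print(s.encode("cp1252"))                  # b'I\x92ve' -- U+2019 is 0x92 in CP-1252
print(s.encode("utf-8").decode("cp1252"))  # 'Iâ€™ve' -- UTF-8 bytes misread as CP-1252
```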
If you have better information, please feel free to reopen the issue.
@lmullen interesting information, thanks for sharing.
I finally got a chance to sit down and think about this again. Thanks so much for discovering the root of my puzzlement, @lmullen. I'll, at the very least, be able to keep this in mind when using n-grams and/or writing tests.
I might have fixed this in the development version of corpus, but I haven't done much testing (text_tokens gives the same results on both platforms with @lmullen's sample text); see https://github.com/patperry/r-corpus/issues/5. The basic problem is that enc2utf8 is buggy on Windows.