ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Any tests for ngram/skip_ngram for the tail of the output? #17

Closed. juliasilge closed this issue 8 years ago.

juliasilge commented 8 years ago

Thank you so much for your work on this amazing package! It has been so great to work with.

In working on the tidytext package with @dgrtwo, I wrote some tests that involved calling tokenize_ngrams and tokenize_skip_ngrams. My tests as I originally wrote them passed on Travis and locally on OS X but not on AppVeyor. Someone helpfully reproduced the same behavior locally on their Windows machine. Specifically, it looks like the end of the input text is causing the problem, i.e., if you are tokenizing into n-grams of 2, what do you do with the last word? On Windows, it looks like that last word is being put in its own token. For tokenize_skip_ngrams with n = 4, k = 2, we end up with 3 extra tokens.
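To illustrate with a tiny example (hypothetical input; the expected bigrams below follow the usual sliding-window definition, and the trailing token is the behavior I am seeing reported on Windows, not something I have verified myself):

library(tokenizers)

# Expected on OS X/Linux: "one two" "two three" "three four" "four five"
# Reported on Windows: the same bigrams plus a trailing "five" token
tokenize_ngrams("one two three four five", n = 2)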

In trying to find out why the behavior is different on Windows vs. other OSes, I ended up looking at your test-ngrams.R and noticed that you only test the head of the output. Have you looked at and tested the tail of the output?

dselivanov commented 8 years ago

Please provide a small reproducible example.


lmullen commented 8 years ago

Thanks for the report, @juliasilge. I suspect the problem has to do with how stringi tokenizes words on different OSes; I've run into this before.

juliasilge commented 8 years ago

Here is the output from the tests for tidytext run on Windows: https://gist.github.com/jhollist/cf3956e3ee38311eae5245d19836301c The relevant failures are numbers 2 and 3.

library(tokenizers)

emily <- paste("Hope is the thing with feathers",
                           "That perches in the soul",
                           "And sings the tune without the words",
                           "And never stops at all ",
                           "And sweetest in the Gale is heard ",
                           "And sore must be the storm ",
                           "That could abash the little Bird",
                           "That kept so many warm ",
                           "I’ve heard it in the chillest land ",
                           "And on the strangest Sea ",
                           "Yet never in Extremity,",
                           "It asked a crumb of me.")

tokenize_ngrams(emily, n = 2)
tokenize_skip_ngrams(emily, n = 5, k = 2)
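
To look at the part of the output that my failing tests were checking, inspecting the tail of the token vectors is the relevant piece (a sketch; the trailing tokens are exactly what differs across platforms):

# The platforms agree on the head of the output; the disagreement is at the end
tail(tokenize_ngrams(emily, n = 2)[[1]])
tail(tokenize_skip_ngrams(emily, n = 5, k = 2)[[1]])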

When I run this locally on OS X, it behaves exactly as expected, but based on the test I wrote that failed on AppVeyor, I am guessing that if you run this on Windows, the last 1-3 outputs will not be what we want.

I changed my tests so that they pass, obviously :stuck_out_tongue_winking_eye:, but I am still trying to figure out the source of the issue.

lmullen commented 8 years ago

Thanks. I'll look into this next week when I am back at the office and have my Windows VM.


lmullen commented 8 years ago

@juliasilge It turns out that nothing makes me procrastinate quite so much as installing Windows. :-) In the end it was fairly painless.

Windows does generally return more tokens than *nix-based platforms do, but the problem is not single words at the end of the vector. Rather, the problem comes from splitting words that contain internal punctuation such as apostrophes, when those apostrophes are not ASCII.

Using stringi, since that's what tokenizers uses to get the word tokens:

On Mac:

stri_split_boundaries("I’ve heard it", type = "word", skip_word_none = TRUE)[[1]]
# [1] "I’ve"  "heard" "it"  

Sys.getlocale("LC_ALL")
# [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

On Windows:

stri_split_boundaries("I’ve heard it", type = "word", skip_word_none = TRUE)[[1]]
# [1] "I"     "ve"    "heard" "it"

Sys.getlocale("LC_ALL")
# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

I've tried a number of different options, such as setting the encoding of the string in R and changing the locale both in the R environment and with stri_opts_brkiter(), but I can't get Windows to do the correct thing. My assumption is that Windows is doing the right thing for its locale, which uses Windows CP 1252 instead of UTF-8. And since Windows CP 1252 has already wasted quite a bit of my life at various times and places, I'm going to close this issue, which I think would have to be fixed in stringi anyway.
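For reference, the locale-based attempt looked roughly like this (a sketch; the locale name is illustrative, and on Windows the result is still the four-token split shown above):

library(stringi)

# Pass explicit break-iterator options instead of relying on the OS locale;
# on Windows this still splits "I’ve" into "I" and "ve"
stri_split_boundaries("I’ve heard it",
                      opts_brkiter = stri_opts_brkiter(type = "word",
                                                       skip_word_none = TRUE,
                                                       locale = "en_US"))[[1]]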

If you have better information, please feel free to reopen the issue.

dselivanov commented 8 years ago

@lmullen Interesting information, thanks for sharing.

juliasilge commented 8 years ago

I finally got a chance to sit down and think about this again. Thanks so much for discovering the root of my puzzlement, @lmullen. At the very least, I'll be able to keep this in mind when using n-grams and writing tests.

patperry commented 7 years ago

I might have fixed this in the development version of corpus, but I haven't done much testing (text_tokens gives the same results on both platforms with @lmullen's sample text); see https://github.com/patperry/r-corpus/issues/5 .

The basic problem is that enc2utf8 is buggy on Windows.
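
A sketch of the comparison mentioned above, using text_tokens from the development version of corpus (I am not reproducing the output here):

library(corpus)

# With the development version, this should give the same tokens on
# Windows as on *nix for the sample text above
text_tokens("I’ve heard it in the chillest land")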