ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Split into words in tokenize_tweets even when strip_punct is set to TRUE #78

Closed hideaki closed 3 years ago

hideaki commented 3 years ago

tokenize_tweets did not split the input into words when strip_punct was set to TRUE. This PR changes it to split the input into words in that case as well.

Because of this, non-space-separated languages such as Japanese were not tokenized into words at all. I have added a test case covering this.

It also fixes #68.
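
For illustration, here is a minimal sketch of how such a test could look for tokenize_tweets. The expected segmentation is an assumption based on ICU-style word boundaries and mirrors the quanteda example below; the actual test added in this PR may differ.

library("tokenizers")
library("testthat")

test_that("tokenize_tweets splits non-space-separated text when strip_punct = TRUE", {
  # Input is Japanese "today is also good weather"; the expected split into
  # ("today" / "also" / "good" / "weather") assumes ICU word boundaries.
  expect_identical(
    tokenize_tweets("\u4ECA\u65E5\u3082\u3088\u3044\u5929\u6C17\u3002",
      strip_punct = TRUE
    )[[1]],
    c("\u4ECA\u65E5", "\u3082", "\u3088\u3044", "\u5929\u6C17")
  )
})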

kbenoit commented 3 years ago

Thanks @hideaki !

Note this also works out of the box using the newest quanteda tokeniser.

library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("testthat")

test_that("tokenizing non-space-separated language works", {
  expect_identical(
    as.character(tokens("\u4ECA\u65E5\u3082\u3088\u3044\u5929\u6C17\u3002",
      remove_punct = TRUE
    )),
    c("\u4ECA\u65E5", "\u3082", "\u3088\u3044", "\u5929\u6C17")
  )
})
## Test passed 🎉

lmullen commented 3 years ago

Thanks, @hideaki. I've merged this into master.