Thanks, @hideaki! Note this also works out of the box with the newest quanteda tokeniser.
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("testthat")
test_that("tokenizing non-space-separated language works", {
  expect_identical(
    as.character(tokens("\u4ECA\u65E5\u3082\u3088\u3044\u5929\u6C17\u3002",
      remove_punct = TRUE
    )),
    c("\u4ECA\u65E5", "\u3082", "\u3088\u3044", "\u5929\u6C17")
  )
})
## Test passed 🎉
Thanks, @hideaki. I've merged this into master.
tokenize_tweets did not split the input into words when strip_punct was set to TRUE. This PR changes it to split the input into words in that case as well.
Because of this issue, non-space-separated languages such as Japanese were not tokenized into words at all. I added a test case for it.
It also fixes #68.
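To illustrate the fixed behaviour, here is a minimal sketch using tokenize_tweets from the tokenizers package with this PR applied. The exact token boundaries depend on the ICU word-break rules shipped with your stringi build, so the output shown is only indicative.
library("tokenizers")
# Same Japanese string as in the test above ("Nice weather again today."):
txt <- "\u4ECA\u65E5\u3082\u3088\u3044\u5929\u6C17\u3002"
# Before this PR, strip_punct = TRUE returned the input unsplit; with the
# fix, the text is segmented into words and the punctuation is dropped.
tokenize_tweets(txt, strip_punct = TRUE)
## Indicative output:
## [[1]]
## [1] "\u4ECA\u65E5" "\u3082" "\u3088\u3044" "\u5929\u6C17"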