kbenoit opened this issue 7 years ago
While we (i.e. Wouter van Atteveldt and I) haven't really looked into Chinese, Japanese, and Korean texts yet, I would love to see how other packages currently deal with this. For AmCAT (non-R software), which uses Elasticsearch, we were recently looking into the icu_tokenizer, which seemed to do well (though it's hard for me to validate, not speaking the languages and all).
I therefore suspect that stringi (which uses ICU) should work well too, but I would appreciate hearing from people who have actually tried this.
For tokenizing words and sentences in stringi, take a look at the stri_split_boundaries function (among others). I don't speak any of C-J-K, so feedback is needed.
Actually our tests have shown that stri_split_boundaries() (which is used by both tokenizers and quanteda) works very well for Japanese and Chinese word segmentation. I was thinking that a "CJK" roundtable would demonstrate some of that, as well as identify other challenges that are perhaps more difficult to solve.
I'd be glad to modify the tokenizers package, or to write new documentation, so that it can work with non-Western languages. I'd just need to collaborate with someone who knows the relevant language.
What if we set out a series of test texts in Chinese, Japanese, and Korean for testing by various packages? The challenge could be, as applicable: