kbenoit opened this issue 7 years ago
While we (i.e. Wouter van Atteveldt and I) haven't really looked into Chinese, Japanese, and Korean texts yet, I would love to see how other packages currently deal with this. For AmCAT (non-R software), which uses Elasticsearch, we were recently looking into the icu_tokenizer, which seemed to do well (though it's hard for me to validate, not speaking the languages and all).
I therefore suspect that stringi (which uses ICU) should work well too, but I would appreciate hearing from people who have actually tried this.
For tokenizing words and sentences in stringi, take a look at the stri_split_boundaries function (among others). I don't speak any of C-J-K, so feedback is needed.
Actually our tests have shown that stri_split_boundaries() (which is used by both tokenizers and quanteda) works very well for Japanese and Chinese word segmentation. I was thinking that a "CJK" roundtable would demonstrate some of that, as well as identify other challenges that are perhaps more difficult to solve.
I'd be glad to modify the tokenizers package, or to write new documentation, so that it can work with non-Western languages. I'd just need to collaborate with someone who knows the relevant language.
What if we set out a series of test texts in Chinese, Japanese, and Korean for testing by various packages? The challenge could be, as applicable: