Closed hope-data-science closed 4 years ago
@hope-data-science I would not rely on tokenizers to split non-Western languages without verifying that it is doing so correctly. You can read the source code for the tokenizers to see how it works under the hood. It mostly uses the stringi package, which has strong Unicode support. I'm guessing that is why it works.
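For the curious, the reason it works at all can be seen directly at the stringi level: ICU's word break iterator (which stringi exposes) uses dictionary-based segmentation for Chinese, Japanese, and Thai rather than splitting on whitespace. A minimal R sketch, assuming stringi is installed (the sample sentence is just an illustration):

```r
library(stringi)

# ICU's word BreakIterator applies dictionary-based segmentation for
# CJK text, so boundaries fall between words rather than characters.
# skip_word_none = TRUE drops the whitespace/punctuation-only pieces.
stri_split_boundaries("我喜欢自然语言处理", type = "word", skip_word_none = TRUE)
```

Whether the resulting word boundaries are good enough depends on ICU's built-in dictionary, which is why verifying the output on your own text is still advisable.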
If you would like to send a pull request improving support you are welcome to do so.
I have used tokenize_words to split Chinese text, and it works to some extent, so I would like to know why it works. Also, is there any way to supply a custom dictionary to improve its performance?
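As far as I know, the ICU dictionaries used by stringi are not user-extensible from R, so a common alternative for dictionary-backed Chinese segmentation is the jiebaR package, which accepts a plain-text user dictionary. A hedged sketch (the dictionary path `user.dict.utf8` is a placeholder, not a file this repo provides):

```r
library(jiebaR)

# Build a segmenter; 'user' points to a custom dictionary file with
# one term per line (the path here is hypothetical).
wk <- worker(user = "user.dict.utf8")

# Segment a sentence with the custom vocabulary taken into account.
segment("我喜欢自然语言处理", wk)
```

This is outside the tokenizers package itself, but its output can be fed into the same downstream text-analysis pipeline.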