Closed hope-data-science closed 4 years ago
@hope-data-science I would not rely on tokenizers to split non-Western languages without verifying that it is doing so correctly. You can read the source code for the tokenizers to see how it works under the hood. It mostly uses the stringi package, which has strong Unicode support. I'm guessing that is why it works.
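For the curious, the reason it works at all can be seen directly at the stringi level: ICU's word break iterator (which stringi exposes) uses dictionary-based segmentation for Chinese, Japanese, and Thai rather than splitting on whitespace. A minimal R sketch, assuming stringi is installed (the sample sentence is just an illustration):

```r
library(stringi)

# ICU's word BreakIterator applies dictionary-based segmentation for
# CJK text, so boundaries fall between words rather than characters.
# skip_word_none = TRUE drops the whitespace/punctuation-only pieces.
stri_split_boundaries("我喜欢自然语言处理", type = "word", skip_word_none = TRUE)
```

Whether the resulting word boundaries are good enough depends on ICU's built-in dictionary, which is why verifying the output on your own text is still advisable.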
If you would like to send a pull request improving support you are welcome to do so.
I have used tokenize_words to split Chinese text, and it works to some extent, so I would like to know why it works. Also, is there any way to supply a custom dictionary to improve its performance?
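As far as I know, the ICU dictionaries used by stringi are not user-extensible from R, so a common alternative for dictionary-backed Chinese segmentation is the jiebaR package, which accepts a plain-text user dictionary. A hedged sketch (the dictionary path `user.dict.utf8` is a placeholder, not a file this repo provides):

```r
library(jiebaR)

# Build a segmenter; 'user' points to a custom dictionary file with
# one term per line (the path here is hypothetical).
wk <- worker(user = "user.dict.utf8")

# Segment a sentence with the custom vocabulary taken into account.
segment("我喜欢自然语言处理", wk)
```

This is outside the tokenizers package itself, but its output can be fed into the same downstream text-analysis pipeline.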