Tokenize Chinese on characters #753

Open johnwdubois opened 3 years ago

johnwdubois commented 3 years ago

Background

  1. In Chinese text, characters (not spaces) provide an approximate solution for word tokenization.
    That is, every Chinese character can be treated as if it were a token (or word). While this does not produce an ideal tokenization, it is better than the available alternatives (short of using NLP algorithms).
  2. Tokenizing each character as a "word" is necessary to let users recognize resonance between characters, even when those characters are part of a compound word, which is common in Chinese.

What to do

  1. Tokens. When importing Chinese text data into Rezonator, tokenize on characters (not spaces). That is, treat every Chinese character as a token (or word); see the sketch after this list.
  2. Units. To identify Units in Chinese text, use punctuation. To segment these punctuation-delimited Units, split at the following Unicode characters (hex values), which represent common Chinese punctuation:
    • 0x3001 (、 ideographic comma)
    • 0x3002 (。 ideographic full stop)
    • 0xFF0C (， fullwidth comma)
    • 0xFF1A (： fullwidth colon)
    • 0xFF1B (； fullwidth semicolon)
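
A minimal sketch of the intended import behavior (in Python for illustration; the function name and the choice to drop the delimiter characters are assumptions, not Rezonator's actual code):

```python
# Unit-final punctuation for Chinese, per the list above.
UNIT_DELIMITERS = {
    "\u3001",  # 、 ideographic comma
    "\u3002",  # 。 ideographic full stop
    "\uff0c",  # ， fullwidth comma
    "\uff1a",  # ： fullwidth colon
    "\uff1b",  # ； fullwidth semicolon
}

def tokenize_chinese(text: str) -> list[list[str]]:
    """Split text into Units (lists of tokens), one token per character.

    A delimiter character closes the current Unit; every other
    non-space character becomes its own token. Dropping the delimiter
    itself is an assumption, not something this issue specifies.
    """
    units: list[list[str]] = []
    current: list[str] = []
    for char in text:
        if char in UNIT_DELIMITERS:
            if current:
                units.append(current)
                current = []
        elif not char.isspace():
            current.append(char)  # one character = one token
    if current:
        units.append(current)
    return units

# Two Units, each tokenized character by character:
print(tokenize_chinese("我喜欢你，你喜欢我。"))
# [['我', '喜', '欢', '你'], ['你', '喜', '欢', '我']]
```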

Additional context

If a Chinese text contains words written in Latin script, these should be tokenized based on spaces.
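
A sketch of one way to handle such mixed-script text, extending the idea above with a regular expression that keeps Latin-script runs intact (the pattern, including grouping digits with letters, is an assumption):

```python
import re

# One token per Latin-script run, one token per remaining non-space character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|\S")

def tokenize_mixed(text: str) -> list[str]:
    """Tokenize Chinese per character and Latin-script words per space-delimited run."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_mixed("我在 Google 工作"))
# ['我', '在', 'Google', '工', '作']
```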

hopesu21 commented 3 years ago

Chinese, Japanese, and anything else with CJK characters are tokenized per character. This should work best with Plain text, song, and verse. Testing in progress.
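
Deciding which characters get per-character tokenization presumably requires a CJK test; here is a sketch using some of the major CJK-related Unicode blocks (the block list is illustrative, not exhaustive):

```python
# Common CJK-related Unicode blocks (not exhaustive; rarer ideograph
# extensions are omitted here).
CJK_RANGES = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
]

def is_cjk(char: str) -> bool:
    """True if the character falls in one of the CJK blocks above."""
    code = ord(char)
    return any(lo <= code <= hi for lo, hi in CJK_RANGES)

print(is_cjk("汉"), is_cjk("a"))  # True False
```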