In Chinese text, characters (not spaces) provide an approximate solution for word tokenization.
That is, every Chinese character can be treated as a token (or word). While this does not produce an ideal tokenization, it is better than the available alternatives (short of using NLP algorithms).
Tokenizing each character as a "word" lets users recognize resonance between characters even when those characters are part of a compound word, as frequently occurs in Chinese.
What to do
Tokens. When importing Chinese text data into Rezonator, tokenize by character rather than by space: every Chinese character is treated as a token (or word).
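As a concrete illustration, per-character tokenization can be sketched in a few lines of Python (the function name is illustrative, not part of Rezonator's API):

```python
def tokenize_chinese(text: str) -> list[str]:
    """Treat every character as its own token, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]

print(tokenize_chinese("我喜欢猫"))  # -> ['我', '喜', '欢', '猫']
```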
Units. To identify Units in Chinese text, use punctuation. To create punctuation Units, use the following Unicode characters (hex values), which represent punctuation characters in Chinese:
0x3001 、 (ideographic comma)
0x3002 。 (ideographic full stop)
0xFF0C ， (fullwidth comma)
0xFF1A ： (fullwidth colon)
0xFF1B ； (fullwidth semicolon)
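A minimal sketch of Unit segmentation using the five punctuation characters above. Whether each mark stays attached to the Unit it closes or becomes its own token is a design choice; this sketch keeps it attached, which is an assumption about Rezonator's behavior:

```python
import re

# The five Unit-delimiting punctuation characters listed above.
UNIT_PUNCT = "\u3001\u3002\uff0c\uff1a\uff1b"  # 、 。 ， ： ；

def split_units(text: str) -> list[str]:
    """Split Chinese text into Units at the punctuation marks above,
    keeping each mark attached to the Unit it closes (an assumption)."""
    return re.findall(rf"[^{UNIT_PUNCT}]+[{UNIT_PUNCT}]?", text)

print(split_units("你好，世界。"))  # -> ['你好，', '世界。']
```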
Additional context
If a Chinese text contains words written in Latin script, tokenize those words based on spaces.
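Combining the two rules, a hedged sketch of mixed-script tokenization: Latin-script runs are kept whole (and thus separated by the surrounding spaces), while each Chinese character becomes its own token. The CJK range used here covers only the common Unified Ideographs block and is an assumption; an importer may use a broader definition:

```python
import re

def tokenize_mixed(text: str) -> list[str]:
    """One token per Chinese character; Latin/digit runs stay whole."""
    # \u4e00-\u9fff covers the main CJK Unified Ideographs block only.
    return re.findall(r"[A-Za-z0-9]+|[\u4e00-\u9fff]", text)

print(tokenize_mixed("我用Python写程序"))  # -> ['我', '用', 'Python', '写', '程', '序']
```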