Tokenize Chinese on characters #753

Open johnwdubois opened 3 years ago

johnwdubois commented 3 years ago

Background

  1. In Chinese text, characters (not spaces) provide an approximate solution for word tokenization.
    That is, every Chinese character can be treated as if it were a token (or word). While this does not produce an ideal tokenization, it is better than the available alternatives (short of using NLP algorithms).
  2. Tokenizing each character as a "word" is necessary to let users recognize resonance between characters, even when those characters are part of a compound word, which is common in Chinese.

What to do

  1. Tokens. When importing Chinese text data into Rezonator, tokenize on characters (not spaces). That is, treat every Chinese character as a token (or word); see the sketch after this list.
  2. Units. To identify Units in Chinese text, use punctuation. To segment these punctuation-delimited Units, split at the following Unicode characters (hex values), which represent common Chinese punctuation:
    • 0x3001 (、 ideographic comma)
    • 0x3002 (。 ideographic full stop)
    • 0xFF0C (， fullwidth comma)
    • 0xFF1A (： fullwidth colon)
    • 0xFF1B (； fullwidth semicolon)
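
A minimal sketch of the intended import behavior (in Python for illustration; the function name and the choice to drop the delimiter characters are assumptions, not Rezonator's actual code):

```python
# Unit-final punctuation for Chinese, per the list above.
UNIT_DELIMITERS = {
    "\u3001",  # 、 ideographic comma
    "\u3002",  # 。 ideographic full stop
    "\uff0c",  # ， fullwidth comma
    "\uff1a",  # ： fullwidth colon
    "\uff1b",  # ； fullwidth semicolon
}

def tokenize_chinese(text: str) -> list[list[str]]:
    """Split text into Units (lists of tokens), one token per character.

    A delimiter character closes the current Unit; every other
    non-space character becomes its own token. Dropping the delimiter
    itself is an assumption, not something this issue specifies.
    """
    units: list[list[str]] = []
    current: list[str] = []
    for char in text:
        if char in UNIT_DELIMITERS:
            if current:
                units.append(current)
                current = []
        elif not char.isspace():
            current.append(char)  # one character = one token
    if current:
        units.append(current)
    return units

# Two Units, each tokenized character by character:
print(tokenize_chinese("我喜欢你，你喜欢我。"))
# [['我', '喜', '欢', '你'], ['你', '喜', '欢', '我']]
```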

Additional context

If a Chinese text contains words written in Latin script, these should be tokenized based on spaces.
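
A sketch of one way to handle such mixed-script text, extending the idea above with a regular expression that keeps Latin-script runs intact (the pattern, including grouping digits with letters, is an assumption):

```python
import re

# One token per Latin-script run, one token per remaining non-space character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|\S")

def tokenize_mixed(text: str) -> list[str]:
    """Tokenize Chinese per character and Latin-script words per space-delimited run."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_mixed("我在 Google 工作"))
# ['我', '在', 'Google', '工', '作']
```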

hopesu21 commented 3 years ago

Chinese, Japanese, and anything else with CJK characters are tokenized per character. This should work best with Plain text, song, and verse. Testing in progress.
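
Deciding which characters get per-character tokenization presumably requires a CJK test; here is a sketch using some of the major CJK-related Unicode blocks (the block list is illustrative, not exhaustive):

```python
# Common CJK-related Unicode blocks (not exhaustive; rarer ideograph
# extensions are omitted here).
CJK_RANGES = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
]

def is_cjk(char: str) -> bool:
    """True if the character falls in one of the CJK blocks above."""
    code = ord(char)
    return any(lo <= code <= hi for lo, hi in CJK_RANGES)

print(is_cjk("汉"), is_cjk("a"))  # True False
```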