Add frequency of single characters

This pull request adds frequency data for single characters, so that common characters rank ahead of rare ones that share the same pinyin.

Before modification: (ni) 拟 > 你
After modification: (ni) 你 > 拟

Approach:

The Unicode Han Database (Unihan) contains a field, kHanyuPinlu, which records the frequency of single characters. If a character has multiple pronunciations in Putonghua (Mandarin Chinese), the frequency is given separately for each pronunciation.
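To make the per-pronunciation layout concrete, here is a minimal sketch that parses a raw kHanyuPinlu value. It assumes the raw Unihan syntax is a space-separated list of `reading(count)` entries; `parse_pinlu` is a hypothetical helper, not part of this PR, and the counts in the example are made up for illustration.

```python
import re

# Assumed raw syntax: space-separated "reading(count)" entries.
ENTRY = re.compile(r'(\S+?)\((\d+)\)')

def parse_pinlu(value):
    """Return a list of (pinyin_with_tone, frequency) pairs."""
    return [(py, int(count)) for py, count in ENTRY.findall(value)]

# Made-up counts for illustration only; these are not real Unihan values.
print(parse_pinlu('de(100) dì(20)'))
```

The script further down does not need this, because unihan-etl already expands the field into `phonetic` and `frequency` entries.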

However, Tongyong Guifan Hanzi Biao (通用规范汉字表) lists 8,105 characters, while kHanyuPinlu covers no more than 3,000. kHanyuPinlu alone therefore cannot supply a frequency for every character.

To fill the gap, the (character, pinyin) pairs from the original dictionary are kept and assigned frequencies as follows:

If the frequency of a (character, pinyin) pair can be found in kHanyuPinlu, use the frequency from kHanyuPinlu.
Otherwise,
- If the character is a level 1 character (一级字), assign a default frequency of 30.
- If the character is a level 2 character (二级字), assign a default frequency of 10.
- If the character is a level 3 character (三级字), assign a default frequency of 5.
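The rule above can be sketched as a small lookup with a fallback. This is only an illustration: `pick_frequency` and `DEFAULT_FREQ` are hypothetical names, and the count of 500 and the level assignments in the example are assumed values, not real data.

```python
# Default frequencies by character level, as listed above.
DEFAULT_FREQ = {1: 30, 2: 10, 3: 5}

def pick_frequency(pinlu, ch, py, level):
    """Prefer the kHanyuPinlu count; otherwise fall back to the level default."""
    return pinlu.get((ch, py), DEFAULT_FREQ[level])

# Hypothetical data: the count and the levels are illustrative only.
pinlu = {('你', 'ni'): 500}
candidates = [('拟', 'ni', 1), ('你', 'ni', 1)]

# Sort candidates for the same pinyin by descending frequency.
ranked = sorted(candidates, key=lambda c: pick_frequency(pinlu, *c), reverse=True)
print([ch for ch, _, _ in ranked])
```

Under these assumed values, 你 takes its count from kHanyuPinlu and 拟 falls back to the level 1 default, which reproduces the "after modification" ordering.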

Sample processing script:

Download Unihan data:

pip install unihan-etl
unihan-etl -F json -f kHanyuPinlu -d ./kHanyuPinlu.json

Generate dictionary:

from collections import defaultdict
import json

orig = 'āáǎàōóǒòēéěèīíǐìūúǔùüǖǘǚǜḿńňǹ'
subs = 'aaaaooooeeeeiiiiuuuuvvvvvmnnn'
trans_remove_tone = str.maketrans(orig, subs)

def remove_tone(s):
    '''
    >>> remove_tone('sān')
    'san'
    '''
    return s.translate(trans_remove_tone)

d = defaultdict(int)

with open('kHanyuPinlu.json', encoding='utf-8') as f:  # explicit encoding, so the platform default is not used
    for char_item in json.load(f):
        ch = char_item['char']
        for py_item in char_item['kHanyuPinlu']:
            py = remove_tone(py_item['phonetic'])
            freq = py_item['frequency']
            d[ch, py] += freq  # load frequency data from kHanyuPinlu

with open('original.txt', encoding='utf-8') as f, open('modified.txt', 'w', encoding='utf-8') as g:
    for line in f:
        ch, py, freq = line.rstrip().split('\t')
        freq = {'10': 30, '5': 10, '1': 5, '1g': 5}[freq]  # assign default frequency
        freq = d.get((ch, py), freq)  # if not found, use the default frequency
        print(ch, py, freq, file=g, sep='\t')
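The translation table in `remove_tone` only covers precomposed letters. If the input ever contains decomposed sequences (a base letter followed by a combining mark), a variant built on `unicodedata` may be more robust. The following is a sketch under that assumption; `remove_tone_nfd` is a hypothetical name and is not part of this PR.

```python
import unicodedata

def remove_tone_nfd(s):
    """Strip tone marks via NFD decomposition, writing ü as v."""
    decomposed = unicodedata.normalize('NFD', s)
    # After NFD, ü is 'u' followed by U+0308 (combining diaeresis); map it to 'v' first.
    decomposed = decomposed.replace('u\u0308', 'v')
    # Drop every remaining combining mark (the tone diacritics).
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(remove_tone_nfd('sān'), remove_tone_nfd('lǜ'))
```

It is intended to give the same results as the table-based version for the letters that table lists, while also accepting input that is already decomposed.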
