hosxy / rime-aurora-pinyin

Aurora Pinyin (极光拼音) input schema
Apache License 2.0
70 stars 3 forks

Add frequency of single characters #2

Closed ayaka14732 closed 4 years ago

ayaka14732 commented 4 years ago

This pull request adds frequency data for single characters.

Before modification: (ni) 拟 > 你
After modification: (ni) 你 > 拟

Approach:

The Unicode Han Database (Unihan) contains a field, kHanyuPinlu, that records frequency data for single characters. If a character has multiple pronunciations in Putonghua (Mandarin Chinese), the frequency is recorded separately for each pronunciation.

However, Tongyong Guifan Hanzi Biao (通用规范汉字表) contains 8,105 characters, while kHanyuPinlu covers no more than 3,000. The frequency data alone is therefore not enough to cover the whole dictionary.
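
To make the coverage gap concrete, here is a minimal sketch that counts how many (character, pinyin) entries of a dictionary have a frequency record. The function name `coverage` and all sample entries and numbers are invented for illustration; they are not taken from Unihan or from this repository.

```python
def coverage(dict_entries, freq_table):
    """Return (covered, total) for the distinct (char, pinyin) pairs."""
    entries = set(dict_entries)
    covered = sum(1 for key in entries if key in freq_table)
    return covered, len(entries)

# Hypothetical frequencies keyed by (character, toneless pinyin).
sample_freq = {('你', 'ni'): 1000, ('的', 'de'): 5000}
# Hypothetical dictionary entries; one pair has no frequency record.
sample_entries = [('你', 'ni'), ('拟', 'ni'), ('的', 'de')]

covered, total = coverage(sample_entries, sample_freq)
print(f'{covered}/{total} entries have frequency data')
```

Any pair that is not covered needs a fallback value, which is what the approach below provides.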

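To solve this, each (character, pinyin) pair in the original dictionary is looked up in kHanyuPinlu. If the pair is found, its kHanyuPinlu frequency is used; otherwise, the pair's original weight label is mapped to a default frequency.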
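
The lookup-with-fallback rule used by the script in this pull request can be sketched in isolation as follows. The default mapping is copied from that script; the helper name `resolve_freq` and the sample frequency table are assumptions made only for this example.

```python
# Mapping from the original weight labels to default frequencies,
# copied from the sample script in this pull request.
DEFAULT_FREQ = {'10': 30, '5': 10, '1': 5, '1g': 5}

def resolve_freq(ch, py, orig_weight, freq_table):
    """Prefer the recorded frequency; otherwise fall back to the default."""
    return freq_table.get((ch, py), DEFAULT_FREQ[orig_weight])

# Hypothetical frequency table with a single known pair.
freq_table = {('你', 'ni'): 1000}

print(resolve_freq('你', 'ni', '5', freq_table))   # pair found in the table
print(resolve_freq('拟', 'ni', '10', freq_table))  # falls back to the default
```

Because the defaults stay small (at most 30 here), a pair with a recorded frequency will usually outrank a pair that only has a default, although that depends on the actual values in kHanyuPinlu.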

Sample processing script:

Download Unihan data:

pip install unihan-etl
unihan-etl -F json -f kHanyuPinlu -d ./kHanyuPinlu.json

Generate dictionary:

from collections import defaultdict
import json

orig = 'āáǎàōóǒòēéěèīíǐìūúǔùüǖǘǚǜḿńňǹ'
subs = 'aaaaooooeeeeiiiiuuuuvvvvvmnnn'
trans_remove_tone = str.maketrans(orig, subs)

def remove_tone(s):
    '''
    >>> remove_tone('sān')
    'san'
    '''
    return s.translate(trans_remove_tone)

d = defaultdict(int)

with open('kHanyuPinlu.json', encoding='utf-8') as f:  # explicit encoding, independent of the platform default
    for char_item in json.load(f):
        ch = char_item['char']
        for py_item in char_item['kHanyuPinlu']:
            py = remove_tone(py_item['phonetic'])
            freq = py_item['frequency']
            d[ch, py] += freq  # load frequency data from kHanyuPinlu

with open('original.txt', encoding='utf-8') as f, open('modified.txt', 'w', encoding='utf-8') as g:
    for line in f:
        ch, py, freq = line.rstrip().split('\t')
        freq = {'10': 30, '5': 10, '1': 5, '1g': 5}[freq]  # assign default frequency
        freq = d.get((ch, py), freq)  # if not found, use the default frequency
        print(ch, py, freq, file=g, sep='\t')
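
One way to spot-check the result is to list the candidates for a given pinyin, highest frequency first. This is only a sketch: the helper name `candidates` and the sample lines (including their numbers) are hypothetical, and it assumes the tab-separated character, pinyin, frequency layout that the script writes.

```python
import io

def candidates(lines, pinyin):
    """Return the characters for `pinyin`, highest frequency first."""
    rows = []
    for line in lines:
        ch, py, freq = line.rstrip('\n').split('\t')
        if py == pinyin:
            rows.append((int(freq), ch))
    # sorted() is stable, so entries with equal frequency keep file order.
    return [ch for _, ch in sorted(rows, key=lambda r: r[0], reverse=True)]

# Hypothetical excerpt standing in for modified.txt.
sample = io.StringIO('拟\tni\t30\n你\tni\t1000\n')
print(candidates(sample, 'ni'))
```

Passing an open file object for modified.txt instead of the `StringIO` sample would run the same check on the real output.
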
CoelacanthusHex commented 4 years ago

Yay!