This pull request adds frequency data for single characters.
Before modification: (ni) 拟 > 你
After modification: (ni) 你 > 拟
Approach:
The Unicode Han Database (Unihan) has a kHanyuPinlu field that gives frequency data for single characters. If a character has multiple pronunciations in Putonghua (Mandarin Chinese), the frequency is given separately for each pronunciation.
However, Tongyong Guifan Hanzi Biao (通用规范汉字表) contains 8,105 characters, while kHanyuPinlu covers no more than 3,000. The kHanyuPinlu data alone therefore cannot supply a frequency for every character.
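For reference, a raw kHanyuPinlu value in the Unihan data files is a space-separated list of entries of the form reading(frequency). A minimal parsing sketch follows; the function name and the sample value are hypothetical and are not taken from this pull request.

```python
import re

# One entry looks like: reading(frequency), e.g. "zhōng(100)".
ENTRY = re.compile(r'([^\s()]+)\((\d+)\)')

def parse_hanyu_pinlu(value):
    """Parse a raw kHanyuPinlu value into (reading, frequency) pairs."""
    return [(reading, int(freq)) for reading, freq in ENTRY.findall(value)]

# Made-up value for illustration only; not real Unihan data.
print(parse_hanyu_pinlu('zhōng(100) zhòng(20)'))
```

Each reading keeps its tone mark at this stage; tones are only stripped later, when the readings are matched against the toneless pinyin of the dictionary.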
To solve this, the original (character, pinyin) data is used:
If the frequency of a (character, pinyin) pair can be found in kHanyuPinlu, use that frequency.
Otherwise, assign a default frequency based on the character's level in Tongyong Guifan Hanzi Biao:
Level 1 character (一级字): 30.
Level 2 character (二级字): 10.
Level 3 character (三级字): 5.
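The rule above can be sketched as a single lookup with a per-level fallback. The names DEFAULT_BY_LEVEL and pick_frequency are hypothetical, and the frequency of 1000 and the level passed for each character are made up for illustration.

```python
# Default frequencies for characters missing from kHanyuPinlu, keyed by level.
DEFAULT_BY_LEVEL = {1: 30, 2: 10, 3: 5}

def pick_frequency(pinlu, ch, py, level):
    """Return the kHanyuPinlu frequency for (ch, py) if present, else the level default."""
    return pinlu.get((ch, py), DEFAULT_BY_LEVEL[level])

# Made-up frequency table for illustration only.
pinlu = {('你', 'ni'): 1000}

print(pick_frequency(pinlu, '你', 'ni', 1))  # pair is in pinlu, so its frequency is used
print(pick_frequency(pinlu, '拟', 'ni', 1))  # pair is missing, so the level default is used
```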
Sample processing script (generates the dictionary):

from collections import defaultdict
import json

# Map each tone-marked letter to its toneless ASCII form; ü and its tones become v.
orig = 'āáǎàōóǒòēéěèīíǐìūúǔùüǖǘǚǜḿńňǹ'
subs = 'aaaaooooeeeeiiiiuuuuvvvvvmnnn'
trans_remove_tone = str.maketrans(orig, subs)

def remove_tone(s):
    '''
    >>> remove_tone('sān')
    'san'
    '''
    return s.translate(trans_remove_tone)

# (character, toneless pinyin) -> frequency from kHanyuPinlu
d = defaultdict(int)

with open('kHanyuPinlu.json', encoding='utf-8') as f:
    for char_item in json.load(f):
        ch = char_item['char']
        for py_item in char_item['kHanyuPinlu']:
            py = remove_tone(py_item['phonetic'])
            freq = py_item['frequency']
            d[ch, py] += freq  # readings that differ only in tone are summed

with open('original.txt', encoding='utf-8') as f, \
        open('modified.txt', 'w', encoding='utf-8') as g:
    for line in f:
        ch, py, freq = line.rstrip().split('\t')
        freq = {'10': 30, '5': 10, '1': 5, '1g': 5}[freq]  # assign default frequency
        freq = d.get((ch, py), freq)  # if not found, use the default frequency
        print(ch, py, freq, file=g, sep='\t')
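To sanity-check the result, the rows can be grouped by pinyin and sorted by frequency. This is only a sketch: rank_candidates is a hypothetical helper, the frequency of 1000 is made up, and the way the input method actually ranks candidates is not stated here.

```python
from collections import defaultdict

def rank_candidates(rows):
    """Group (char, pinyin, freq) rows by pinyin, highest frequency first."""
    by_py = defaultdict(list)
    for ch, py, freq in rows:
        by_py[py].append((ch, freq))
    return {
        py: [ch for ch, _ in sorted(items, key=lambda item: -item[1])]
        for py, items in by_py.items()
    }

# 30 is the level 1 default from the rule above; 1000 is a made-up frequency.
rows = [('拟', 'ni', 30), ('你', 'ni', 1000)]
print(rank_candidates(rows)['ni'])
```

With these illustrative values, 你 is listed before 拟, matching the intended order after modification.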