openvanilla / McBopomofo

小麥注音輸入法
http://mcbopomofo.openvanilla.org/
MIT License

Add variant characters from TBCL #491

Closed ChiahongHong closed 2 months ago

ChiahongHong commented 2 months ago

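This PR adds the characters from the Taiwan Benchmarks for the Chinese Language (TBCL) character list, published by the National Academy for Educational Research, that were missing from BPMFBase.txt. They all happen to be variant characters.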



pip install pandas openpyxl

import pandas as pd

df = pd.read_excel('臺灣華語文能力基準漢字表_111-09-20.xlsx', index_col=0)
charset = set(df['漢字'])

# deal with cases like 裡/裏 in the same field
for char in charset.copy():
    if '/' in char:
        sub1, sub2 = char.split('/')
        charset.add(sub1)
        charset.add(sub2)
        charset.remove(char)

# collect the first field (the character) of every entry in BPMFBase.txt
BPMFBase = set()
with open('BPMFBase.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        fields = line.split()
        if fields:  # skip blank lines
            BPMFBase.add(fields[0])

# characters in the TBCL list that BPMFBase.txt lacks
diff = sorted(charset.difference(BPMFBase))
print(diff)

['値', '却', '强', '愼', '擡', '擧', '朶', '烟', '牀', '眞', '羣', '踪', '躱', '鉢', '鷄']

xatier commented 2 months ago

Thank you!

xatier commented 2 months ago

@lukhnos, I noticed that we use utf8 in Source/Data/BPMFBase.txt, would you help update the wiki to mention when to use utf8 and big5 respectively? Thanks!

ChiahongHong commented 2 months ago

@xatier @lukhnos

The following characters in BPMFBase.txt are marked as big5 but are not in Big5 itself. The columns show whether the CP950 and HKSCS extensions cover them.

It seems that this field might have been deprecated?

Char CP950 HKSCS
Yes
Yes
Yes
Yes
Yes
Yes
Yes Yes
Yes Yes
Yes
Yes
Yes
Yes
Yes Yes
Yes
Yes
Yes
Yes
Yes
Yes Yes
Yes
Yes
Yes Yes
Yes
Yes
Yes Yes
Yes
Yes
Yes Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
𡻈 Yes

BTW, there's also an encoding cns (CNS11643) that's not described in the Wiki:

"Whether the character is included in the Big5 encoding; if not, it is marked as UTF-8."

cat BPMFBase.txt | grep cns
錱 ㄓㄣ zhen 5p cns
蟎 ㄇㄢˇ man3 a03 cns
躼 ㄔㄤˊ chang2 t;6 cns
畑 ㄊㄧㄢˊ tian2 wu06 cns
鱲 ㄌㄚˋ la4 x84 cns
栃 ㄌㄧˋ li4 xu4 cns
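
To see every label that appears in the last column, not just cns, a tally along these lines might help. This is only a sketch: `count_encoding_labels` is a hypothetical helper, and it assumes the label is always the last whitespace-separated field, as in the grep output above.

```python
from collections import Counter
import io


def count_encoding_labels(lines):
    """Tally the last whitespace-separated field of each non-empty line.

    Assumption: the encoding label is the final field of every entry.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # ignore blank lines
            counts[fields[-1]] += 1
    return counts


# A few entries copied from the grep output above, plus a blank line.
sample = io.StringIO(
    "錱 ㄓㄣ zhen 5p cns\n"
    "蟎 ㄇㄢˇ man3 a03 cns\n"
    "\n"
    "畑 ㄊㄧㄢˊ tian2 wu06 cns\n"
)
print(count_encoding_labels(sample))
```

Against the real file, the same function can be given the open file object, e.g. `with open('BPMFBase.txt', encoding='UTF-8') as f: print(count_encoding_labels(f))`.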

# collect every character whose last field is 'big5'
BPMFBase = set()

with open('BPMFBase.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        fields = line.split()
        if len(fields) == 5 and fields[4] == 'big5':
            BPMFBase.add(fields[0])

print('Char\tCP950\tHKSCS')
print('═════════════════════')

for char in sorted(BPMFBase):
    # skip characters that plain Big5 can encode
    try:
        char.encode('big5')
        continue
    except UnicodeEncodeError:
        print(char, end='\t')

    # CP950 column
    try:
        char.encode('cp950')
        print('Yes', end='\t')
    except UnicodeEncodeError:
        print(end='\t')

    # HKSCS column
    try:
        char.encode('big5-hkscs')
        print('Yes')
    except UnicodeEncodeError:
        print()

xatier commented 2 months ago

@ChiahongHong thanks for delving into this! This table is certainly helpful.

We should definitely update the wiki if these are not really used and perhaps run a script to remove this field.
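
A script for that could look roughly like the sketch below. It is not an agreed-on implementation: `strip_encoding_field` is a hypothetical name, and it assumes each entry has exactly five whitespace-separated fields with one of the labels seen in this thread (big5, utf8, cns) in the last position. It also rejoins fields with single spaces, which may not match the file's actual separator. Whether any build tooling still reads the field would need checking first.

```python
def strip_encoding_field(line, labels=("big5", "utf8", "cns")):
    """Return the entry without its trailing encoding label.

    Assumptions: five whitespace-separated fields per entry, label last.
    Lines that do not match that shape are returned unchanged
    (minus the trailing newline).
    """
    fields = line.split()
    if len(fields) == 5 and fields[-1] in labels:
        return " ".join(fields[:-1])
    return line.rstrip("\n")


# Entries copied from the grep output earlier in the thread.
for entry in ["錱 ㄓㄣ zhen 5p cns\n", "栃 ㄌㄧˋ li4 xu4 cns\n"]:
    print(strip_encoding_field(entry))
```

Leaving non-matching lines untouched keeps the sketch conservative: anything that is not a five-field entry passes through as-is instead of being truncated.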