Add variant characters from TBCL

ChiahongHong commented 2 months ago

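This PR adds the characters that appear in the character list of the Taiwan Benchmarks for the Chinese Language (TBCL), published by the National Academy for Educational Research (NAER), but are missing from BPMFBase.txt. They all happen to be variant characters.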

鷄 (雞) https://dict.revised.moe.edu.tw/dictView.jsp?ID=87742
踪 (蹤) https://dict.revised.moe.edu.tw/dictView.jsp?ID=140255
擡 (抬) https://dict.revised.moe.edu.tw/dictView.jsp?ID=49515
眞 (真) https://dict.revised.moe.edu.tw/dictView.jsp?ID=116408
値 (值) https://dict.revised.moe.edu.tw/dictView.jsp?ID=113518
愼 (慎) https://dict.revised.moe.edu.tw/dictView.jsp?ID=131685
羣 (群) https://dict.revised.moe.edu.tw/dictView.jsp?ID=103808
烟 (煙) https://dict.revised.moe.edu.tw/dictView.jsp?ID=154841
朶 (朵) https://dict.revised.moe.edu.tw/dictView.jsp?ID=47940
躱 (躲) https://dict.revised.moe.edu.tw/dictView.jsp?ID=47948
强 (強) https://dict.revised.moe.edu.tw/dictView.jsp?ID=95422
却 (卻) https://dict.revised.moe.edu.tw/dictView.jsp?ID=103351
牀 (床) https://dict.revised.moe.edu.tw/dictView.jsp?ID=126489
擧 (舉) https://dict.revised.moe.edu.tw/dictView.jsp?ID=96768
鉢 (缽) https://dict.revised.moe.edu.tw/dictView.jsp?ID=12947

Downloads for the reference guidelines, technical reports, and character and word lists:
- https://coct.naer.edu.tw/download/tech_report/
- 臺灣華語文能力基準漢字表_111-09-20.xlsx

pip install pandas openpyxl

import pandas as pd

df = pd.read_excel('臺灣華語文能力基準漢字表_111-09-20.xlsx', index_col=0)
charset = set(df['漢字'])

# Some cells hold several forms in one field, e.g. 裡／裏; split them into separate characters.
for char in charset.copy():
    if '／' in char:
        charset.remove(char)
        charset.update(char.split('／'))

# Collect the first column (the character) of every non-empty line in BPMFBase.txt.
with open('BPMFBase.txt', 'r', encoding='UTF-8') as f:
    BPMFBase = {line.split()[0] for line in f if line.strip()}

diff = charset.difference(BPMFBase)
diff = sorted(diff)
print(diff)

['値', '却', '强', '愼', '擡', '擧', '朶', '烟', '牀', '眞', '羣', '踪', '躱', '鉢', '鷄']
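
The script above only finds the missing characters; the thread does not show how their entries were written. One possible approach is sketched below. It assumes that each variant shares every reading of its standard form and that each line has the five whitespace-separated fields seen in the BPMFBase.txt excerpts in this thread. The function name `variant_entries`, the mapping argument, and the default `utf8` label are all hypothetical, not taken from the PR.

```python
def variant_entries(base_lines, variant_to_standard, encoding_label="utf8"):
    """Build candidate lines for variants by copying their standard forms' readings.

    Assumes five whitespace-separated fields per line:
    character, bopomofo, pinyin, key sequence, encoding label.
    """
    entries = []
    for line in base_lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        char, bopomofo, pinyin, keys, _ = fields
        for variant, standard in variant_to_standard.items():
            if standard == char:
                # Reuse the reading columns; only the character and label change.
                entries.append(" ".join([variant, bopomofo, pinyin, keys, encoding_label]))
    return entries
```

Passing a mapping such as `{'鷄': '雞', '踪': '蹤'}` together with the lines of BPMFBase.txt would yield one candidate line per reading of each standard form. The result still needs manual review, since a variant may not carry every reading of its standard character.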

xatier commented 2 months ago

Thank you!

xatier commented 2 months ago

@lukhnos, I noticed that we use utf8 in Source/Data/BPMFBase.txt. Could you help update the wiki to explain when to use utf8 and when to use big5? Thanks!

ChiahongHong commented 2 months ago

@xatier @lukhnos

The following characters in BPMFBase.txt are marked as big5 but are not part of standard Big5. The table shows whether each one is covered by the CP950 or HKSCS extensions.

It seems that this field might have been deprecated?

Char	CP950	HKSCS
〇		Yes
亖
兲
叁		Yes
叄
咔		Yes
喆		Yes
坂		Yes
坔		Yes
墻	Yes	Yes
夶
嫺	Yes	Yes
寗		Yes
尛
弌		Yes
弍		Yes
弎		Yes
彞
恒	Yes	Yes
晧		Yes
温		Yes
焿
犇		Yes
着		Yes
砈		Yes
碁	Yes	Yes
礴		Yes
竈		Yes
粧	Yes	Yes
絝		Yes
綉		Yes
裏	Yes	Yes
酶		Yes
醩		Yes
銹	Yes	Yes
鍅		Yes
闘
鬦
鬪		Yes
鬬
鬭		Yes
魩		Yes
鮟		Yes
鱇
鴴		Yes
麪		Yes
黒
𡻈		Yes

BTW, there's also an encoding cns (CNS11643) that's not described in the Wiki:

Whether the character is included in the Big5 encoding; if not, it is marked as UTF-8.

grep cns BPMFBase.txt

錱 ㄓㄣ zhen 5p cns
蟎 ㄇㄢˇ man3 a03 cns
躼 ㄔㄤˊ chang2 t;6 cns
畑 ㄊㄧㄢˊ tian2 wu06 cns
鱲 ㄌㄚˋ la4 x84 cns
栃 ㄌㄧˋ li4 xu4 cns
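
To see every label that actually occurs in the encoding column, not just `cns`, a tally of the last field can help. This is a minimal sketch that assumes the label is always the last whitespace-separated field on a line; `count_encoding_labels` is a hypothetical helper name.

```python
from collections import Counter


def count_encoding_labels(lines):
    """Count the values of the last whitespace-separated field on each non-empty line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # ignore blank lines
            counts[fields[-1]] += 1
    return counts


# Two of the lines quoted above, used as in-memory sample input.
sample = ["錱 ㄓㄣ zhen 5p cns", "蟎 ㄇㄢˇ man3 a03 cns"]
print(count_encoding_labels(sample))  # Counter({'cns': 2})
```

Running it over the lines of BPMFBase.txt (opened with `encoding='UTF-8'`) would list each label with its count, which makes undocumented values easy to spot.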

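def can_encode(char, codec):
    """Return True if char can be encoded with the given codec."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Collect every character whose encoding field is 'big5'.
BPMFBase = set()

with open('BPMFBase.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        char, _, _, _, encoding = line.split()
        if encoding == 'big5':
            BPMFBase.add(char)

print('Char\tCP950\tHKSCS')
print('═════════════════════')

# List the characters marked big5 that plain Big5 cannot encode,
# and note whether the CP950 or HKSCS extensions cover them.
for char in sorted(BPMFBase):
    if can_encode(char, 'big5'):
        continue
    cp950 = 'Yes' if can_encode(char, 'cp950') else ''
    hkscs = 'Yes' if can_encode(char, 'big5-hkscs') else ''
    print(char, cp950, hkscs, sep='\t')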

xatier commented 2 months ago

@ChiahongHong thanks for delving into this! This table is certainly helpful.

We should definitely update the wiki if these are not really used and perhaps run a script to remove this field.
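
A script to remove the field could look roughly like the sketch below. It assumes that every data line has exactly five whitespace-separated fields with the encoding label last, as the script earlier in this thread does, and it leaves any other line untouched. `drop_encoding_field` is a hypothetical name, and whether other code in the project reads this field has not been checked here.

```python
def drop_encoding_field(line):
    """Return the line without its trailing encoding field.

    Lines that do not have exactly five fields are returned unchanged
    (minus the trailing newline) so that nothing unexpected is lost.
    """
    fields = line.split()
    if len(fields) != 5:
        return line.rstrip("\n")
    return " ".join(fields[:4])
```

Applying it to each line of BPMFBase.txt and writing the results to a new file would allow the diff to be reviewed before the original is replaced.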

openvanilla / McBopomofo

Add variant characters from TBCL #491