nk2028 / tshet-uinh-data

Data of the Qieyun phonological system
Creative Commons Zero v1.0 Universal

Change data format #2

Closed: ayaka14732 closed this 3 years ago

ayaka14732 commented 3 years ago

Change the data format to make use of QieyunEncoder v0.2.x. The new format is easier to maintain.

Sample build script:

Input: 廣韻(20170209).xls, preprocessed into .csv format, containing these columns:

廣韻反切(覈校後),廣韻字頭(覈校後),廣韻釋義,釋義補充,聲紐,呼,等,韻部(調整後),聲調

Python script:

from QieyunEncoder import to描述

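with open('src.csv', encoding='utf-8') as f, open('data.csv', 'w', encoding='utf-8') as g:
    # skip header
    next(f)

    for line in f:
        try:
            反切, 字頭, 解釋, 補充, 母, 呼, 等, 韻, 聲 = line.rstrip('\n').split(',')
        except ValueError:
            # Wrong number of fields: report the line and skip it,
            # so that values from the previous row are not reused
            print(line)
            continue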

        # Split the chongniu (重紐) marker from the rhyme (韻)
        重紐 = 韻[1:]
        韻 = 韻[:1]

        # Normalize variant characters (異體字)
        if 母 == '群':
            母 = '羣'
        elif 母 == '娘':
            母 = '孃'

        if 韻 == '真':
            韻 = '眞'

        # Drop redundant attributes
        if not (母 in '幫滂並明見溪羣疑影曉' and 韻 in '支脂祭眞仙宵清侵鹽'):
            重紐 = None
        if 母 in '幫滂並明' or 韻 in '東冬鍾江虞模尤幽':
            呼 = None

        # Homophone groups (小韻) without a fanqie (反切)
        if len(反切) != 2:
            反切 = ''

        描述 = to描述(母, 呼, 等, 重紐, 韻, 聲)

        print(描述, 反切, 字頭, 解釋, sep=',', file=g)
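The attribute clean-up in the script can also be read as a pure function, which makes the rules easier to check in isolation. The following is a minimal sketch, not part of the original script: the name `normalize_attrs` and the constant names are hypothetical, it does not call QieyunEncoder, and it uses sets instead of substring checks on the assumption that every initial and rhyme is a single character.

```python
# Hypothetical helper mirroring the normalization rules of the build script.
# Sets replace the original substring checks; this assumes single-character
# initials (母) and rhymes (韻).
重紐母 = frozenset('幫滂並明見溪羣疑影曉')  # initials that can carry a chongniu marker
重紐韻 = frozenset('支脂祭眞仙宵清侵鹽')    # rhymes that can carry a chongniu marker
脣音母 = frozenset('幫滂並明')              # initials for which 呼 is dropped
無呼韻 = frozenset('東冬鍾江虞模尤幽')      # rhymes for which 呼 is dropped


def normalize_attrs(母, 呼, 等, 韻部, 聲):
    """Return (母, 呼, 等, 重紐, 韻, 聲) with variants unified and redundant fields set to None."""
    # Split the chongniu marker from the rhyme, e.g. '支A' -> ('支', 'A')
    重紐 = 韻部[1:]
    韻 = 韻部[:1]

    # Unify variant characters
    母 = {'群': '羣', '娘': '孃'}.get(母, 母)
    韻 = {'真': '眞'}.get(韻, 韻)

    # Drop attributes that are redundant for this initial/rhyme combination
    if not (母 in 重紐母 and 韻 in 重紐韻):
        重紐 = None
    if 母 in 脣音母 or 韻 in 無呼韻:
        呼 = None

    return 母, 呼, 等, 重紐, 韻, 聲
```

The returned tuple is in the same order as the arguments the script passes to `to描述`, so under these assumptions the loop body could unpack it directly into that call.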