openvanilla / McBopomofo

小麥注音輸入法
http://mcbopomofo.openvanilla.org/
MIT License

WIP: Fix more pronunciations #504

Closed xatier closed 1 month ago

xatier commented 1 month ago

This is a follow-up to @ChiahongHong's #503.

When reviewing that PR, I noticed a pattern of inconsistent (and often incorrect) pronunciations within phrase groups like these:

節省 ㄐㄧㄝˊ ㄕㄥˇ
節省下 ㄐㄧㄝˊ ㄒㄧㄥˇ ㄒㄧㄚˋ
節省下來 ㄐㄧㄝˊ ㄒㄧㄥˇ ㄒㄧㄚˋ ㄌㄞˊ

著色 ㄓㄨㄛˊ ㄙㄜˋ
著色劑 ㄓㄜ˙ ㄙㄜˋ ㄐㄧˋ

遺傳學 ㄧˊ ㄔㄨㄢˊ ㄒㄩㄝˊ
遺傳學家 ㄧˊ ㄓㄨㄢˋ ㄒㄩㄝˊ ㄐㄧㄚ

Thanks to the previous work of sorting the dictionary in order, we can easily locate these issues. I wrote another script to find these cases.

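#!/usr/bin/env python

DICT = "Source/Data/BPMFMappings.txt"

# how many following entries to compare against each two-character phrase
N = 30

with open(DICT, encoding="utf-8") as f:
    # each entry is (phrase, [syllables]); blank lines are skipped
    entries = [(p[0], p[1:]) for p in (line.split() for line in f) if p]

for i, (t, u) in enumerate(entries):
    # exclude phrases containing these characters
    if "一" in t or "不" in t:
        continue

    if len(t) != 2:
        continue

    # iterate through the next N phrases; the slice is clamped at the end
    # of the file, so the last N entries are checked as well
    for t1, u1 in entries[i + 1 : i + 1 + N]:
        if len(t1) == 2:
            continue

        # report longer phrases that start with t but read
        # its first two syllables differently
        if t1.startswith(t) and u1[:2] != u[:2]:
            print(f"{t} {' '.join(u)}    {t1} {' '.join(u1)}")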
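To make the check easier to try without the dictionary file, here is a minimal sketch that runs the same idea on the example entries quoted earlier in this comment. The function name `find_inconsistent` and the `window` parameter are hypothetical names chosen for illustration; they are not part of the PR.

```python
def find_inconsistent(entries, window=30):
    """Yield (base, longer) pairs whose first two syllables disagree.

    entries: list of (phrase, [syllables]) tuples in dictionary order.
    window:  how many following entries to compare (hypothetical parameter).
    """
    for i, (phrase, reading) in enumerate(entries):
        if len(phrase) != 2:
            continue
        for other, other_reading in entries[i + 1 : i + 1 + window]:
            if (
                len(other) > 2
                and other.startswith(phrase)
                and other_reading[:2] != reading[:2]
            ):
                yield (phrase, reading), (other, other_reading)


# The example entries from this comment, one "phrase syllable..." per line
SAMPLE = """\
節省 ㄐㄧㄝˊ ㄕㄥˇ
節省下 ㄐㄧㄝˊ ㄒㄧㄥˇ ㄒㄧㄚˋ
節省下來 ㄐㄧㄝˊ ㄒㄧㄥˇ ㄒㄧㄚˋ ㄌㄞˊ
著色 ㄓㄨㄛˊ ㄙㄜˋ
著色劑 ㄓㄜ˙ ㄙㄜˋ ㄐㄧˋ
遺傳學 ㄧˊ ㄔㄨㄢˊ ㄒㄩㄝˊ
遺傳學家 ㄧˊ ㄓㄨㄢˋ ㄒㄩㄝˊ ㄐㄧㄚ
"""

entries = [(p[0], p[1:]) for p in (line.split() for line in SAMPLE.splitlines())]
hits = list(find_inconsistent(entries))

for (base, reading), (longer, longer_reading) in hits:
    print(f"{base} {' '.join(reading)}    {longer} {' '.join(longer_reading)}")
```

On this sample the sketch reports 節省下, 節省下來, and 著色劑. It does not report 遺傳學家, because only two-character phrases are used as the base and the sample contains no two-character entry for 遺傳.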

I went through the ~1300 lines of reports and manually checked them against Moe dict and other dictionaries. In this PR, I have tried my best to make these phrases consistent and correct. Some pronunciations are commonly used even though they differ from the dictionaries (and are, strictly speaking, incorrect); I kept those to preserve usability.

Another interesting finding is that the following characters are often ambiguous, so I provided both available forms for them: 亞個得波播怎麼子露埔液雌
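One way to surface such ambiguous characters is to tally the syllables each character receives across all entries. The sketch below is only an illustration: `readings_by_char` is a hypothetical helper, it assumes one syllable per character, and the three entries are made-up inputs built from readings mentioned in this thread, not lines copied from the dictionary file.

```python
from collections import defaultdict


def readings_by_char(entries):
    """Map each character to the set of syllables it is given across entries.

    Assumes one syllable per character, i.e. len(phrase) == len(syllables).
    """
    seen = defaultdict(set)
    for phrase, syllables in entries:
        if len(phrase) != len(syllables):
            continue  # skip entries that break the one-to-one assumption
        for ch, syl in zip(phrase, syllables):
            seen[ch].add(syl)
    return seen


# Illustrative entries (assumed, not taken from BPMFMappings.txt)
entries = [
    ("亞洲", ["ㄧㄚˇ", "ㄓㄡ"]),
    ("亞洲", ["ㄧㄚˋ", "ㄓㄡ"]),
    ("亞軍", ["ㄧㄚˋ", "ㄐㄩㄣ"]),
]

# Keep only the characters that have more than one reading
ambiguous = {
    ch: syls for ch, syls in readings_by_char(entries).items() if len(syls) > 1
}

for ch, syls in ambiguous.items():
    print(ch, ", ".join(sorted(syls)))
```

A character showing up here is only a candidate: whether both readings should be kept still needs the manual dictionary check described above.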

xatier commented 1 month ago

@lukhnos please DO NOT merge this just yet; I'd like to rebase onto master once #503 is merged.

@ChiahongHong please kindly help spot issues if you get a chance :pray:

Review link: https://github.com/openvanilla/McBopomofo/pull/504/commits/10c4f5efd67878dd047b6624635b7623f2d79652

ChiahongHong commented 1 month ago

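Below are a few words and phrases I picked at random from your changes in https://github.com/openvanilla/McBopomofo/commit/10c4f5efd67878dd047b6624635b7623f2d79652 to test. They show the parts that may need to change if we want to stay consistent.

Perhaps we can fix this in stages: first correct only the clearly wrong readings (such as 音樂 ㄧㄣ ㄌㄜˋ), then handle the colloquial, error-tolerant, and standard readings separately, character by character.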

AUDIT = {
    '音樂': {'ㄧㄣ ㄩㄝˋ'},
    '波': {'ㄅㄛ', 'ㄆㄛ'},
    '亞洲': {'ㄧㄚˇ ㄓㄡ', 'ㄧㄚˋ ㄓㄡ'},
    '亞細亞': {'ㄧㄚˇ ㄒㄧˋ ㄧㄚˇ', 'ㄧㄚˋ ㄒㄧˋ ㄧㄚˋ'},
    '亞軍': {'ㄧㄚˇ ㄐㄩㄣ', 'ㄧㄚˋ ㄐㄩㄣ'},
    '倒出': {'ㄉㄠˇ ㄔㄨ', 'ㄉㄠˋ ㄔㄨ'},
    '液': {'ㄧˋ', 'ㄧㄝˋ'},
    '冠狀': {'ㄍㄨㄢ ㄓㄨㄤˋ', 'ㄍㄨㄢˋ ㄓㄨㄤˋ'},
    '鰻': {'ㄇㄢˊ', 'ㄇㄢˋ'},
    '黏膜': {'ㄋㄧㄢˊ ㄇㄛˊ', 'ㄋㄧㄢˊ ㄇㄛˋ'}
}

data = {}

with open('BPMFMappings.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, bpmf = line.split(maxsplit=1)
        data.setdefault(word, set()).add(bpmf)

for char, audit_pronuns in AUDIT.items():
    first = True
    for word, bpmfs in data.items():
        if char not in word:
            continue

        # slice out the syllables that correspond to `char` inside `word`
        # (only the first occurrence of `char` is checked)
        index = word.index(char)
        pronuns = {' '.join(bpmf.split()[index:index + len(char)]) for bpmf in bpmfs}

        missing = sorted(audit_pronuns - pronuns)
        unexpected = sorted(pronuns - audit_pronuns)
        if not missing and not unexpected:
            continue

        # print the section and word headers before either kind of finding,
        # so an "Unexpected" line never appears without its word
        if first:
            print(f'\n## {char}')
            first = False
        print(f'  - {word}')
        if missing:
            print('    - Missing:    ', ', '.join(missing))
        if unexpected:
            print('    - Unexpected: ', ', '.join(unexpected))
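The core step of the audit, slicing out the syllables that belong to the audited word and comparing them with the expected set, can be tried on its own. This is a minimal sketch: `audit_word` is a hypothetical helper, and the "dictionary" readings passed to it are made-up inputs for illustration (the first one is the wrong reading of 音樂 cited above).

```python
def audit_word(word, readings, target, expected):
    """Compare the readings of `target` inside `word` with `expected`.

    word:     dictionary phrase containing `target`
    readings: set of space-separated readings recorded for `word`
    expected: set of readings that `target` should have
    Returns (missing, unexpected) as sorted lists.
    """
    start = word.index(target)  # first occurrence only
    found = {" ".join(r.split()[start:start + len(target)]) for r in readings}
    return sorted(expected - found), sorted(found - expected)


# A wrong reading for 音樂: the expected one is missing, the recorded one is unexpected
missing, unexpected = audit_word("音樂", {"ㄧㄣ ㄌㄜˋ"}, "音樂", {"ㄧㄣ ㄩㄝˋ"})
print("Missing:", missing)
print("Unexpected:", unexpected)

# Only one of two accepted readings is recorded (assumed input)
missing2, unexpected2 = audit_word(
    "亞軍", {"ㄧㄚˋ ㄐㄩㄣ"}, "亞軍", {"ㄧㄚˇ ㄐㄩㄣ", "ㄧㄚˋ ㄐㄩㄣ"}
)
print("Missing:", missing2)
print("Unexpected:", unexpected2)
```

Returning both lists together keeps the two kinds of finding attached to the same word, which makes it straightforward to print them under a single heading.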

音樂

亞洲

亞細亞

亞軍

倒出

冠狀

黏膜

xatier commented 1 month ago

Good advice. I've fixed the ones you mentioned (亙剖那). We can then use this change set as a reference and open a series of PRs, one for each group.