openvanilla / McBopomofo

McBopomofo (小麥注音) input method
http://mcbopomofo.openvanilla.org/
MIT License

Fix more phonetics for 不 #464

Closed xatier closed 4 months ago

xatier commented 4 months ago

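If 不 is followed by a first-, second-, or third-tone syllable, that 不 should be fourth tone.
If 不 is followed by a fourth-tone syllable, that 不 should be second tone.
If 不 is at the end of a phrase, either second or fourth tone is acceptable.

Some short-phrase input habits are preserved (e.g., ㄅㄨˋ ㄏㄨㄟˋ).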
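The three rules above can be condensed into a small helper. This is a minimal sketch, not code from the PR: the name `expected_bu_tones` is hypothetical, a syllable without a tone mark is assumed to be first tone, and a following neutral tone (˙) is left without an expectation because the rules do not mention it.

```python
# Hypothetical helper illustrating the rules above; not part of the PR.
SECOND, FOURTH = 'ㄅㄨˊ', 'ㄅㄨˋ'


def expected_bu_tones(next_syllable=None):
    """Return the readings of 不 that the rules call for.

    next_syllable is the Bopomofo reading that follows 不, or None when
    不 is the last character of the phrase.
    """
    if next_syllable is None:
        # 不 at the end of a phrase: both tones are acceptable.
        return {SECOND, FOURTH}
    if next_syllable.endswith('ˋ'):
        # Followed by a fourth tone: 不 takes the second tone.
        return {SECOND}
    if next_syllable.endswith('˙'):
        # Neutral tone is not covered by the rules (assumption: no expectation).
        return set()
    # Followed by a first (unmarked), second, or third tone: fourth tone.
    return {FOURTH}
```

For instance, `expected_bu_tones('ㄒㄧㄥˊ')` yields only the fourth-tone reading, while `expected_bu_tones()` yields both.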

xatier commented 4 months ago

@lukhnos apologies for the big PR; I noticed that many of the pronunciations of 不 are incorrect.

I ran a small script over all 不-related phrases and reviewed the results manually. I believe this PR should allow most Taiwanese users to type 不-related phrases with more natural pronunciations.

with open('a') as f:
    lines = f.readlines()

    for line in lines:
        try:
            line = line.strip()
            k = line.split()

            # can be either 2 or 4
            if k[0].endswith('不'):
                continue

            i = k[0].index('不')
            p = k[1 + i]
            n = k[1 + i + 1]

            # next char is 1 2 3 -> p needs to be 4
            #if (((not n.endswith('ˋ')) and
            #     (not n.endswith('˙')))) and p.endswith('ˊ'):
            #    print(f'{k[0]}, {p} {n}: {line}')
            #    # rep = line.replace('ㄅㄨˊ', 'ㄅㄨˋ')
            #    # print(rep)

            # next char is 4 -> p needs to be 2
            #if n.endswith('ˋ') and p.endswith('ˋ'):
            #    print(f'{k[0]}, {p} {n}: {line}')
            #    # rep = line.replace('ㄅㄨˋ', 'ㄅㄨˊ')
            #    # print(rep)
        except IndexError:
            print(f'no! {line}')

ChiahongHong commented 4 months ago

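This is the phonological phenomenon of tone sandhi (變調), not an error. Besides 不, the same situation arises with 一 and with third-tone sandhi. The Ministry of Education's 國語辭典簡編本 (Concised Mandarin Chinese Dictionary) generally records both the original tone and the sandhi tone.

For example: 不克 https://dict.concised.moe.edu.tw/dictView.jsp?ID=2059

Other dictionaries, such as the idiom dictionary (成語典), usually record only the original tone and do not include the sandhi tone. https://dict.idioms.moe.edu.tw/idiomView.jsp?ID=23335

Perhaps entries whose original tone was correct could be kept rather than deleted, with the sandhi reading simply added alongside.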

xatier commented 4 months ago

I looked up all the deleted terms (~300 phrases) on dict.concised.moe.edu.tw, and it seems that for the majority of them this only applies when the following syllable ends with ˋ (fourth tone).

list=$(cat a.txt)

for w in $list; do
    xdg-open "https://dict.concised.moe.edu.tw/search.jsp?md=1&word=$w#searchL"
done
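Opening roughly 300 browser tabs at once can be unwieldy, so an alternative is to generate the lookup URLs and open them in batches. The following is a minimal sketch under that assumption: it reuses the search URL from the loop above, `lookup_urls` is a hypothetical name, and the word list is passed in directly rather than read from `a.txt`.

```python
from urllib.parse import quote

# Search URL copied from the shell loop above; {} is replaced by the word.
BASE = 'https://dict.concised.moe.edu.tw/search.jsp?md=1&word={}#searchL'


def lookup_urls(words):
    """Build one percent-encoded dictionary search URL per word."""
    return [BASE.format(quote(word)) for word in words]


# Example with a single word quoted elsewhere in this thread.
for url in lookup_urls(['不克']):
    print(url)
```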

ChiahongHong commented 4 months ago

Yes, tone sandhi (變調) of 不 occurs only when the following syllable has a fourth tone. Below is a simple, though not necessarily exhaustive, command for detecting this condition, provided for reference.

cat BPMFMappings.txt | grep 不 | grep -E "ㄅㄨ[ˊˋ] [ㄅ-ㄩ]{1,3}ˋ"

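[ㄅ-ㄩ]: this range runs from ㄅ to ㄩ, the last of the standard Mandarin phonetic symbols in the Unicode Bopomofo block (the symbols after ㄩ, such as ㄪ, ㄫ, and ㄬ, are not used in standard Mandarin).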
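For readers who prefer Python to a shell pipeline, the same filter can be sketched with the `re` module. This is an illustrative equivalent, not something posted in the thread: `find_bu_before_fourth_tone` is a hypothetical name, and the sample lines are entries quoted elsewhere in this discussion.

```python
import re

# Same pattern as the grep -E expression above: a ㄅㄨ reading in second or
# fourth tone, a space, then one to three Bopomofo symbols ending in ˋ.
PATTERN = re.compile(r'ㄅㄨ[ˊˋ] [ㄅ-ㄩ]{1,3}ˋ')


def find_bu_before_fourth_tone(lines):
    """Return the lines that contain 不 and match PATTERN."""
    return [line for line in lines if '不' in line and PATTERN.search(line)]


sample = [
    '戰無不勝 ㄓㄢˋ ㄨˊ ㄅㄨˊ ㄕㄥˋ',
    '不行 ㄅㄨˋ ㄒㄧㄥˊ',
    '三不政策 ㄙㄢ ㄅㄨˋ ㄓㄥˋ ㄘㄜˋ',
]
for line in find_bu_before_fourth_tone(sample):
    print(line)
```

Like the grep version, this matches on the reading alone, so a ㄅㄨˊ or ㄅㄨˋ that belongs to a character other than 不 in the same phrase would also be reported.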

tianjianjiang commented 4 months ago

@xatier @ChiahongHong Thank you both for the discussion.

BLUF

Context

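Perhaps entries whose original tone was correct could be kept rather than deleted, with the sandhi reading simply added alongside.

As far as I remember, this was indeed a special case introduced for the tone sandhi of 一 and 不. However, for many words the original tone was never included (purely out of habit), so I think this PR is a great help in improving consistency. Incidentally, the neutral-tone sandhi of 一 and 不 was also never included (again, purely out of habit). As for consecutive third-tone sandhi (連上變調), as in 總統, only the original tone is still included.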

Note

I will tweak the description of this PR a little, because it contains some very interesting word-segmentation ambiguities. 🤞

xatier commented 4 months ago

Splitting this PR into pieces is a time-consuming process, and I do not see much real gain from it, given that the current file is not sorted in any order.

I'd like to have the PR merged directly; otherwise I can close this one and someone else can split it into smaller chunks.

tianjianjiang commented 4 months ago

Splitting this PR into pieces is a time-consuming process, and I do not see much real gain from it, given that the current file is not sorted in any order.

I'd like to have the PR merged directly; otherwise I can close this one and someone else can split it into smaller chunks.

In that case, it will take a while for me to verify the consistency. I will temporarily make this PR a draft until I'm confident that I've read every change. +CC @lukhnos @mjhsieh @zonble

tianjianjiang commented 4 months ago

Splitting this PR into pieces is a time-consuming process, and I do not see much real gain from it, given that the current file is not sorted in any order. I'd like to have the PR merged directly; otherwise I can close this one and someone else can split it into smaller chunks.

In that case, it will take a while for me to verify the consistency. I will temporarily make this PR a draft until I'm confident that I've read every change. +CC @lukhnos @mjhsieh @zonble

@xatier I've annotated several rows as examples of how to maintain consistency. It seems feasible for you to change all of them automatically at once, as long as your script is slightly modified to keep the original tones. +CC @lukhnos @mjhsieh @zonble
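One way such a modification could look: for every entry where 不 is read ㄅㄨˋ before a fourth-tone syllable, keep the line and emit the ㄅㄨˊ variant alongside it. This is a hedged sketch, not the script actually used in the PR. `with_sandhi_variant` is a hypothetical name, and the code assumes each entry is a phrase followed by one space-separated reading per character, as in the lines quoted in this thread.

```python
def with_sandhi_variant(entry):
    """Return [entry] plus a sandhi variant wherever the rule applies.

    Assumes the layout 'phrase reading1 reading2 ...' with exactly one
    reading per character of the phrase.
    """
    fields = entry.split()
    phrase, readings = fields[0], fields[1:]
    if len(readings) != len(phrase):
        # Layout assumption violated; leave the entry untouched.
        return [entry]
    results = [entry]
    for i, ch in enumerate(phrase):
        is_bu = ch == '不' and readings[i] == 'ㄅㄨˋ'
        has_next = i + 1 < len(readings)
        if is_bu and has_next and readings[i + 1].endswith('ˋ'):
            # Keep the original line and add the second-tone variant.
            variant = readings[:i] + ['ㄅㄨˊ'] + readings[i + 1:]
            results.append(' '.join([phrase] + variant))
    return results
```

Because the original line is always kept, the output only ever grows the dictionary. Generated variants would still need manual review, since some phrases are exceptions to the rule, and a phrase containing more than one 不 gets one variant per occurrence rather than every combination.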

xatier commented 4 months ago

The only deletion of ㄅㄨˋ:

$ git diff d954057d4d5a35aea358521110a8399214028fd3 | ag '^-' | ag 'ㄅㄨˋ'
-鍥而不舍 ㄑㄧㄝˋ ㄦˊ ㄅㄨˋ ㄕㄜˋ
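The same check can be sketched in Python for anyone without `ag` installed. This is an illustrative equivalent only: `removed_lines_containing` is a hypothetical name, and the sample diff is hand-written around the single deleted line shown above rather than copied from the real diff. Unlike `ag '^-'`, it also skips the `---` file-header lines.

```python
def removed_lines_containing(diff_text, needle):
    """Return the removed lines of a unified diff that contain needle."""
    return [
        line for line in diff_text.splitlines()
        if line.startswith('-')
        and not line.startswith('---')  # skip '--- a/path' file headers
        and needle in line
    ]


# Hand-written sample for illustration; not the actual PR diff.
sample_diff = '\n'.join([
    '--- a/Source/Data/BPMFMappings.txt',
    '+++ b/Source/Data/BPMFMappings.txt',
    ' 不行 ㄅㄨˋ ㄒㄧㄥˊ',
    '-鍥而不舍 ㄑㄧㄝˋ ㄦˊ ㄅㄨˋ ㄕㄜˋ',
])
print(removed_lines_containing(sample_diff, 'ㄅㄨˋ'))
```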

xatier commented 4 months ago

I tweaked my script to perform a more thorough check on all 不 phrases. We should see only one exception after all these commits.

$ python b.py 
[-] 三不政策 ㄙㄢ ㄅㄨˊ ㄓㄥˋ ㄘㄜˋ

三不政策 ㄙㄢ ㄅㄨˋ ㄓㄥˋ ㄘㄜˋ is proper, and we don't need to add 三不政策 ㄙㄢ ㄅㄨˊ ㄓㄥˋ ㄘㄜˋ to the dictionary.

# Usage:
#
# grep 不 Source/Data/BPMFMappings.txt > input.txt
# python bu.py

with open('input.txt') as f:
    lines = f.readlines()
    d = {line.strip() for line in lines}

    if len(d) != len(lines):
        print("dups!")

    # x should be present in d
    def check(x, d):
        if x not in d:
            print(f'[-] {x}')

    # x should not be present in d
    def check_not(x, d):
        if x in d:
            print(f'[-] {x}')

    for line in lines:
        try:
            line = line.strip()
            k = line.split()

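            # If 不 is at the end, both the second- and fourth-tone
            # readings should be present
            if k[0].endswith('不'):
                check(line[:-3] + 'ㄅㄨˊ', d)
                check(line[:-3] + 'ㄅㄨˋ', d)
                continue

            # i: index of 不 within the phrase
            # p: reading of 不
            # n: reading of the character that follows 不
            i = k[0].index('不')
            p = k[1 + i]
            n = k[1 + i + 1]

            # If 不 is followed by a first, second, or third tone,
            # 不 should be fourth tone (and the second-tone line removed)
            if (((not n.endswith('ˋ')) and
                 (not n.endswith('˙')))) and p.endswith('ˊ'):

                # idx (not f) so the file handle name is not shadowed
                idx = line.index('ㄅㄨˊ')
                check(line[:idx] + 'ㄅㄨˋ' + line[idx + 3:], d)
                check_not(line, d)

            # If 不 is followed by a fourth tone, 不 should be second tone,
            # and the fourth-tone 不 (original tone) should also exist
            if n.endswith('ˋ') and p.endswith('ˋ'):
                idx = line.index('ㄅㄨˋ')
                check(line[:idx] + 'ㄅㄨˊ' + line[idx + 3:], d)
                check(line[:idx] + 'ㄅㄨˋ' + line[idx + 3:], d)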

        except IndexError:
            print(f'no! {line}')

xatier commented 4 months ago

@lukhnos I believe this PR is ready for review. I've fixed all the consistency issues.

After this PR gets merged, we should update the wiki [1] with the following:

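Correcting pronunciations (修正讀音)

On the tone of 不: please refer to the rules below and to the discussion in PR #464.

If 不 is followed by a first, second, or third tone, 不 should be fourth tone; add only the fourth-tone reading to the dictionary.
If 不 is followed by a fourth tone, 不 should be second tone (sandhi), but the fourth-tone 不 (original tone) should also be added to the dictionary.
If 不 is at the end of a phrase, both the second- and fourth-tone readings should be added to the dictionary.

Examples:

Fourth tone only
不丹 ㄅㄨˋ ㄉㄢ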
神智不清 ㄕㄣˊ ㄓˋ ㄅㄨˋ ㄑㄧㄥ
不行 ㄅㄨˋ ㄒㄧㄥˊ
不調 ㄅㄨˋ ㄊㄧㄠˊ
離不了 ㄌㄧˊ ㄅㄨˋ ㄌㄧㄠˇ
為時不遠 ㄨㄟˊ ㄕˊ ㄅㄨˋ ㄩㄢˇ

Both second and fourth tones (sandhi + original)
戰無不勝 ㄓㄢˋ ㄨˊ ㄅㄨˊ ㄕㄥˋ
戰無不勝 ㄓㄢˋ ㄨˊ ㄅㄨˋ ㄕㄥˋ

Both second and fourth tones
也不 ㄧㄝˇ ㄅㄨˋ
也不 ㄧㄝˇ ㄅㄨˊ

[1] https://github.com/openvanilla/McBopomofo/wiki/%E8%A9%9E%E5%BA%AB%E9%96%8B%E7%99%BC%E8%AA%AA%E6%98%8E#%E4%BF%AE%E6%AD%A3%E8%AE%80%E9%9F%B3

xatier commented 4 months ago

@lukhnos thanks for merging this gigantic PR! Thanks everyone for the review and suggestions.