openvanilla / McBopomofo

McBopomofo (小麥注音) input method
http://mcbopomofo.openvanilla.org/
MIT License

Keep BPMFMappings.txt and phrase.occ sorted #477

Closed lukhnos closed 3 months ago

lukhnos commented 3 months ago

Fixes #447

This also deduplicates phrase.occ, which fixes a few issues with phrases such as 電子 (vs 靛紫) that were caused by inadvertent duplicates with much lower scores.

phrase.occ was sorted and deduplicated with a one-time script: https://gist.github.com/lukhnos/2d4c628e2690b5407777f2b75f699a89
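
For readers who don't want to open the gist, here is a minimal Python sketch of what a sort-and-dedup pass over phrase.occ could look like. It is not the script linked above. It assumes each line is a phrase followed by a whitespace-separated integer count, that the higher count is the one to keep when a phrase appears twice, and that byte order (what a C-locale sort gives) is the intended order. The function name and the sample counts are made up for illustration.

```python
def sort_and_dedup(lines):
    """Keep one entry per phrase (highest count) and sort in byte order.

    Assumption: every non-empty line is "<phrase><whitespace><count>".
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        phrase, count = line.split()  # fails loudly on unexpected formats
        count = int(count)
        # Assumption: when a phrase is duplicated, the higher count wins.
        if phrase not in best or count > best[phrase]:
            best[phrase] = count
    entries = [f"{phrase}\t{count}" for phrase, count in best.items()]
    # Sorting by UTF-8 bytes mirrors what a C-locale sort would do.
    return sorted(entries, key=lambda entry: entry.encode("utf-8"))


# Hypothetical sample: 電子 appears twice, once with a much lower count.
sample = ["電子\t12", "墊子\t300", "電子\t4500"]
print("\n".join(sort_and_dedup(sample)))
```

Keeping the highest count is only one possible policy; summing the counts or keeping the first occurrence would be equally easy to swap in.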

cc @xatier

xatier commented 3 months ago

Thank you so much for this change! This will make the dictionary much easier to maintain. Let's also update the wiki once this is merged [1]

[1] https://github.com/openvanilla/McBopomofo/wiki/%E8%A9%9E%E5%BA%AB%E9%96%8B%E7%99%BC%E8%AA%AA%E6%98%8E

lukhnos commented 3 months ago

> @lukhnos I've tried to read the diffs, but they are too complex to verify consistently, so I'll just approve this and then plan to reimplement an algorithm to verify it systematically.

Sorry for making this so complex. Here is the diff between the current phrase.occ (first sorted with LC_ALL=C sort, with all tabs replaced by single spaces) and the one in the PR: diff. 26 entries were removed, and that's how I discovered that those duplicates had contributed to issues such as "電子" not always coming before "墊子".

I'll merge this now, and hopefully after the sorting we'll have an easier time reviewing these files.
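
The comparison described above (normalize tabs to spaces, sort in C-locale byte order, then diff) can be approximated in Python roughly as follows. This is a sketch under assumptions, not the commands that were actually run: the helper names, the file labels, and the sample entries are all invented for illustration.

```python
import difflib


def normalize(text):
    """Replace tabs with single spaces, then sort lines in byte order."""
    lines = [line.replace("\t", " ") for line in text.splitlines() if line]
    return sorted(lines, key=lambda s: s.encode("utf-8"))


def diff_entries(old_text, new_text):
    """Return unified-diff lines between the two normalized files."""
    return list(difflib.unified_diff(
        normalize(old_text), normalize(new_text),
        fromfile="phrase.occ (current)", tofile="phrase.occ (PR)",
        lineterm="", n=0))


# Hypothetical contents: the old file carries a low-count duplicate of 電子.
old = "電子\t4500\n墊子\t300\n電子\t12\n"
new = "墊子 300\n電子 4500\n"

# Removed entries are diff lines starting with "-" (skipping the "---" header).
removed = [line[1:] for line in diff_entries(old, new)
           if line.startswith("-") and not line.startswith("---")]
print(removed)
```

Because both sides are normalized the same way, only real additions and removals show up, not reordering or whitespace noise.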

tianjianjiang commented 3 months ago

@lukhnos No worries at all and thank you so much for the clarification!

lukhnos commented 3 months ago

> Let's also update the wiki once this is merged

Added a section: https://github.com/openvanilla/McBopomofo/wiki/詞庫開發說明#請確保-bpmfmappingtxt-以及-phraseocc-兩個檔案的詞條排序
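
One way to keep the two files sorted going forward would be a small check that can run before committing. The following Python sketch is only an illustration and is not part of the repository as far as this thread shows: it assumes byte order is the required order and treats an exact repeated line as a duplicate. The function name is hypothetical.

```python
def check_sorted_unique(lines):
    """Return (line_number, problem) pairs for misordered or repeated lines.

    Assumption: the required order is UTF-8 byte order over whole lines.
    """
    problems = []
    prev = None
    for number, line in enumerate(lines, start=1):
        key = line.rstrip("\n").encode("utf-8")
        if prev is not None:
            if key == prev:
                problems.append((number, "duplicate"))
            elif key < prev:
                problems.append((number, "out of order"))
        prev = key
    return problems


# Hypothetical entries; an empty result means the list is sorted and unique.
print(check_sorted_unique(["墊子 300", "電子 4500"]))
print(check_sorted_unique(["電子 4500", "墊子 300", "墊子 300"]))
```

Comparing whole lines rather than just the phrase keeps the check usable for BPMFMappings.txt as well, where the same phrase may legitimately appear with more than one reading.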