openvanilla / McBopomofo

McBopomofo Input Method (小麥注音輸入法)
http://mcbopomofo.openvanilla.org/
MIT License

Covering ambiguity improvements #332

Open lukhnos opened 2 years ago

lukhnos commented 2 years ago

> Thanks for the insightful contribution. I could run the analysis without any problem, and I hadn't realized the prevalence of, for example, conflicting/shadowing 3-syllable scripts such as (電子式) vs (電子)(是) and (微電腦) vs (為)(電腦).

Glad to hear that it works for you. BTW, technically those are "covering" ambiguities. Conflicting ambiguities are more like (成分)(子) vs. (成)(分子). Not really important for the discussion here, though.
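To make the distinction concrete, here is a minimal sketch that classifies a pair of segmentations of the same syllable sequence. It assumes one character per syllable, and the function names (`spans`, `crosses`, `classify`) are made up for illustration; none of this is McBopomofo code.

```python
def spans(segments):
    """Turn a non-empty list of segments into (start, end) syllable spans."""
    out, start = [], 0
    for seg in segments:
        out.append((start, start + len(seg)))
        start += len(seg)
    return out


def crosses(x, y):
    """True if two spans overlap without one containing the other."""
    (s1, e1), (s2, e2) = x, y
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def classify(seg_a, seg_b):
    """Classify the ambiguity between two segmentations of equal length."""
    a, b = spans(seg_a), spans(seg_b)
    if a[-1][1] != b[-1][1]:
        raise ValueError("segmentations must cover the same number of syllables")
    if a == b:
        return "identical"
    # Any pair of crossing spans makes the pair conflicting.
    if any(crosses(x, y) for x in a for y in b):
        return "conflicting"
    # Otherwise every differing region is a longer unit split into shorter ones.
    return "covering"


print(classify(["電子式"], ["電子", "是"]))
print(classify(["成分", "子"], ["成", "分子"]))
```

With this definition, (電子式) vs (電子)(是) comes out as covering because no span crosses another, while (成分)(子) vs (成)(分子) comes out as conflicting because 成分 and 分子 overlap on 分.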

> Is there anything you think we can pursue? The improvements in #329 should mitigate the issues with 電子式 and 微電腦 once the user has chosen the candidate, but it's still odd that phrases like 工作證 are not chosen in the first place. Is there anything we can do for those?

Yes, there are some options, IMO. They are not mutually exclusive; choosing among them is probably only a matter of difficulty and urgency.

I will elaborate on (1.b), (3), and (4) at the bottom of this comment. Also, some of the above, especially (4), are related to https://github.com/openvanilla/McBopomofo/blob/master/Source/Data/bin/buildFreq.py#L51

Meanwhile, it would be nice to enhance https://github.com/openvanilla/McBopomofo/blob/master/Source/Data/bin/self-score-test.py with the approach of this PR and #329. However, because GitHub Actions limits free running time, the implementation in this PR would need a faster algorithm before it is suitable for CI/CD, which is why I marked this PR as a draft. Even better would be to test the engine directly.
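As a rough illustration of the kind of check such a test could run, the sketch below flags a phrase as shadowed when some split into shorter entries outscores it. It rests on assumptions that are not confirmed in this thread: scores are treated as additive log-probabilities, the highest total is taken to win, and the lexicon layout, the scores, and the names `TOY_LEXICON`, `best_split`, and `is_shadowed` are all hypothetical. It is not the algorithm used in this PR or in self-score-test.py.

```python
# Hypothetical toy lexicon: tuple of readings -> (phrase, score).
# The scores are invented for illustration only.
TOY_LEXICON = {
    ("ㄉㄧㄢˋ", "ㄗˇ", "ㄕˋ"): ("電子式", -7.5),
    ("ㄉㄧㄢˋ", "ㄗˇ"): ("電子", -3.5),
    ("ㄕˋ",): ("是", -2.0),
}


def best_split(readings, lexicon):
    """Best-scoring split of `readings` into two or more lexicon entries.

    Returns (total_score, phrases), or None if no such split exists.
    """
    n = len(readings)
    # best[i] holds (score, phrases) for the best split of readings[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue
            if start == 0 and end == n:
                continue  # skip the entry that covers the whole sequence
            entry = lexicon.get(tuple(readings[start:end]))
            if entry is None:
                continue
            phrase, score = entry
            candidate = (best[start][0] + score, best[start][1] + [phrase])
            if best[end] is None or candidate[0] > best[end][0]:
                best[end] = candidate
    return best[n]


def is_shadowed(readings, lexicon):
    """True if the whole-sequence entry loses to a split into shorter entries."""
    whole = lexicon.get(tuple(readings))
    split = best_split(readings, lexicon)
    return whole is not None and split is not None and split[0] > whole[1]


readings = ["ㄉㄧㄢˋ", "ㄗˇ", "ㄕˋ"]
print(is_shadowed(readings, TOY_LEXICON), best_split(readings, TOY_LEXICON))
```

In this toy data, (電子)(是) totals -5.5 against -7.5 for 電子式, so the longer phrase is reported as shadowed. The dynamic program is quadratic in the number of syllables per phrase, which may or may not be fast enough for CI.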

For #300, I would also recommend including user-defined scores in the test. One could then see whether the scoring function, or even the entire online learning algorithm, needs to change.


Elaborations

(4) can help a lot when used together with (3), since (3) usually doesn't come with segmentations.

Originally posted by @tianjianjiang in https://github.com/openvanilla/McBopomofo/issues/330#issuecomment-1186871718

tianjianjiang commented 1 year ago

Gosh, it's been almost a year... I am going to refresh my memory on this and see what I can do.