wanleung / libcangjie

CangJie Input Method Library
GNU Lesser General Public License v3.0

For the sequence of wildcard or quick #14

Open wanleung opened 11 years ago

wanleung commented 11 years ago

After using the frequency data, the candidates are mixed up between Simplified Chinese and Traditional Chinese, which is really not user-friendly for Hong Kong people.

H*A 香 白 简 簡 籍 昏 稽 舶 稲 筍 簪 箔 徇 箸 昝 昝 艚 稭 皛 氇 艪 昋 徻 氆 穭 氌 簮 馫 稥 徣 簎 濌 凮 凬 軄 箺 穞 稓 毥 晵

In the first 10 choices, 3 and 9 are Simplified Chinese and the others are Traditional Chinese. Also, 3 and 4 are the same character: 3 is the simplified form and 4 is the traditional form.

I*I 为 府 魔 魔 底 讨 谢 麼 麼 冷 诗 廚 広 禱 戍 讼 麽 祷 飆 麝 祗 祛 廕 诋 廆 祔 禨 戔 凇 庒 禌 讱 尃 螚 诪 厽 礿 庤 庅 厸 蠯 螷 禣

Same here: 1, 4, 6, 7, 11, 13, and 16 are Simplified Chinese.

The frequency data should be separated into 2 tables: one for Simplified Chinese and the other for Traditional Chinese. There should also be a setting to choose between Traditional first and Simplified first.

Benau commented 11 years ago

if a user doesn't choose "also suggest simplified chinese characters" in ibus-cangjie, SC won't appear in the candidates, and it will become:

ha: '香', '白', '簡', '籍', '昏', '稽', '舶', '筍', '簪', '箔', '徇', '箸', '昝', '艚', '皛', '昋', '徻', '氆', '穭', '氌', '馫', '簎', '濌', '稓'

ii: '府', '魔', '麼', '冷', '廚', '禱', '戍', '飆', '麝', '祛', '廕', '廆', '祔', '禨', '戔', '尃', '螚', '礿', '庤', '蠯', '螷'

unless we want a "SC" only version of cangjie, which is table-related (need a table without TC).

bochecha commented 11 years ago

You guys are talking about 2 different use cases.

@Benau: yes, when filtering only TC then of course the candidates are only TC.

But the point @wanleung makes is interesting nevertheless: how do we want to order TC and SC relative to each other?

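For the record, ibus-table has this ChineseMode option, which allows setting both a filter and the ordering. It takes several values:

0 means to show simplified Chinese only
1 means to show traditional Chinese only
2 means to show all characters but show simplified Chinese first
3 means to show all characters but show traditional Chinese first
4 means to show all characters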

Now the last thing I want is to replicate something like that in ibus-cangjie. Asking the user this way is a cop out, we can do better.

But we need to think hard about what we want the user to see, and who our users are.

The primary target users for ibus-cangjie are Hong Kong people, who mostly write TC. So ordering TC before SC makes sense for ibus-cangjie.

Now, who is the primary target user for libcangjie?

bochecha commented 11 years ago

So, we just talked about this with @wanleung, and we agreed that libcangjie should remain a generic library, that is, it should deal with mechanism and not with policy.

This will allow user components (like ibus-cangjie) to implement whatever policy they deem better for their users.

So for example, ibus-cangjie could decide to present TC first, and SC next, but libcangjie wouldn't force that, another libcangjie user could decide to put SC first and TC next, or to mix them all, or...

That will require some work though, the current situation isn't satisfying: in effect, right now libcangjie is doing both mechanism and policy: the policy to mix everything altogether. :)
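To illustrate the split between mechanism and policy, here is a minimal sketch in Python. Everything in it is hypothetical: `Candidate`, its `simplified` flag, `lookup`, `tc_first`, and `sc_first` are invented for illustration and are not the actual libcangjie or ibus-cangjie API, and the frequency numbers are made up. The library side only returns candidates with their metadata; each client applies its own ordering.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    char: str
    frequency: int     # made-up numbers, for illustration only
    simplified: bool   # hypothetical flag: True for an SC-only character


# Hypothetical table for the "h*a" example from this issue.
_TABLE = {
    "h*a": [
        Candidate("香", 40, False),
        Candidate("白", 30, False),
        Candidate("简", 20, True),
        Candidate("簡", 10, False),
    ],
}


def lookup(code: str) -> List[Candidate]:
    """Mechanism: return every match with its metadata, no TC/SC preference."""
    return list(_TABLE.get(code, []))


def tc_first(candidates: List[Candidate]) -> List[Candidate]:
    """Policy a client such as ibus-cangjie might pick: TC before SC."""
    return sorted(candidates, key=lambda c: (c.simplified, -c.frequency))


def sc_first(candidates: List[Candidate]) -> List[Candidate]:
    """Policy another client might pick: SC before TC."""
    return sorted(candidates, key=lambda c: (not c.simplified, -c.frequency))


if __name__ == "__main__":
    print([c.char for c in tc_first(lookup("h*a"))])
    print([c.char for c in sc_first(lookup("h*a"))])
```

With this shape, changing the ordering never requires touching the lookup code, which is the point of keeping policy out of the library.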

wanleung commented 11 years ago

Cangjie was mainly designed for TC, although it can also support SC, and mainland Chinese people don't use Cangjie at all. SC is included just as a convenience for TC people who know how to decompose SC characters.

So there is no point in putting SC at the front. Of course, as @bochecha said, libcangjie is a generic lib, so I won't sort the returned results and will just let the engine sort them. (Or I could provide functions or settings to return a sorted list if another engine wants that: TC before SC, or SC before TC.)

All the frequency data are generated from the wiki, and the wiki has TC and SC versions, so I need separate frequency data: one set for TC and one for SC. Why do I need this? Because some SC characters are a subset of TC, and those characters may differ between TC and SC.

But beware that even in the TC version of the wiki, there are still quite a lot of places where SC characters appear by mistake.
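Separate TC and SC frequency data could be built along these lines. This is only a sketch: the two sample strings stand in for real TC and SC wiki text, and `count_characters` is a hypothetical helper, not part of libcangjie.

```python
from collections import Counter


def count_characters(text: str) -> Counter:
    """Count characters in the basic CJK Unified Ideographs block only.

    Punctuation, Latin text, and ideographs outside U+4E00..U+9FFF are
    ignored; a real tool would likely need to cover the extension blocks too.
    """
    return Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")


# Stand-ins for the TC and SC editions of the same article.
tc_sample = "香港的簡體字與繁體字，簡單說明。"
sc_sample = "香港的简体字与繁体字，简单说明。"

tc_freq = count_characters(tc_sample)
sc_freq = count_characters(sc_sample)

if __name__ == "__main__":
    # A character shared by both scripts gets its own count in each table.
    print(tc_freq["香"], sc_freq["香"])
    # A script-specific form shows up in one table only.
    print(tc_freq["簡"], sc_freq["簡"])
```

Because stray SC characters do appear in TC text, counts taken this way would still need cleaning before being trusted as TC-only data.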

Benau commented 11 years ago

"It is because SC has some chars are a subset of TC, and those char in TC and SC may be different.

But beware that even in the TC version of wiki, there are still quite a lot place have SC char mistake."

http://en.wikipedia.org/wiki/Ambiguities_in_Chinese_character_simplification

Did you mean these?

My thoughts: we regenerate our frequency database with 2 more dbs:

A: db with all SC set to -1
B: db with all TC set to -1

but leaving out the characters from the wiki page above, as some of them are written the same way in both SC and TC.

so, from @bochecha:

0 means to show simplified Chinese only
1 means to show traditional Chinese only
2 means to show all characters but show simplified Chinese first
3 means to show all characters but show traditional Chinese first
4 means to show all characters

0: use the original freq table with cjx-sc (so TC will be filtered out) / use the B freq table with all -1 entries dropped
1: use the original freq table with cjx-tc (so SC will be filtered out) / use the A freq table with all -1 entries dropped
2: use the B freq table
3: use the A freq table
4: use the current method

A normal HK citizen should choose the A freq table.
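A minimal sketch of this proposal, assuming a tiny made-up frequency table: `ORIGINAL`, `SIMPLIFIED`, `demote`, and `candidates` are hypothetical names, and the numbers are invented. Tables A and B are derived by forcing one script's frequencies to -1, so those characters either sort last or get dropped.

```python
from typing import Dict, List, Set

# Made-up frequencies for the first few "h*a" candidates.
ORIGINAL: Dict[str, int] = {"香": 40, "白": 30, "简": 20, "簡": 10}
# Hypothetical classification; ambiguous characters would be left out of it.
SIMPLIFIED: Set[str] = {"简"}
TRADITIONAL: Set[str] = set(ORIGINAL) - SIMPLIFIED


def demote(freq: Dict[str, int], chars: Set[str]) -> Dict[str, int]:
    """Return a copy of freq where every character in chars gets -1."""
    return {ch: (-1 if ch in chars else f) for ch, f in freq.items()}


TABLE_A = demote(ORIGINAL, SIMPLIFIED)    # all SC set to -1
TABLE_B = demote(ORIGINAL, TRADITIONAL)   # all TC set to -1


def candidates(mode: int) -> List[str]:
    """Order candidates for modes 0 to 4 as listed above."""
    table = {0: TABLE_B, 1: TABLE_A, 2: TABLE_B, 3: TABLE_A, 4: ORIGINAL}[mode]
    if mode in (0, 1):
        # "with all -1 dropped": filter the demoted script out entirely.
        table = {ch: f for ch, f in table.items() if f != -1}
    # Highest frequency first; the sort is stable, so ties keep table order.
    return sorted(table, key=lambda ch: -table[ch])


if __name__ == "__main__":
    for mode in range(5):
        print(mode, candidates(mode))
```

One consequence visible in the sketch: every demoted character shares the same -1 value, so the relative order within the demoted script depends only on table order, not on real frequency.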

better now?

If you think it's a good idea, i will start working on it.

bochecha commented 11 years ago

so, from @bochecha:

0 means to show simplified Chinese only
1 means to show traditional Chinese only
2 means to show all characters but show simplified Chinese first
3 means to show all characters but show traditional Chinese first
4 means to show all characters

Do not do that!

It was an example of what ibus-table is doing, and we should not replicate others' mistakes.

Benau commented 11 years ago

@bochecha has a better idea for filtering SC/TC, as discussed on IRC.

so forget about my previous comment

wanleung commented 11 years ago

I need 2 different dbs: one from TC (or you can use http://humanum.arts.cuhk.edu.hk/Lexis/lexi-can/ if the license allows) and one from SC. That's it.

I will also import the old Quick 3 table, which was made by Roy Chan on Linux, for the Windows order.

If possible, don't mess with the libcangjie code for now. Leave the current frequency version as it is. We have to think more about what we are working on.

mahiuchun commented 11 years ago

If you want an "established" order, you may also consider the data from the OpenVanilla project or the now open-sourced Yahoo KeyKey!

bochecha commented 11 years ago

Thanks for the comment @maxiaojun, it is very helpful, especially since you provided a link to these projects you mention.

So I had a look at Yahoo KeyKey, which I believe is https://github.com/yahoo/KeyKey/

But I honestly don't see how that would be useful for us.

The only thing I could find are these 3 tables of characters:

But they don't seem to contain any ordering/frequency information.

I also looked at OpenVanilla, is this https://github.com/lukhnos/openvanilla/?

Same here, I can find some character tables, but I can't find where their ordering/frequency information is encoded.

You seem to imply that they provide good ordering of candidates, in a form we could reuse here, can you give us more details?

bochecha commented 11 years ago

I believe this is not a problem any more, now that we have the classic frequency?