Open soliviantar opened 3 years ago
It's not random, exactly. I'm pretty sure I know why it's happening - it's pulling all the dictionary entries into an array and then selecting the first reading if there's more than one.
It shouldn't happen, but I've never settled on a strategy for dealing with ambiguous entries. The solution will probably involve using some frequency data to weight the choices.
I'll take another look at this soon.
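A rough sketch of what the frequency weighting could look like. The `pick_reading` helper and the weights are hypothetical, made up for illustration; they are not the add-on's actual code or real corpus data.

```python
from typing import Mapping, Sequence


def pick_reading(readings: Sequence[str], freq: Mapping[str, float]) -> str:
    """Return the reading with the highest frequency weight.

    Readings without frequency data count as 0.0. On a tie, the earliest
    entry wins, so with no data at all this degrades to "take the first
    reading", which is the behaviour described above.
    """
    if not readings:
        raise ValueError("no readings to choose from")
    # max() returns the first maximal element, so list order breaks ties.
    return max(readings, key=lambda r: freq.get(r, 0.0))


# Illustrative weights only (assumed, not measured).
weights = {"shuo1": 0.98, "shui4": 0.02}
print(pick_reading(["shui4", "shuo1"], weights))  # shuo1
```

The hard part is still finding the weights; the selection step itself is small.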
I think you need to remove/exclude/bypass single-character words from the word dictionary. That way it can fall back to the individual hanzi data (which provides the most common reading).
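Something along these lines, as a minimal sketch. `word_dict` and `hanzi_dict` are stand-ins for whatever the add-on actually loads from the database, and the sample entries are assumptions for illustration.

```python
def transcribe_word(word, word_dict, hanzi_dict):
    """Look up a reading, bypassing the word dictionary for single characters.

    word_dict:  maps a word to a list of readings (hypothetical shape).
    hanzi_dict: maps a single character to its most common reading.
    """
    if len(word) > 1 and word in word_dict:
        return word_dict[word][0]
    # Single characters and unknown words fall back to per-character data.
    # Characters missing from hanzi_dict are passed through unchanged.
    return " ".join(hanzi_dict.get(ch, ch) for ch in word)


# Sample data, assumed for illustration.
word_dict = {"说": ["shui4", "shuo1"], "说话": ["shuo1 hua4"]}
hanzi_dict = {"说": "shuo1", "话": "hua4"}

print(transcribe_word("说", word_dict, hanzi_dict))    # shuo1
print(transcribe_word("说话", word_dict, hanzi_dict))  # shuo1 hua4
```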
There are also some multi-character words with multiple readings, so if you wanted to be thorough you could try and get some reading frequency data for those and/or use some kind of heuristic to pick the best one.
Here's a query that can help examine:
SELECT * FROM (
SELECT *, count(*) OVER (PARTITION BY simplified) AS count
FROM cidian c
WHERE length(simplified) > 1
)
WHERE count > 2
ORDER BY simplified, traditional, pinyin;
Note that a lot of those rows have the same pinyin reading, just minor differences in notation format, so you/we might want to improve on the normalization in the update script so they're considered equal.
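For the normalization, a sketch of a comparison key might look like this. I haven't listed exactly which notation differences show up in the database, so the rules below are assumptions (`u:` and `v` as spellings of `ü`, case differences, and stray whitespace), and `normalize_pinyin` is a hypothetical helper, not something in the update script.

```python
import re
import unicodedata


def normalize_pinyin(pinyin: str) -> str:
    """Return a key for comparing numbered-pinyin strings.

    Lowercasing drops proper-noun capitalization, so use the result only
    to decide whether two readings are equal, not for display.
    """
    key = unicodedata.normalize("NFC", pinyin.strip().lower())
    # "u:" is the CC-CEDICT spelling of ü; "v" is a common keyboard stand-in.
    key = key.replace("u:", "ü").replace("v", "ü")
    # Collapse runs of whitespace between syllables.
    return re.sub(r"\s+", " ", key)


print(normalize_pinyin("Lu:4  you2") == normalize_pinyin("lv4 you2"))  # True
```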
@noinkling
Based on the data available in CC-CEDICT, I don't think any workable heuristic exists with what we currently have. Is there another data source we can pull from to check lexical usage frequency?
I ran into this problem as well. Finding comprehensive frequency data based on pronunciation may be very difficult. Instead, or in the meantime, we could add a manually curated file that sets overrides for common polyphonic words and characters (shuo1, dou1, he2, etc.). 啊 has literally every tone in chinese.db (a, a1, a2, a3, a4, a5)! The manually curated file would pick a canonical pinyin for each character/word, and we could also allow users to override that.
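To make the layering concrete, here is a sketch of the lookup order I have in mind: user overrides first, then the curated file, then whatever the database returns. `make_lookup` and `db_first_reading` are hypothetical names, and the curated entries are just examples.

```python
from collections import ChainMap


def make_lookup(user_overrides, curated, db_first_reading):
    """Build a lookup function: user overrides > curated file > database.

    db_first_reading is a callable standing in for the existing
    "first entry in chinese.db" behaviour (hypothetical).
    """
    layers = ChainMap(user_overrides, curated)  # earlier maps take priority

    def lookup(text):
        if text in layers:
            return layers[text]
        return db_first_reading(text)

    return lookup


curated = {"说": "shuo1", "都": "dou1", "行": "hang2"}  # example entries
user = {"行": "xing2"}                                  # a user's preference
lookup = make_lookup(user, curated, lambda text: "<db reading for %s>" % text)

print(lookup("行"))  # xing2
print(lookup("说"))  # shuo1
```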
This is a hard problem. There are quite a few common polyphones that are commonly used with multiple pronunciations. And even Google hasn't really got it figured out. On Google Translate, 行行 gets pronounced as "xing2xing2" but the pinyin transcription is "hang2hang2"! But apparently it should be hang4hang4 according to https://zd.hwxnet.com/search.do?keyword=%E8%A1%8C&x=0&y=0 .
My wife is Chinese so I tasked her with picking the most common pronunciation for a bunch of single character polyphones. It was rather amusing to watch her balk at the task when I gave her the list without accounting for frequency. In English we have tons of rarely used words, but they are usually long and unambiguous (but often difficult) to pronounce. Not so in Chinese, apparently. After I sorted the list by frequency she got a lot of them done and just left a few that she insists are pretty even:
hanzi | most common | alt1 | alt2 | alt3 | alt4 | alt5 |
---|---|---|---|---|---|---|
处 | chu3 | chu4 | ||||
行 | hang2 | xing2 | ||||
数 | shu3 | shu4 | shuo4 | |||
干 | gan1 | gan4 | ||||
量 | liang2 | liang4 |
There are more she left blank but I'm not sure if she just hasn't got to them yet (they're all over rank 2000 in frequency).
Here are her choices for the others:

hanzi | most common | alt1 | alt2 | alt3 | alt4 | alt5 |
---|---|---|---|---|---|---|
的 | de5 | de5 | di1 | di2 | di4 | |
和 | he2 | he2 | he4 | hu2 | huo2 | huo4 |
了 | le5 | le5 | liao3 | liao4 | ||
为 | wei4 | wei2 | wei4 | |||
将 | jiang1 | jiang1 | jiang4 | qiang1 | ||
说 | shuo1 | shui4 | shuo1 | |||
与 | yu3 | yu2 | yu3 | yu4 | ||
上 | shang4 | shang3 | shang4 | |||
大 | da4 | da4 | dai4 | |||
要 | yao4 | yao1 | yao4 | |||
地 | de5 | de5 | di4 | |||
据 | ju4 | ju4 | ju1 | |||
着 | zhe5 | zhao1 | zhao2 | zhe5 | zhuo2 | |
种 | zhong3 | zhong3 | zhong4 | |||
把 | ba4 | ba3 | ba4 | |||
比 | bi3 | bi3 | bi1 | |||
好 | hao3 | hao3 | hao4 | |||
同 | tong2 | tong2 | tong4 | |||
分 | fen1 | fen1 | fen4 | |||
更 | geng4 | geng1 | geng4 | |||
会 | hui4 | hui4 | kuai4 | |||
可 | ke3 | ke3 | ke4 | |||
几 | ji3 | ji1 | ji3 | |||
给 | gei3 | gei3 | ji3 | |||
场 | chang3 | chang2 | chang3 | |||
占 | zhan4 | zhan4 | zhan1 | |||
得 | de2 | de2 | de5 | dei3 | ||
长 | chang2 | chang2 | zhang3 | |||
作 | zuo4 | zuo1 | zuo4 | |||
阿 | a1 | a1 | e1 | |||
号 | hao4 | hao2 | hao4 | |||
只 | zhi3 | zhi3 | zhi1 | |||
看 | kan4 | kan1 | kan4 | |||
正 | zheng4 | zheng1 | zheng4 | |||
强 | qiang2 | jiang4 | qiang3 | qiang2 | ||
间 | jian1 | jian1 | jian4 | |||
卡 | ka3 | ka3 | qia3 | |||
当 | dang1 | dang1 | dang4 | |||
打 | da3 | da2 | da3 | |||
底 | di3 | de5 | di3 | |||
夫 | fu1 | fu1 | fu2 | |||
称 | cheng1 | chen4 | cheng1 | cheng4 | ||
便 | bian4 | bian4 | pian2 | |||
头 | tou2 | tou2 | tou5 | |||
那 | na4 | na3 | na4 | |||
少 | shao3 | shao3 | shao4 | |||
难 | nan4 | nan2 | nan4 | |||
发 | fa1 | fa1 | fa4 | |||
令 | ling4 | ling2 | ling3 | ling4 | ||
重 | zhong4 | chong2 | zhong4 | |||
创 | chuang4 | chuang4 | chuang1 | |||
没 | mei2 | mei2 | mo4 | |||
率 | shuai4 | lü4 | shuai4 | |||
奇 | qi2 | ji1 | qi2 | |||
塞 | sai1 | sai1 | sai4 | se4 | ||
华 | hua2 | hua2 | hua1 | |||
勒 | le4 | le4 | lei1 | |||
度 | du4 | du4 | duo2 | |||
差 | cha1 | cha1 | cha4 | chai1 | ||
边 | bian1 | bian1 | bian5 | |||
化 | hua4 | hua1 | hua4 | |||
通 | tong1 | tong1 | tong4 | |||
合 | he2 | ge3 | he2 | |||
跑 | pao3 | pao2 | pao3 | |||
菲 | fei1 | fei1 | fei3 | |||
见 | jian4 | jian4 | xian4 | |||
降 | jiang4 | jiang4 | xiang2 | |||
子 | zi3 | zi3 | zi5 | |||
扎 | zha1 | zha1 | zha2 | za1 | ||
草 | cao3 | cao3 | cao4 | |||
远 | yuan3 | yuan3 | yuan4 | |||
供 | gong1 | gong1 | gong4 | |||
耶 | ye1 | ye1 | ye2 | ye5 | ||
什 | shen2 | shen2 | shi2 | |||
朝 | chao2 | chao2 | zhao1 | |||
假 | jia3 | gei1 | jia3 | jia4 | ||
听 | ting1 | ting1 | ting4 | |||
尽 | jin4 | jin3 | jin4 | |||
呢 | ne5 | ne5 | ni2 | |||
藏 | cang2 | zang4 | cang2 | |||
转 | zhuan3 | zhuai3 | zhuan3 | zhuan4 | ||
蒙 | meng2 | meng2 | meng1 | meng3 | ||
价 | jia4 | jia4 | jie5 | |||
板 | ban3 | ban3 | pan4 | |||
调 | diao4 | diao4 | tiao2 | |||
压 | ya1 | ya1 | ya4 | |||
宁 | ning2 | ning4 | ning2 | |||
句 | ju4 | gou1 | ju4 | |||
抢 | qiang3 | qiang1 | qiang3 | |||
足 | zu2 | ju4 | zu2 | |||
空 | kong1 | kong1 | kong4 | |||
论 | lun4 | lun2 | lun4 | |||
卷 | juan3 | juan3 | juan4 | |||
吗 | ma5 | ma3 | ma5 | |||
传 | chuan2 | chuan2 | zhuan4 | |||
般 | ban1 | ban1 | pan2 | |||
雨 | yu3 | yu3 | yu4 | |||
追 | zhui1 | dui1 | zhui1 | |||
脚 | jiao3 | jiao3 | jue2 | |||
校 | xiao4 | jiao4 | xiao4 | |||
圈 | quan1 | juan1 | juan4 | quan1 | ||
娜 | na4 | na4 | nuo2 | |||
落 | luo4 | la4 | lao4 | luo4 | ||
倒 | dao3 | dao3 | dao4 | |||
伯 | bo2 | ba4 | bai3 | bo2 | ||
结 | jie2 | jie1 | jie2 | |||
冲 | chong1 | chong1 | chong4 | |||
待 | dai4 | dai1 | dai4 | |||
鸟 | niao3 | diao3 | niao3 | |||
采 | cai3 | cai3 | cai4 | |||
读 | du2 | dou4 | du2 | |||
哪 | na3 | na3 | na5 | nei3 | ||
杆 | gan1 | gan1 | gan3 | |||
丽 | li2 | li2 | li4 | |||
氏 | shi4 | shi4 | zhi1 | |||
堡 | bao3 | bao3 | pu4 | |||
色 | se4 | se4 | shai3 | |||
畜 | chu4 | chu4 | xu4 | |||
铺 | pu4 | pu4 | pu1 | |||
佛 | fo2 | fo2 | fu2 | |||
炸 | zha4 | zha2 | zha4 | |||
免 | mian3 | mian3 | wen4 | |||
答 | da2 | da1 | da2 | |||
骑 | qi2 | ji4 | qi2 | |||
载 | zai3 | zai3 | zai4 | |||
喝 | he1 | he1 | he4 | |||
幢 | zhuang4 | chuang2 | zhuang4 | |||
背 | bei1 | bei1 | bei4 | |||
页 | ye4 | xie2 | ye4 | |||
奔 | ben1 | ben1 | ben4 | |||
症 | zheng4 | zheng4 | zheng1 | |||
横 | heng2 | heng2 | heng4 | |||
食 | shi2 | shi2 | si4 | |||
尾 | wei3 | wei3 | yi3 | |||
稍 | shao1 | shao1 | shao4 | |||
术 | shu4 | shu4 | zhu2 | |||
语 | yu3 | yu3 | yu4 | |||
累 | lei4 | lei3 | lei4 | |||
趟 | tang4 | tang1 | tang4 | |||
亲 | qin1 | qin1 | qing4 | |||
划 | hua2 | hua2 | hua4 | |||
涌 | yong3 | chong1 | yong3 | |||
散 | san4 | san4 | san3 | |||
担 | dan4 | dan1 | dan4 | |||
赚 | zhuan4 | zhuan4 | zuan4 | |||
服 | fu2 | fu2 | fu4 | |||
操 | cao1 | cao1 | cao4 | |||
解 | jie3 | jie3 | jie4 | |||
咱 | za2 | zan2 | za2 | |||
岭 | ling3 | ling2 | ling3 | |||
鲜 | xian1 | xian3 | xian1 | |||
弹 | tan2 | dan4 | tan2 | |||
涨 | zhang3 | zhang3 | zhang4 | |||
甚 | shen4 | shen2 | shen4 | |||
薄 | bao2 | bao2 | bo4 | |||
撒 | sa3 | sa1 | sa3 | |||
斗 | dou3 | dou3 | dou4 | |||
禁 | jin4 | jin1 | jin4 | |||
泥 | ni2 | ni2 | ni4 | |||
挑 | tiao1 | tiao1 | tiao3 | |||
埋 | mai2 | mai2 | man2 | |||
钻 | zuan4 | zuan4 | zuan1 | |||
折 | zhe2 | she2 | zhe1 | zhe2 | ||
监 | jian1 | jian1 | jian4 | |||
挣 | zheng4 | zheng1 | zheng4 | |||
挡 | dang3 | dang3 | dang4 | |||
摸 | mo1 | mo1 | mo2 | |||
炮 | pao4 | bao1 | pao2 | pao4 | ||
参 | can1 | can1 | shen1 | |||
辟 | pi4 | bi4 | pi4 | |||
似 | si4 | shi4 | si4 | |||
屯 | tun2 | tun2 | zhun1 | |||
虾 | xia1 | ha2 | xia1 | |||
弄 | nong4 | long4 | nong4 | |||
秘 | mi4 | mi4 | bi4 | |||
喷 | pen1 | pen1 | pen4 | |||
荷 | he2 | he2 | he4 | |||
泡 | pao4 | pao1 | pao4 | |||
浅 | qian3 | jian1 | qian3 | |||
否 | fou3 | fou3 | pi3 | |||
混 | hun4 | hun2 | hun4 | |||
匹 | pi3 | pi1 | pi3 | |||
磨 | mo2 | mo2 | mo4 | |||
沈 | shen3 | chen2 | shen3 | |||
模 | mo2 | mo2 | mu2 | |||
夹 | jia2 | jia1 | jia2 | jia4 | ||
踏 | ta4 | ta1 | ta4 | |||
渐 | jian4 | jian1 | jian4 | |||
尺 | chi3 | che3 | chi3 | |||
扛 | kang2 | gang1 | kang2 | |||
喂 | wei4 | wei2 | wei4 | |||
浆 | jiang1 | jiang1 | jiang4 | |||
轴 | zhou2 | zhou2 | zhou4 | |||
谜 | mi2 | mei4 | mi2 | |||
档 | dang3 | dang3 | dang4 | |||
拚 | pin1 | pan4 | pin1 | |||
脏 | zang1 | zang4 | zang1 | |||
歪 | wai1 | wai1 | wai3 | |||
莎 | sha1 | sha1 | suo1 | |||
扫 | sao3 | sao3 | sao4 | |||
沉 | chen2 | chen1 | chen2 | |||
识 | shi2 | shi2 | zhi4 | |||
侧 | ce4 | ce4 | zhai1 | |||
仔 | zai3 | zai3 | zi1 | zi3 | ||
恶 | e4 | e3 | e4 | wu4 | ||
晃 | huang4 | huang3 | huang4 | |||
尿 | niao4 | niao4 | sui1 | |||
臭 | chou4 | chou4 | xiu4 | |||
饮 | yin3 | yin3 | yin4 | |||
柜 | gui4 | gui4 | ju3 | |||
脉 | mai4 | mai4 | mo4 | |||
芯 | xin1 | xin1 | xin4 | |||
凉 | liang2 | liang2 | liang4 | |||
妻 | qi1 | qi1 | qi4 | |||
挨 | ai1 | ai1 | ai2 | |||
缝 | feng2 | feng2 | feng4 | |||
腊 | la4 | xi1 | la4 | |||
刹 | sha1 | cha4 | sha1 | |||
罢 | ba4 | ba4 | ba5 | |||
宿 | su4 | su4 | xiu3 | xiu4 | ||
刷 | shua1 | shua1 | shua4 | |||
俊 | jun4 | jun4 | zun4 | |||
膏 | gao1 | gao1 | gao4 | |||
抹 | mo3 | ma1 | mo3 | mo4 | ||
削 | xiao1 | xiao1 | xue1 | |||
唯 | wei2 | wei2 | wei3 | |||
咋 | za3 | za3 | ze2 | zha4 | ||
旋 | xuan2 | xuan2 | xuan4 | |||
钉 | ding1 | ding1 | ding4 | |||
拾 | shi2 | she4 | shi2 | |||
澄 | cheng2 | cheng2 | deng4 | |||
掺 | chan1 | chan1 | shan3 | |||
桔 | ju2 | jie2 | ju2 | |||
杠 | gang4 | gang1 | gang4 | |||
锯 | ju4 | ju1 | ju4 | |||
厕 | ce4 | ce4 | si4 | |||
匾 | bian3 | bian3 | pian2 | |||
揣 | chuai3 | chuai1 | chuai3 | |||
弟 | di4 | di4 | ti4 | |||
笼 | long2 | long2 | long3 | |||
衰 | shuai1 | cui1 | shuai1 | |||
胖 | pang4 | pan2 | pang4 | |||
汞 | gong3 | gong3 | hong4 | |||
漂 | piao1 | piao1 | piao3 | piao4 | ||
殷 | yin3 | yan1 | yin3 | |||
劈 | pi1 | pi1 | pi3 | |||
驮 | tuo2 | tuo2 | duo4 | |||
刨 | pao2 | bao4 | pao2 | |||
粥 | zhou1 | yu4 | zhou1 | |||
朴 | pu3 | po4 | pu3 | |||
估 | gu1 | gu1 | gu4 | |||
熬 | ao2 | ao1 | ao2 | |||
隐 | yin3 | yin3 | yin4 | |||
卒 | zu2 | zu2 | cu4 | |||
么 | me5 | ma2 | ma5 | me5 | ||
叉 | cha1 | cha1 | cha2 | cha3 | ||
舌 | she2 | ji1 | she2 | |||
攒 | zan3 | cuan2 | zan3 | |||
楞 | leng4 | leng2 | leng4 | |||
雀 | que4 | qiao1 | que4 | |||
荡 | dang4 | dang4 | tang4 | |||
淋 | lin2 | lin2 | lin4 | |||
缉 | ji1 | ji1 | qi1 | |||
凹 | ao1 | ao1 | wa1 |
So does it seem reasonable to put these in a table in the add-on, which it would check first for a transcription before falling back on the first entry in chinese.db? I can also do two-character polyphones.
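If it helps, the table can be turned into a lookup dict mechanically. This is only a sketch that assumes the pipe-separated layout used in this thread (hanzi in the first column, the chosen reading in the second); `parse_override_table` is a hypothetical helper.

```python
from typing import Dict


def parse_override_table(markdown: str) -> Dict[str, str]:
    """Map each hanzi to the reading in the "most common" column.

    Header and separator rows are skipped, as are rows whose second
    column is empty (no preferred reading chosen).
    """
    overrides = {}
    for line in markdown.splitlines():
        row = line.strip()
        if row.startswith("|"):  # tolerate an optional leading pipe
            row = row[1:]
        cells = [cell.strip() for cell in row.split("|")]
        if len(cells) < 2 or not cells[0]:
            continue
        if cells[0] == "hanzi" or set(cells[0]) <= {"-", ":"}:
            continue
        if cells[1]:
            overrides[cells[0]] = cells[1]
    return overrides


sample = """hanzi | most common | alt1 | alt2 | alt3 | alt4 | alt5 |
---|---|---|---|---|---|---|
说 | shuo1 | shui4 | shuo1 | |||
长 | chang2 | chang2 | zhang3 | |||"""
print(parse_override_table(sample))  # {'说': 'shuo1', '长': 'chang2'}
```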
First of all, thanks for your great work on this add-on! It is a life saver.
Describe the bug
For some reason, some characters like 说 seem to get assigned a reading... randomly. Sometimes it gets the shuō reading and sometimes the shuì reading, and I can't find a way to control which one it gets. I'd expect 说 to default to shuō in sentences like 你跟他说, 什么说呀, 把话说清楚, 说一下你的地址 or 说吧, but I am getting shuì for these. I can search and replace shuì with shuō, and running bulk-fill transcriptions again then changes the color of the pinyin, but not the Color field. I have to manually go to the Reading field and tab away from it to make it update the Color field. This happened in about 170 of the 274 cards containing 说.
The dū reading also appears for 都 in all cases, I think, but that is easier to fix since it doesn't affect the tone.
To Reproduce
Steps to reproduce the behavior:
1. Add a sentence with the character 说
2. Bulk-fill transcriptions
3. Some 说 will get the shuì reading and be colored as 4th tone
4. Changing the tone in the Reading field changes the reading color, but not the hanzi color
Expected behavior
I think the shuō reading should be the default for most cases.
Specs (please complete the following information):
Something similar was reported here: https://github.com/luoliyan/chinese-support-redux/issues/144