jdlorimer / chinese-support-redux

Anki add-on providing support for Chinese study
https://ankiweb.net/shared/info/1128979221
GNU General Public License v3.0

Some characters get assigned incorrect readings randomly #173

Open soliviantar opened 3 years ago

soliviantar commented 3 years ago

First of all, thanks for your great work on this add-on! It is a life saver.

Describe the bug
For some reason, some characters like 说 seem to get assigned a reading... randomly. Sometimes it gets the shuō reading and sometimes the shuì reading, and I can't find a way to control which one it gets. I suppose 说 should default to shuō in sentences like 你跟他说, 什么说呀, 把话说清楚, 说一下你的地址 or 说吧, but I am getting shuì for these.

I can run a search and replace to change the shuì readings to shuō, and running bulk-fill transcriptions again will change the color of the pinyin, but not of the Color field. I have to manually go to the Reading field and tab away from it to make the Color field update. This happened in about 170 of the 274 cards containing 说.

The dū reading also appears for 都 in all cases, I think, but that is easier to fix since it doesn't affect tone.

To Reproduce
Steps to reproduce the behavior:
1. Add a sentence containing the character 说.
2. Bulk-fill transcriptions.
3. Some 说 will get the shuì reading and be colored as 4th tone.
4. Changing the tone in the Reading field changes the reading color, but not the hanzi color.

Expected behavior
I think the shuō reading should be the default for most cases.

Specs (please complete the following information):

Something similar was reported here: https://github.com/luoliyan/chinese-support-redux/issues/144

jdlorimer commented 3 years ago

It's not random, exactly. I'm pretty sure I know why it's happening - it's pulling all the dictionary entries into an array and then selecting the first reading if there's more than one.

It shouldn't happen, but I've never settled on a strategy for dealing with ambiguous entries. The solution will probably involve using some frequency data to weight the choices.
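To make the difference concrete, here is a minimal Python sketch contrasting "take the first reading" with a frequency-weighted choice. It is only an illustration: `READING_FREQ`, `pick_first` and `pick_weighted` are hypothetical names, and the frequency numbers are invented, not taken from any corpus or from the add-on's code.

```python
from typing import Dict, List

# Hypothetical relative frequencies per character and reading.
# The numbers are made up purely for illustration.
READING_FREQ: Dict[str, Dict[str, float]] = {
    "说": {"shuo1": 0.98, "shui4": 0.02},
    "都": {"dou1": 0.95, "du1": 0.05},
}


def pick_first(readings: List[str]) -> str:
    """The behaviour described above: whichever reading happens to come first."""
    return readings[0]


def pick_weighted(hanzi: str, readings: List[str]) -> str:
    """Prefer the reading with the highest known frequency.

    Readings without frequency data count as 0.0. max() returns the first
    maximal item, so with no data at all this degrades to pick_first().
    """
    freq = READING_FREQ.get(hanzi, {})
    return max(readings, key=lambda reading: freq.get(reading, 0.0))
```

With the dictionary order `["shui4", "shuo1"]`, `pick_first` returns `shui4`, while `pick_weighted("说", ...)` returns `shuo1`. The hard part remains finding real frequency data to fill the table.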

I'll take another look at this soon.

noinkling commented 3 years ago

I think you need to remove/exclude/bypass single-character words from the word dictionary. That way it can fall back to the individual hanzi data (which provides the most common reading).
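Roughly, the lookup could be sketched like this in Python. The two dictionaries are hypothetical stand-ins for the word dictionary and the per-hanzi data, and `lookup_reading` is an illustrative name, not a function from the add-on.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the word dictionary. Single-character entries
# carry every reading, in no meaningful order.
WORD_READINGS: Dict[str, List[str]] = {
    "说": ["shui4", "shuo1"],
    "说话": ["shuo1 hua4"],
}

# Hypothetical stand-in for the hanzi data, which stores one
# most-common reading per character.
HANZI_READING: Dict[str, str] = {"说": "shuo1", "话": "hua4"}


def lookup_reading(text: str) -> Optional[str]:
    """Use the word dictionary only for multi-character words."""
    if len(text) > 1:
        readings = WORD_READINGS.get(text)
        return readings[0] if readings else None
    # Single characters bypass the word dictionary entirely.
    return HANZI_READING.get(text)
```

Here `lookup_reading("说")` never sees the ambiguous word-dictionary entry and returns `shuo1` from the hanzi data.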

noinkling commented 3 years ago

There are also some multi-character words with multiple readings, so if you wanted to be thorough you could try and get some reading frequency data for those and/or use some kind of heuristic to pick the best one.

Here's a query that can help examine:

SELECT * FROM (
    SELECT *, count(*) OVER (PARTITION BY simplified) AS count
    FROM cidian c
    WHERE length(simplified) > 1
)
WHERE count > 2
ORDER BY simplified, traditional, pinyin;

Note that a lot of those rows have the same pinyin reading, just minor differences in notation format, so you/we might want to improve on the normalization in the update script so they're considered equal.
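As a rough idea of what that normalization might look like, here is a small Python sketch. The specific differences it handles (letter case, stray whitespace, and `u:` or `v` written for `ü`) are assumptions about what the notation variants are; I haven't checked them against the actual rows, and `normalize_pinyin` and `distinct_readings` are hypothetical helper names.

```python
import re
from typing import Dict, List


def normalize_pinyin(pinyin: str) -> str:
    """Reduce a numbered-pinyin string to one canonical spelling."""
    text = pinyin.strip().lower()
    # Assumed variants: "u:" and "v" are both common ASCII spellings of "ü".
    text = text.replace("u:", "ü").replace("v", "ü")
    # Collapse any run of whitespace between syllables to a single space.
    return re.sub(r"\s+", " ", text)


def distinct_readings(readings: List[str]) -> List[str]:
    """Return the readings that are still different after normalization."""
    seen: Dict[str, str] = {}
    for reading in readings:
        seen.setdefault(normalize_pinyin(reading), reading)
    return list(seen)
```

Rows whose pinyin collapses to the same normalized string would then count as a single reading.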

joeminicucci commented 3 years ago

@noinkling

Based on the data available in CC-CEDICT, I don't believe any heuristic is possible with what we currently have. Is there another data source we can pull from to check lexical usage frequency?

chambm commented 2 years ago

I ran into this problem as well. Finding comprehensive frequency data based on pronunciation may be very difficult. Instead, or in the meantime, we could add a manually curated file that sets overrides for common polyphonic words and characters: shuo1, dou1, he2, etc. 啊 has literally every tone in chinese.db (a, a1, a2, a3, a4, a5)! The manually curated file would pick a canonical pinyin for each character/word, and we could also allow users to override that.
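A minimal Python sketch of how such a file could be layered, assuming a simple tab-separated format of one `hanzi<TAB>pinyin` pair per line. The format, the sample entries and the function names are all assumptions for illustration, not anything the add-on currently has.

```python
import csv
import io
from typing import Dict, List, Tuple


def to_tsv(pairs: List[Tuple[str, str]]) -> str:
    """Build TSV text from (hanzi, pinyin) pairs, standing in for a file on disk."""
    return "\n".join("\t".join(pair) for pair in pairs)


# Hypothetical curated defaults shipped with the add-on.
BUNDLED_TSV = to_tsv([("说", "shuo1"), ("都", "dou1"), ("和", "he2")])
# Hypothetical user file; its entries take precedence.
USER_TSV = to_tsv([("都", "du1")])


def parse_overrides(tsv_text: str) -> Dict[str, str]:
    """Parse 'hanzi<TAB>pinyin' lines, skipping blank or malformed ones."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def merged_overrides(bundled: str, user: str) -> Dict[str, str]:
    """Start from the curated defaults, then apply the user's choices on top."""
    overrides = parse_overrides(bundled)
    overrides.update(parse_overrides(user))
    return overrides
```

With these sample files, the merged table keeps `shuo1` for 说 and `he2` for 和 from the curated file, while the user's `du1` replaces the curated `dou1` for 都.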

chambm commented 2 years ago

This is a hard problem. Quite a few common polyphones are regularly used with more than one pronunciation, and even Google hasn't really got it figured out. On Google Translate, 行行 gets pronounced as "xing2xing2" but the pinyin transcription is "hang2hang2"! But apparently it should be hang4hang4 according to https://zd.hwxnet.com/search.do?keyword=%E8%A1%8C&x=0&y=0 .

My wife is Chinese so I tasked her with picking the most common pronunciation for a bunch of single character polyphones. It was rather amusing to watch her balk at the task when I gave her the list without accounting for frequency. In English we have tons of rarely used words, but they are usually long and unambiguous (but often difficult) to pronounce. Not so in Chinese, apparently. After I sorted the list by frequency she got a lot of them done and just left a few that she insists are pretty even:

hanzi most common alt1 alt2 alt3 alt4 alt5
  chu3 chu4      
  hang2 xing2      
  shu3 shu4 shuo4    
  gan1 gan4      
  liang2 liang4      

There are more she left blank but I'm not sure if she just hasn't got to them yet (they're all over rank 2000 in frequency).

Here are her choices for the others:

hanzi most common alt1 alt2 alt3 alt4 alt5
de5 de5 di1 di2 di4  
he2 he2 he4 hu2 huo2 huo4
le5 le5 liao3 liao4    
wei4 wei2 wei4      
jiang1 jiang1 jiang4 qiang1    
shuo1 shui4 shuo1      
yu3 yu2 yu3 yu4    
shang4 shang3 shang4      
da4 da4 dai4      
yao4 yao1 yao4      
de5 de5 di4      
ju4 ju4 ju1      
zhe5 zhao1 zhao2 zhe5 zhuo2  
zhong3 zhong3 zhong4      
ba4 ba3 ba4      
bi3 bi3 bi1      
hao3 hao3 hao4      
tong2 tong2 tong4      
fen1 fen1 fen4      
geng4 geng1 geng4      
hui4 hui4 kuai4      
ke3 ke3 ke4      
ji3 ji1 ji3      
gei3 gei3 ji3      
chang3 chang2 chang3      
zhan4 zhan4 zhan1      
de2 de2 de5 dei3    
chang2 chang2 zhang3      
zuo4 zuo1 zuo4      
a1 a1 e1      
hao4 hao2 hao4      
zhi3 zhi3 zhi1      
kan4 kan1 kan4      
zheng4 zheng1 zheng4      
qiang2 jiang4 qiang3 qiang2    
jian1 jian1 jian4      
ka3 ka3 qia3      
dang1 dang1 dang4      
da3 da2 da3      
di3 de5 di3      
fu1 fu1 fu2      
cheng1 chen4 cheng1 cheng4    
便 bian4 bian4 pian2      
tou2 tou2 tou5      
na4 na3 na4      
shao3 shao3 shao4      
nan4 nan2 nan4      
fa1 fa1 fa4      
ling4 ling2 ling3 ling4    
zhong4 chong2 zhong4      
chuang4 chuang4 chuang1      
mei2 mei2 mo4      
shuai4 lü4 shuai4      
qi2 ji1 qi2      
sai1 sai1 sai4 se4    
hua2 hua2 hua1      
le4 le4 lei1      
du4 du4 duo2      
cha1 cha1 cha4 chai1    
bian1 bian1 bian5      
hua4 hua1 hua4      
tong1 tong1 tong4      
he2 ge3 he2      
pao3 pao2 pao3      
fei1 fei1 fei3      
jian4 jian4 xian4      
jiang4 jiang4 xiang2      
zi3 zi3 zi5      
zha1 zha1 zha2 za1    
cao3 cao3 cao4      
yuan3 yuan3 yuan4      
gong1 gong1 gong4      
ye1 ye1 ye2 ye5    
shen2 shen2 shi2      
chao2 chao2 zhao1      
jia3 gei1 jia3 jia4    
ting1 ting1 ting4      
jin4 jin3 jin4      
ne5 ne5 ni2      
cang2 zang4 cang2      
zhuan3 zhuai3 zhuan3 zhuan4    
meng2 meng2 meng1 meng3    
jia4 jia4 jie5      
ban3 ban3 pan4      
diao4 diao4 tiao2      
ya1 ya1 ya4      
ning2 ning4 ning2      
ju4 gou1 ju4      
qiang3 qiang1 qiang3      
zu2 ju4 zu2      
kong1 kong1 kong4      
lun4 lun2 lun4      
juan3 juan3 juan4      
ma5 ma3 ma5      
chuan2 chuan2 zhuan4      
ban1 ban1 pan2      
yu3 yu3 yu4      
zhui1 dui1 zhui1      
jiao3 jiao3 jue2      
xiao4 jiao4 xiao4      
quan1 juan1 juan4 quan1    
na4 na4 nuo2      
luo4 la4 lao4 luo4    
dao3 dao3 dao4      
bo2 ba4 bai3 bo2    
jie2 jie1 jie2      
chong1 chong1 chong4      
dai4 dai1 dai4      
niao3 diao3 niao3      
cai3 cai3 cai4      
du2 dou4 du2      
na3 na3 na5 nei3    
gan1 gan1 gan3      
li2 li2 li4      
shi4 shi4 zhi1      
bao3 bao3 pu4      
se4 se4 shai3      
chu4 chu4 xu4      
pu4 pu4 pu1      
fo2 fo2 fu2      
zha4 zha2 zha4      
mian3 mian3 wen4      
da2 da1 da2      
qi2 ji4 qi2      
zai3 zai3 zai4      
he1 he1 he4      
zhuang4 chuang2 zhuang4      
bei1 bei1 bei4      
ye4 xie2 ye4      
ben1 ben1 ben4      
zheng4 zheng4 zheng1      
heng2 heng2 heng4      
shi2 shi2 si4      
wei3 wei3 yi3      
shao1 shao1 shao4      
shu4 shu4 zhu2      
yu3 yu3 yu4      
lei4 lei3 lei4      
tang4 tang1 tang4      
qin1 qin1 qing4      
hua2 hua2 hua4      
yong3 chong1 yong3      
san4 san4 san3      
dan4 dan1 dan4      
zhuan4 zhuan4 zuan4      
fu2 fu2 fu4      
cao1 cao1 cao4      
jie3 jie3 jie4      
za2 zan2 za2      
ling3 ling2 ling3      
xian1 xian3 xian1      
tan2 dan4 tan2      
zhang3 zhang3 zhang4      
shen4 shen2 shen4      
bao2 bao2 bo4      
sa3 sa1 sa3      
dou3 dou3 dou4      
jin4 jin1 jin4      
ni2 ni2 ni4      
tiao1 tiao1 tiao3      
mai2 mai2 man2      
zuan4 zuan4 zuan1      
zhe2 she2 zhe1 zhe2    
jian1 jian1 jian4      
zheng4 zheng1 zheng4      
dang3 dang3 dang4      
mo1 mo1 mo2      
pao4 bao1 pao2 pao4    
can1 can1 shen1      
pi4 bi4 pi4      
si4 shi4 si4      
tun2 tun2 zhun1      
xia1 ha2 xia1      
nong4 long4 nong4      
mi4 mi4 bi4      
pen1 pen1 pen4      
he2 he2 he4      
pao4 pao1 pao4      
qian3 jian1 qian3      
fou3 fou3 pi3      
hun4 hun2 hun4      
pi3 pi1 pi3      
mo2 mo2 mo4      
shen3 chen2 shen3      
mo2 mo2 mu2      
jia2 jia1 jia2 jia4    
ta4 ta1 ta4      
jian4 jian1 jian4      
chi3 che3 chi3      
kang2 gang1 kang2      
wei4 wei2 wei4      
jiang1 jiang1 jiang4      
zhou2 zhou2 zhou4      
mi2 mei4 mi2      
dang3 dang3 dang4      
pin1 pan4 pin1      
zang1 zang4 zang1      
wai1 wai1 wai3      
sha1 sha1 suo1      
sao3 sao3 sao4      
chen2 chen1 chen2      
shi2 shi2 zhi4      
ce4 ce4 zhai1      
zai3 zai3 zi1 zi3    
e4 e3 e4 wu4    
huang4 huang3 huang4      
尿 niao4 niao4 sui1      
chou4 chou4 xiu4      
yin3 yin3 yin4      
gui4 gui4 ju3      
mai4 mai4 mo4      
xin1 xin1 xin4      
liang2 liang2 liang4      
qi1 qi1 qi4      
ai1 ai1 ai2      
feng2 feng2 feng4      
la4 xi1 la4      
sha1 cha4 sha1      
ba4 ba4 ba5      
宿 su4 su4 xiu3 xiu4    
shua1 shua1 shua4      
jun4 jun4 zun4      
gao1 gao1 gao4      
mo3 ma1 mo3 mo4    
xiao1 xiao1 xue1      
wei2 wei2 wei3      
za3 za3 ze2 zha4    
xuan2 xuan2 xuan4      
ding1 ding1 ding4      
shi2 she4 shi2      
cheng2 cheng2 deng4      
chan1 chan1 shan3      
ju2 jie2 ju2      
gang4 gang1 gang4      
ju4 ju1 ju4      
ce4 ce4 si4      
bian3 bian3 pian2      
chuai3 chuai1 chuai3      
di4 di4 ti4      
long2 long2 long3      
shuai1 cui1 shuai1      
pang4 pan2 pang4      
gong3 gong3 hong4      
piao1 piao1 piao3 piao4    
yin3 yan1 yin3      
pi1 pi1 pi3      
tuo2 tuo2 duo4      
pao2 bao4 pao2      
zhou1 yu4 zhou1      
pu3 po4 pu3      
gu1 gu1 gu4      
ao2 ao1 ao2      
yin3 yin3 yin4      
zu2 zu2 cu4      
me5 ma2 ma5 me5    
cha1 cha1 cha2 cha3    
she2 ji1 she2      
zan3 cuan2 zan3      
leng4 leng2 leng4      
que4 qiao1 que4      
dang4 dang4 tang4      
lin2 lin2 lin4      
ji1 ji1 qi1      
ao1 ao1 wa1      

So does it seem reasonable to put these in a table in the add-on, which it would check first for a transcription before falling back on the first entry in chinese.db? I can also do two-character polyphones.
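Something like the following Python sketch is what I have in mind. The `cidian` table and its `simplified`, `traditional` and `pinyin` columns are taken from the query earlier in this thread; everything else (the `PREFERRED` table, the demo rows, the function names) is hypothetical, and the in-memory database only stands in for chinese.db.

```python
import sqlite3
from typing import Dict, Optional

# Hypothetical preferred-reading table, filled from the list above.
PREFERRED: Dict[str, str] = {"说": "shuo1", "便": "bian4"}


def make_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for chinese.db."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cidian (simplified TEXT, traditional TEXT, pinyin TEXT)"
    )
    conn.executemany(
        "INSERT INTO cidian VALUES (?, ?, ?)",
        [("说", "說", "shui4"), ("说", "說", "shuo1"), ("猫", "貓", "mao1")],
    )
    return conn


def transcribe(conn: sqlite3.Connection, word: str) -> Optional[str]:
    """Check the preferred table first, then fall back to the first DB row."""
    if word in PREFERRED:
        return PREFERRED[word]
    row = conn.execute(
        "SELECT pinyin FROM cidian WHERE simplified = ? ORDER BY rowid LIMIT 1",
        (word,),
    ).fetchone()
    return row[0] if row else None
```

In this demo, 说 resolves to `shuo1` from the table even though `shui4` comes first in the database, while 猫 has no override and still gets `mao1` from the fallback query.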