Some characters get assigned incorrect readings randomly

soliviantar commented 3 years ago

First of all, thanks for your great work on thie addon! It is a life saver.

Describe the bug For some reason, some characters like 说 seem to get assigned a reading... randomly. Sometimes it gets a shuō reading and sometimes the shuì reading. I can't find a way to control which one it'll get. I suppose 说 should have the default set to shuō in sentences like 你跟他说, 什么说呀, 把话说清楚, 说一下你的地址 or 说吧. But I am getting shuì for these. I can run a search and replace for the shui's and change it for shuo's and running bulk-fill transcriptions again will change the color of the pinyin, but not of the Color field. I have to manually go to the Reading field and tab away from it to make it update the Color field. This happened in about 170 cards out of 274 having 说. The dū reading also appears for 都 in all cases, I think. But that is easier to fix since it doesn't affect tone.

To Reproduce Steps to reproduce the behavior: Add sentence with character 说 Bulkfill transcriptions Some 说 will get shui reading and be colored as 4th tone Changing the tone in the Reading field changes the reading color, but not the hanzi color

Expected behavior I think the shuō reading should be the default for most cases.

Specs (please complete the following information):

OS: Windows 10
Anki Version 2.1.35
- Chinese Support Version 0.14.0

Something similar was reported here: https://github.com/luoliyan/chinese-support-redux/issues/144

jdlorimer commented 3 years ago

It's not random, exactly. I'm pretty sure I know why it's happening - it's pulling all the dictionary entries into an array and then selecting the first reading if there's more than one.

It shouldn't happen, but I've never settled on a strategy for dealing with ambiguous entries. The solution will probably involve using some frequency data to weight the choices.

I'll take another look at this soon.

noinkling commented 3 years ago

I think you need to remove/exclude/bypass single-character words from the word dictionary. That way it can fall back to the individual hanzi data (which provides the most common reading).

noinkling commented 3 years ago

There are also some multi-character words with multiple readings, so if you wanted to be thorough you could try and get some reading frequency data for those and/or use some kind of heuristic to pick the best one.

Here's a query that can help examine:

SELECT * FROM (
    SELECT *, count(*) OVER (PARTITION BY simplified) AS count
    FROM cidian c
    WHERE length(simplified) > 1
)
WHERE count > 2
ORDER BY simplified, traditional, pinyin;

Note that a lot of those rows have the same pinyin reading, just minor differences in notation format, so you/we might want to improve on the normalization in the update script so they're considered equal.

joeminicucci commented 3 years ago

@noinkling

There are also some multi-character words with multiple readings, so if you wanted to be thorough you could try and get some reading frequency data for those and/or use some kind of heuristic to pick the best one.

Here's a query that can help examine:
SELECT * FROM (
  SELECT *, count(*) OVER (PARTITION BY simplified) AS count
  FROM cidian c
  WHERE length(simplified) > 1
)
WHERE count > 2
ORDER BY simplified, traditional, pinyin;
Note that a lot of those rows have the same pinyin reading, just minor differences in notation format, so you/we might want to improve on the normalization in the update script so they're considered equal.

Based on the data available in ccedict, I do not believe there is a current heuristic available with what we currently have. Is there another data source we can pull to check lexical usage frequency?

chambm commented 2 years ago

I ran into this problem as well. Finding comprehensive frequency data based on pronunciation may be very difficult. Instead, or in the meantime, we could add a manually curated file that sets overrides for common polyphonic words and characters. Shou1, dou1, he2, etc. 啊 has literally every tone in chinese.db (a, a1, a2, a3, a4, a5)! The manually curated file would pick a canonical pinyin for each character/word and we could also allow users to override that.

chambm commented 2 years ago

This is a hard problem. There are quite a few common polyphones that are commonly used with multiple pronunciations. And even Google hasn't really got it figured out. On Google Translate, 行行 gets pronounced as "xing2xing2" but the pinyin transcription is "hang2hang2"! But apparently it should be hang4hang4 according to https://zd.hwxnet.com/search.do?keyword=%E8%A1%8C&x=0&y=0 .

My wife is Chinese so I tasked her with picking the most common pronunciation for a bunch of single character polyphones. It was rather amusing to watch her balk at the task when I gave her the list without accounting for frequency. In English we have tons of rarely used words, but they are usually long and unambiguous (but often difficult) to pronounce. Not so in Chinese, apparently. After I sorted the list by frequency she got a lot of them done and just left a few that she insists are pretty even:

hanzi	alt1	alt2	alt3
处	chu3	chu4
行	hang2	xing2
数	shu3	shu4	shuo4
干	gan1	gan4
量	liang2	liang4

There are more she left blank but I'm not sure if she just hasn't got to them yet (they're all over rank 2000 in frequency).

Here are her choices for the others: hanzi	most common	alt1	alt2	alt3	alt4	alt5
的	de5	de5	di1	di2	di4
和	he2	he2	he4	hu2	huo2	huo4
了	le5	le5	liao3	liao4
为	wei4	wei2	wei4
将	jiang1	jiang1	jiang4	qiang1
说	shuo1	shui4	shuo1
与	yu3	yu2	yu3	yu4
上	shang4	shang3	shang4
大	da4	da4	dai4
要	yao4	yao1	yao4
地	de5	de5	di4
据	ju4	ju4	ju1
着	zhe5	zhao1	zhao2	zhe5	zhuo2
种	zhong3	zhong3	zhong4
把	ba4	ba3	ba4
比	bi3	bi3	bi1
好	hao3	hao3	hao4
同	tong2	tong2	tong4
分	fen1	fen1	fen4
更	geng4	geng1	geng4
会	hui4	hui4	kuai4
可	ke3	ke3	ke4
几	ji3	ji1	ji3
给	gei3	gei3	ji3
场	chang3	chang2	chang3
占	zhan4	zhan4	zhan1
得	de2	de2	de5	dei3
长	chang2	chang2	zhang3
作	zuo4	zuo1	zuo4
阿	a1	a1	e1
号	hao4	hao2	hao4
只	zhi3	zhi3	zhi1
看	kan4	kan1	kan4
正	zheng4	zheng1	zheng4
强	qiang2	jiang4	qiang3	qiang2
间	jian1	jian1	jian4
卡	ka3	ka3	qia3
当	dang1	dang1	dang4
打	da3	da2	da3
底	di3	de5	di3
夫	fu1	fu1	fu2
称	cheng1	chen4	cheng1	cheng4
便	bian4	bian4	pian2
头	tou2	tou2	tou5
那	na4	na3	na4
少	shao3	shao3	shao4
难	nan4	nan2	nan4
发	fa1	fa1	fa4
令	ling4	ling2	ling3	ling4
重	zhong4	chong2	zhong4
创	chuang4	chuang4	chuang1
没	mei2	mei2	mo4
率	shuai4	lü4	shuai4
奇	qi2	ji1	qi2
塞	sai1	sai1	sai4	se4
华	hua2	hua2	hua1
勒	le4	le4	lei1
度	du4	du4	duo2
差	cha1	cha1	cha4	chai1
边	bian1	bian1	bian5
化	hua4	hua1	hua4
通	tong1	tong1	tong4
合	he2	ge3	he2
跑	pao3	pao2	pao3
菲	fei1	fei1	fei3
见	jian4	jian4	xian4
降	jiang4	jiang4	xiang2
子	zi3	zi3	zi5
扎	zha1	zha1	zha2	za1
草	cao3	cao3	cao4
远	yuan3	yuan3	yuan4
供	gong1	gong1	gong4
耶	ye1	ye1	ye2	ye5
什	shen2	shen2	shi2
朝	chao2	chao2	zhao1
假	jia3	gei1	jia3	jia4
听	ting1	ting1	ting4
尽	jin4	jin3	jin4
呢	ne5	ne5	ni2
藏	cang2	zang4	cang2
转	zhuan3	zhuai3	zhuan3	zhuan4
蒙	meng2	meng2	meng1	meng3
价	jia4	jia4	jie5
板	ban3	ban3	pan4
调	diao4	diao4	tiao2
压	ya1	ya1	ya4
宁	ning2	ning4	ning2
句	ju4	gou1	ju4
抢	qiang3	qiang1	qiang3
足	zu2	ju4	zu2
空	kong1	kong1	kong4
论	lun4	lun2	lun4
卷	juan3	juan3	juan4
吗	ma5	ma3	ma5
传	chuan2	chuan2	zhuan4
般	ban1	ban1	pan2
雨	yu3	yu3	yu4
追	zhui1	dui1	zhui1
脚	jiao3	jiao3	jue2
校	xiao4	jiao4	xiao4
圈	quan1	juan1	juan4	quan1
娜	na4	na4	nuo2
落	luo4	la4	lao4	luo4
倒	dao3	dao3	dao4
伯	bo2	ba4	bai3	bo2
结	jie2	jie1	jie2
冲	chong1	chong1	chong4
待	dai4	dai1	dai4
鸟	niao3	diao3	niao3
采	cai3	cai3	cai4
读	du2	dou4	du2
哪	na3	na3	na5	nei3
杆	gan1	gan1	gan3
丽	li2	li2	li4
氏	shi4	shi4	zhi1
堡	bao3	bao3	pu4
色	se4	se4	shai3
畜	chu4	chu4	xu4
铺	pu4	pu4	pu1
佛	fo2	fo2	fu2
炸	zha4	zha2	zha4
免	mian3	mian3	wen4
答	da2	da1	da2
骑	qi2	ji4	qi2
载	zai3	zai3	zai4
喝	he1	he1	he4
幢	zhuang4	chuang2	zhuang4
背	bei1	bei1	bei4
页	ye4	xie2	ye4
奔	ben1	ben1	ben4
症	zheng4	zheng4	zheng1
横	heng2	heng2	heng4
食	shi2	shi2	si4
尾	wei3	wei3	yi3
稍	shao1	shao1	shao4
术	shu4	shu4	zhu2
语	yu3	yu3	yu4
累	lei4	lei3	lei4
趟	tang4	tang1	tang4
亲	qin1	qin1	qing4
划	hua2	hua2	hua4
涌	yong3	chong1	yong3
散	san4	san4	san3
担	dan4	dan1	dan4
赚	zhuan4	zhuan4	zuan4
服	fu2	fu2	fu4
操	cao1	cao1	cao4
解	jie3	jie3	jie4
咱	za2	zan2	za2
岭	ling3	ling2	ling3
鲜	xian1	xian3	xian1
弹	tan2	dan4	tan2
涨	zhang3	zhang3	zhang4
甚	shen4	shen2	shen4
薄	bao2	bao2	bo4
撒	sa3	sa1	sa3
斗	dou3	dou3	dou4
禁	jin4	jin1	jin4
泥	ni2	ni2	ni4
挑	tiao1	tiao1	tiao3
埋	mai2	mai2	man2
钻	zuan4	zuan4	zuan1
折	zhe2	she2	zhe1	zhe2
监	jian1	jian1	jian4
挣	zheng4	zheng1	zheng4
挡	dang3	dang3	dang4
摸	mo1	mo1	mo2
炮	pao4	bao1	pao2	pao4
参	can1	can1	shen1
辟	pi4	bi4	pi4
似	si4	shi4	si4
屯	tun2	tun2	zhun1
虾	xia1	ha2	xia1
弄	nong4	long4	nong4
秘	mi4	mi4	bi4
喷	pen1	pen1	pen4
荷	he2	he2	he4
泡	pao4	pao1	pao4
浅	qian3	jian1	qian3
否	fou3	fou3	pi3
混	hun4	hun2	hun4
匹	pi3	pi1	pi3
磨	mo2	mo2	mo4
沈	shen3	chen2	shen3
模	mo2	mo2	mu2
夹	jia2	jia1	jia2	jia4
踏	ta4	ta1	ta4
渐	jian4	jian1	jian4
尺	chi3	che3	chi3
扛	kang2	gang1	kang2
喂	wei4	wei2	wei4
浆	jiang1	jiang1	jiang4
轴	zhou2	zhou2	zhou4
谜	mi2	mei4	mi2
档	dang3	dang3	dang4
拚	pin1	pan4	pin1
脏	zang1	zang4	zang1
歪	wai1	wai1	wai3
莎	sha1	sha1	suo1
扫	sao3	sao3	sao4
沉	chen2	chen1	chen2
识	shi2	shi2	zhi4
侧	ce4	ce4	zhai1
仔	zai3	zai3	zi1	zi3
恶	e4	e3	e4	wu4
晃	huang4	huang3	huang4
尿	niao4	niao4	sui1
臭	chou4	chou4	xiu4
饮	yin3	yin3	yin4
柜	gui4	gui4	ju3
脉	mai4	mai4	mo4
芯	xin1	xin1	xin4
凉	liang2	liang2	liang4
妻	qi1	qi1	qi4
挨	ai1	ai1	ai2
缝	feng2	feng2	feng4
腊	la4	xi1	la4
刹	sha1	cha4	sha1
罢	ba4	ba4	ba5
宿	su4	su4	xiu3	xiu4
刷	shua1	shua1	shua4
俊	jun4	jun4	zun4
膏	gao1	gao1	gao4
抹	mo3	ma1	mo3	mo4
削	xiao1	xiao1	xue1
唯	wei2	wei2	wei3
咋	za3	za3	ze2	zha4
旋	xuan2	xuan2	xuan4
钉	ding1	ding1	ding4
拾	shi2	she4	shi2
澄	cheng2	cheng2	deng4
掺	chan1	chan1	shan3
桔	ju2	jie2	ju2
杠	gang4	gang1	gang4
锯	ju4	ju1	ju4
厕	ce4	ce4	si4
匾	bian3	bian3	pian2
揣	chuai3	chuai1	chuai3
弟	di4	di4	ti4
笼	long2	long2	long3
衰	shuai1	cui1	shuai1
胖	pang4	pan2	pang4
汞	gong3	gong3	hong4
漂	piao1	piao1	piao3	piao4
殷	yin3	yan1	yin3
劈	pi1	pi1	pi3
驮	tuo2	tuo2	duo4
刨	pao2	bao4	pao2
粥	zhou1	yu4	zhou1
朴	pu3	po4	pu3
估	gu1	gu1	gu4
熬	ao2	ao1	ao2
隐	yin3	yin3	yin4
卒	zu2	zu2	cu4
么	me5	ma2	ma5	me5
叉	cha1	cha1	cha2	cha3
舌	she2	ji1	she2
攒	zan3	cuan2	zan3
楞	leng4	leng2	leng4
雀	que4	qiao1	que4
荡	dang4	dang4	tang4
淋	lin2	lin2	lin4
缉	ji1	ji1	qi1
凹	ao1	ao1	wa1

So does it seem reasonable to put these in a table in the addin for it to use to look in first for transcription before falling back on the first one in the chinese.db? I can also do two character polyphones.

jdlorimer / chinese-support-redux

Some characters get assigned incorrect readings randomly #173