jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License

possible to add a custom lookup dict for characters_to_jyutping #37

Open raymond00000 opened 1 year ago

raymond00000 commented 1 year ago

Describe the bug

From https://pycantonese.org/jyutping.html I understand that characters_to_jyutping draws on two data sources: (i) the HKCanCor corpus data included in the PyCantonese library, and (ii) the rime-cantonese data.

The issue is that at least one character seems to be given an incorrect jyutping when converted.

To reproduce

pycantonese.characters_to_jyutping('到')
[('到', 'dou2')]
pycantonese.characters_to_jyutping('感到')
[('感到', 'gam2dou2')]
pycantonese.characters_to_jyutping('到底')
[('到底', 'dou3dai2')]

Expected behavior

According to https://humanum.arts.cuhk.edu.hk/Lexis/lexi-can/, 到 should be dou3, so the expected results are:

pycantonese.characters_to_jyutping('到')
[('到', 'dou3')]
pycantonese.characters_to_jyutping('感到')
[('感到', 'gam2dou3')]
pycantonese.characters_to_jyutping('到底')
[('到底', 'dou3dai2')]

Is there any way to resolve this, so that pycantonese.characters_to_jyutping returns dou3 for 到 and 感到? Thanks!
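Until the bundled data changes, one possible workaround is to post-process the result with a custom lookup dict. The sketch below is only an illustration: `apply_overrides` and `OVERRIDES` are hypothetical names, not part of PyCantonese, and the override keys are assumed to match the segmented words exactly as `characters_to_jyutping` returns them.

```python
from typing import Dict, List, Optional, Tuple

# Shape of the characters_to_jyutping output shown above:
# a list of (word, jyutping) tuples.
Pairs = List[Tuple[str, Optional[str]]]

# Hypothetical user-supplied corrections, keyed by segmented word.
OVERRIDES: Dict[str, str] = {"到": "dou3", "感到": "gam2dou3"}


def apply_overrides(pairs: Pairs, overrides: Dict[str, str]) -> Pairs:
    """Replace the jyutping of any word that has an entry in `overrides`."""
    return [(word, overrides.get(word, jyutping)) for word, jyutping in pairs]


# The output reported in this issue, used here as stand-in input.
reported = [("感到", "gam2dou2")]
fixed = apply_overrides(reported, OVERRIDES)
print(fixed)
```

With PyCantonese installed, the same helper could wrap a real call, e.g. `apply_overrides(pycantonese.characters_to_jyutping('感到'), OVERRIDES)`. Words without an override entry pass through unchanged.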

jacksonllee commented 1 year ago

Hi, sorry for not replying earlier. Between rime-cantonese and HKCanCor, the current code prefers the rime-cantonese data in case the two data sources don't agree. I'll have to dig into what the included rime-cantonese data looks like. Maybe the upstream rime-cantonese data has been updated and I could just use the updated data, or I could override these known cases. Thank you for reporting this!
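The precedence described here (known overrides first, then rime-cantonese, then HKCanCor) can be pictured as a layered lookup. This is a rough sketch and not the library's actual implementation; the three dicts hold toy placeholder values for illustration only and are not the real contents of either data source.

```python
from collections import ChainMap

# Toy placeholder mappings (assumed values, for illustration only).
hkcancor_toy = {"到": "dou3", "感到": "gam2dou3"}
rime_toy = {"到": "dou2"}
known_overrides = {}  # manually curated fixes would go here

# ChainMap searches its mappings left to right, so earlier layers win.
lookup = ChainMap(known_overrides, rime_toy, hkcancor_toy)

before = lookup["到"]          # rime layer wins over the HKCanCor layer
fallback = lookup["感到"]      # only the HKCanCor layer has this key

# Adding an override takes effect immediately; ChainMap is a live view.
known_overrides["到"] = "dou3"
after = lookup["到"]

print(before, fallback, after)
```

Keeping the overrides as a separate top layer means the upstream data can be refreshed without losing the manual corrections.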

laubonghaudoi commented 1 year ago

So I checked the rime-cantonese data. At least for 感到 and 到底 in word.csv, the pronunciations are gam2 dou3 and dou3 dai2, which are correct.

jacksonllee commented 1 year ago

@laubonghaudoi Ah, I had no idea you guys had set up the CanCLID/rime-cantonese-upstream repo! Now I also see the char.csv file with this:

到,dou2,常用,,,
到,dou3,常用,,,

For my purposes, I'd need an automatic way to tell which entry to pick when a char (or a word, if this happens in word.csv) has more than one jyutping. Is it safe to always choose the last one? Or is there another lookup or something?
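Whichever selection rule turns out to be right, a first step could be to list the entries that have more than one reading, so they can be reviewed or overridden. The sketch below assumes only that the first two columns are the character and its jyutping, as in the two rows quoted above; the meaning of the remaining columns is not assumed, and `readings_by_char` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# The two rows quoted above, standing in for the full char.csv file.
SAMPLE = "到,dou2,常用,,,\n到,dou3,常用,,,\n"


def readings_by_char(csv_text):
    """Map each character to its distinct jyutping readings, in file order."""
    readings = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        char, jyutping = row[0], row[1]
        if jyutping not in readings[char]:
            readings[char].append(jyutping)
    return dict(readings)


readings = readings_by_char(SAMPLE)
# Characters with more than one reading need a selection rule or an override.
ambiguous = {c: r for c, r in readings.items() if len(r) > 1}
print(ambiguous)
```

This only detects the ambiguous entries; it does not decide which reading to prefer, since that depends on how the upstream file orders or labels its rows.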