This is unnecessary - the Unihan database provides the most common (according to their sources) reading for individual characters in the kMandarin field, and this data is already included in the SQLite database here in the hanzi table (easiest way to view it is to use a client like DBeaver or something). I checked a selection of the characters from your table and they agree on the reading.
As I stated here, really all that needs to be done is to make it so that individual characters don't use the word/dictionary table (cidian).
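For reference, something like this is enough to check a reading without a GUI client. Note this is just a sketch: the database filename and the column names are assumptions about the schema; only the hanzi table name comes from this thread.

```python
import sqlite3

# Minimal sketch: pull the Unihan kMandarin reading for one character from
# the bundled SQLite file instead of opening it in DBeaver. The filename
# ("chinese.db") and column names ("cp", "kmandarin") are guesses about the
# schema -- only the "hanzi" table name comes from this discussion.
conn = sqlite3.connect("chinese.db")
row = conn.execute(
    "SELECT kmandarin FROM hanzi WHERE cp = ?", ("地",)
).fetchone()
print(row[0] if row else "no entry")
conn.close()
```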
Bugger, I didn't notice that. I was focused on the cidian table. They do agree 98%. There are cases like this:
| hanzi | zidian | laopo |
|---|---|---|
| 地 | de dì | de5 |
| 差 | chà chā | cha1 |
| 薄 | báo bó | bao2 |
| 斗 | dòu dǒu | dou3 |
| 识 | shí shì | shi2 |
According to the Unihan docs, the latter reading in each of these pairs is a Taiwanese variation. I'm not sure I buy that. The 地 hanzi is the 46th most frequent character in the giga-zh list, almost certainly because it's a grammar particle. When used as a particle, I think it's always pronounced de5. My (mainland) wife says so anyway, and this indicates Taiwanese speakers also say de: https://forvo.com/word/%E5%9C%B0%EF%BC%88de%EF%BC%89/#zh
It's obviously important we get 地 right when it's used as a particle...how should we handle that?
Yeah, my guess is the Unihan people didn't give any consideration to the different meanings in the case of 地 (or, if they did, decided it wasn't their problem, "out of scope"); rather, the Taiwanese source(s) they used said that di4 (i.e. the noun meaning) is more common for whatever reason, so it was simply included. Keep in mind that the readings given are almost certainly based on how often the character appears in compounds too, whereas for our purposes we're mostly only considering standalone meanings, since compounds should (ideally) be covered by the word dictionary.
Ultimately, automated Chinese word segmentation and transliteration are hard problems (an active area of research) because the correct result so often depends on contextual meaning, so whatever approach is taken is going to be flawed in some way; it's always going to be "best effort".
Anyway, these kinds of discrepancies are probably best solved by adding "manual override" data, so basically what you were already doing, but only for characters where we think the kMandarin readings aren't ideal. For what it's worth, given my limited knowledge, I think I agree with your wife about the more likely readings for 差 and 斗.
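To make the "manual override" idea concrete, here is a rough sketch of loading such data. The polyphones.tsv name comes from this thread, but the exact layout (character, tab, pinyin) is just my assumption:

```python
import csv

def load_overrides(path="polyphones.tsv"):
    """Sketch: read manual reading overrides from a TSV file.

    Assumes each row is "<hanzi or word>\t<pinyin>"; the real
    polyphones.tsv layout may differ.
    """
    overrides = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2 and not row[0].startswith("#"):
                overrides[row[0]] = row[1]
    return overrides

# e.g. load_overrides().get("地") -> "de5" if the override table says so
```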
OK, here's the subset of polyphones.tsv where my wife's pick differs from kMandarin (or where kMandarin gives a pair). I'm going to ask her to double-check these since it's a small set. So basically I'll update polyphones.tsv to this (including any later adjustments), and then add short-circuiting so that single-character words use the hanzi table's kMandarin reading instead of cidian?
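Roughly the lookup order I have in mind, as a sketch (all three helper names are hypothetical stand-ins, not the add-on's actual functions):

```python
def reading_for(word, overrides, hanzi_lookup, cidian_lookup):
    """Sketch of the proposed lookup order.

    overrides:     dict built from polyphones.tsv (manual picks win)
    hanzi_lookup:  returns the kMandarin reading for one character
    cidian_lookup: returns the reading for a multi-character word
    All three are hypothetical stand-ins for the add-on's real code.
    """
    if word in overrides:
        return overrides[word]
    if len(word) == 1:
        # short-circuit: single characters never consult cidian
        return hanzi_lookup(word)
    return cidian_lookup(word)
```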
There are also about 250 two-character words in cidian with ambiguous pronunciations. I'll try to add those in as well, because I'm sure some of those readings are much more commonly used than others.
Surprisingly, it took me a while to find even a few two-character polyphones that didn't already transcribe the same way my wife picked. So there are some possibly redundant two-character polyphones in the override table, but that could change if SQLite's query optimization changes or an ORDER BY clause gets added (as it probably should be).
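For what it's worth, an ORDER BY would at least make the picked row deterministic. A sketch follows, with the caveat that the table and column names and the tiebreak are my guesses, not the real cidian schema:

```python
import sqlite3

# Sketch only: pick one pronunciation per word deterministically instead of
# relying on whatever row order SQLite happens to return. The filename and
# column names ("traditional", "pinyin") are assumptions about the schema,
# and ordering by pinyin is an arbitrary (but stable) tiebreak.
conn = sqlite3.connect("chinese.db")
row = conn.execute(
    "SELECT pinyin FROM cidian WHERE traditional = ? ORDER BY pinyin LIMIT 1",
    ("差不多",),
).fetchone()
print(row[0] if row else "no entry")
conn.close()
```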
I actually ran the tests before those last commits. They all pass for me.
This workaround should address most of the problems mentioned in https://github.com/luoliyan/chinese-support-redux/issues/173. It's certainly not perfect, but it should mean having to correct tones in simple sentences a lot less often. A further update could add two-syllable polyphones; that should only require adding them to the polyphones.tsv file.
I guess I could have put the polyphones into the SQLite file, but that seemed like overkill. I didn't actually run the tests; I developed this entirely by editing the add-on's Python files and seeing what happened when I ran Anki. :)