w3c / ruby-t2s-req

Text to Speech of Electronic Documents Containing Ruby: User Requirements
https://w3c.github.io/ruby-t2s-req/

Possible readings of a kanji character #7

Closed murata2makoto closed 8 months ago

murata2makoto commented 2 years ago

@aleventhal wrote as a comment to another issue in a different repository:

MM>Their implementations examine code points of base characters.

Can we get an understanding of where this heuristic falls down? Or, if it's highly accurate, then do we actually need markup to differentiate between para-ruby and general-ruby?

This web page by a Japanese ministry shows which readings each kanji can have. For pedagogical reasons, however, it oversimplifies the mess of kanji. For example, consider 生, which is taught in the first grade of Japanese elementary schools. Only 12 readings of this character are listed.

But the reality is different. How hard a kanji character is to read in a particular context does not necessarily depend on how difficult the character itself is.

There are more than 100 ways of reading this character: 生きる (ikiru), 生える (haeru), 生む (umu), 先生 (sensei), 生もの (nama mono), 生糸 (kiito), 生い立ち (oitachi), 弥生 (yayoi), 生憎 (ainiku), 生さぬ仲 (nasanu naka), 苔の生すまで (kokeno musumade), 生簀 (ikesu), 早生 (wase), 晩生 (okute), 芝生 (shibafu), 生業 (nariwai), 生粋 (kissui), and so forth. If we consider proper names such as 福生, 羽生, 生保内, and 壬生, things will become even more difficult. I do not believe that the required heuristics will be written down and implemented in the near future.
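To make the point concrete, here is a minimal sketch in Python. It is not taken from any existing reader implementation. The names `is_kanji`, `READINGS`, and `lookup_reading` are hypothetical, and the table holds only a few of the examples listed above. It shows that a code-point test gives the same answer for 生 in every word, while the reading can only be found by looking at the whole word.

```python
from typing import Optional

# A few of the words listed above, with the romanized readings given there.
# Every entry contains the same character, 生 (U+751F).
READINGS = {
    "生きる": "ikiru",
    "生える": "haeru",
    "生む": "umu",
    "先生": "sensei",
    "生糸": "kiito",
    "弥生": "yayoi",
    "芝生": "shibafu",
    "生業": "nariwai",
    "生粋": "kissui",
}


def is_kanji(ch: str) -> bool:
    """Code-point test (assumption: only the basic CJK Unified Ideographs
    block, U+4E00..U+9FFF, is checked; extension blocks are ignored)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def lookup_reading(word: str) -> Optional[str]:
    """Whole-word lookup; returns None for words missing from the table,
    such as the proper names mentioned above."""
    return READINGS.get(word)


if __name__ == "__main__":
    for word, reading in READINGS.items():
        # The code-point test cannot tell these occurrences of 生 apart.
        flags = [is_kanji(ch) for ch in word]
        print(word, flags, reading)
    print(lookup_reading("福生"))  # not in the table
```

A table like this would have to grow with every compound and proper name, which is why I doubt that heuristics based on code points alone are sufficient.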

This page shows which kanji are taught in which grade in K12. A DAISY reader uses this list to hide kanji characters below the specified grade. But if we really want to mimic printed textbooks, we will also have to know in which semester each kanji is taught. Different textbooks teach kanji characters in different orders.

murata2makoto commented 8 months ago

To my surprise, ChatGPT (V3.5) is already quite good at reading the above examples!

murata2makoto commented 8 months ago

This issue does not contain proposed changes. Moreover, as demonstrated by ChatGPT, it is possible to handle many possible readings of 生 automatically. I will close this issue.