sozysozbot / korean_hanja_sound

Data of korean hanja sound, taken from KS X 1001
https://sozysozbot.github.io/korean_hanja_sound/index.html

Dealing with hanjas having multiple readings #1

Open dahlia opened 5 years ago

dahlia commented 5 years ago

Although Korean hanja have only a single reading for the most part, there are some exceptional ones as well, e.g.:

There are many more, but in any case I believe this software could handle them better if it leveraged KS X 1002 as well. FYI, there is a property named kHangul in the Unicode Han Database, which covers readings in both KS X 1001 and KS X 1002. (I made a JSON version of it as well.)

Thanks!

sozysozbot commented 5 years ago

Thank you for your comment. The current implementation relies on two facts: KS X 1001 encodes different readings of the same hanja separately, and round-trip conversion between Unicode and KS X 1001 is possible. Hence it handles 「樂(U+6A02)」, 「樂(U+F914)」, 「樂(U+F95C)」 and 「樂(U+F9BF)」 separately, giving 악, 낙, 락, 요 respectively. Thus all I would have to do is apply NFC normalization to everything (which folds the compatibility ideographs into U+6A02); however, the original purpose of the script was actually to handle non-normalized texts, and I have not had a chance to fix or modify the current behavior. I haven't thought about KS X 1002, and I believe your JSON version could be quite helpful if I ever need to handle it. Thanks for the information.
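To make the two behaviors concrete, here is a minimal Python sketch. It is not the repository's actual code: the table holds only the four code points mentioned above, and the names `READINGS_BY_CODE_POINT`, `reading_as_encoded`, `build_merged_table` and `readings_after_nfc` are hypothetical, chosen for illustration. It assumes that merging readings under the NFC-normalized character is an acceptable way to support multiple readings.

```python
import unicodedata

# Readings per code point, exactly as listed in the comment above.
# Escapes are used so that an editor cannot silently normalize the literals.
READINGS_BY_CODE_POINT = {
    "\u6A02": "악",  # unified ideograph
    "\uF914": "낙",  # compatibility ideograph
    "\uF95C": "락",  # compatibility ideograph
    "\uF9BF": "요",  # compatibility ideograph
}


def reading_as_encoded(ch):
    """Look up a reading without normalizing (the non-normalized use case)."""
    return READINGS_BY_CODE_POINT.get(ch)


def build_merged_table(table):
    """Group readings under the NFC form of each character (hypothetical helper)."""
    merged = {}
    for ch, reading in table.items():
        key = unicodedata.normalize("NFC", ch)
        readings = merged.setdefault(key, [])
        if reading not in readings:
            readings.append(reading)
    return merged


MERGED = build_merged_table(READINGS_BY_CODE_POINT)


def readings_after_nfc(ch):
    """Return every known reading for a character once it is NFC-normalized."""
    return MERGED.get(unicodedata.normalize("NFC", ch), [])


if __name__ == "__main__":
    print(reading_as_encoded("\uF95C"))
    print(readings_after_nfc("\uF95C"))
```

The trade-off is the one described above: `reading_as_encoded` keeps the distinction that KS X 1001 encodes, while `readings_after_nfc` loses it and can only report all candidate readings for the character.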