tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Loss of pronunciations for Chinese languages #65

Closed hdavidethan closed 2 years ago

hdavidethan commented 3 years ago

I just extracted all of the data under Chinese and noticed that many lack pronunciation/sounds information. In some cases, when a pronunciation is listed, the information containing the origin/language of the pronunciation is lost.

For instance, for the dictionary page for 阿姨, the JSON data shows only 3 IPA pronunciations: /a²⁴ i¹¹/, /a²⁴ (j)i¹¹/, /a³³⁻²³ i⁵⁵/. Comparing this with Wiktionary, these three pronunciations come from 3 different entries: Hakka (Northern Sixian dialect), Hakka (Southern Sixian dialect), and Min Nan (Teochew dialect). The JSON data makes no distinction between the dialect/language of origin of any of these IPA entries. On Wiktionary, there are 8 entries from 4 different languages.

Unfortunately, multiple Sinitic languages are combined into the same entries since they all share characters. I also noticed that the Wiktionary source only contains the corresponding romanization for each language (using the zh-pron module), i.e. Mandarin is encoded using Pinyin, Cantonese with Jyutping, etc. The zh-pron module then automatically generates the alternative romanizations used in different areas and the corresponding IPA for the given dialects of each language. This information is visible when you click the "more" toggle button in the pronunciation box.

Here are some images to illustrate this issue:

image

Compare the above image to the Wiktionary page:

image

and the Wiktionary source using zh-pron

image

It would be nice to include the corresponding romanization schemes of these languages in the dump and let users convert these romanizations to the IPA of the corresponding language/dialect. Alternatively, could the information listed by the zh-pron module be incorporated somehow to show which language/dialect each IPA entry comes from?
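To make the request concrete, here is a minimal Python sketch of how a consumer could use the data if each IPA record carried its language/dialect. The field names ("sounds", "ipa", "tags"), the tag values, and the helper `group_ipa_by_variety` are assumptions for illustration only, not the extractor's confirmed output format. The IPA strings and dialect names are the ones quoted above for 阿姨, paired in the order listed.

```python
from collections import defaultdict

# Hypothetical entry: the keys "sounds", "ipa" and "tags" and the tag values
# are assumptions for illustration, not confirmed extractor output.
entry = {
    "word": "阿姨",
    "sounds": [
        {"ipa": "/a²⁴ i¹¹/", "tags": ["Hakka", "Northern Sixian"]},
        {"ipa": "/a²⁴ (j)i¹¹/", "tags": ["Hakka", "Southern Sixian"]},
        {"ipa": "/a³³⁻²³ i⁵⁵/", "tags": ["Min Nan", "Teochew"]},
    ],
}


def group_ipa_by_variety(entry):
    """Map each (language, dialect, ...) tag tuple to its IPA strings."""
    grouped = defaultdict(list)
    for sound in entry.get("sounds", []):
        ipa = sound.get("ipa")
        if ipa is None:
            continue  # skip records that carry no IPA (e.g. audio only)
        tags = sound.get("tags", [])
        # Untagged IPA is exactly the problem reported here; keep it visible.
        key = tuple(tags) if tags else ("unknown",)
        grouped[key].append(ipa)
    return dict(grouped)


if __name__ == "__main__":
    for variety, ipas in group_ipa_by_variety(entry).items():
        print(" / ".join(variety), "->", ", ".join(ipas))
```

With the data as currently extracted, all three IPA strings would fall under the "unknown" key, which is the loss of information described in this issue.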

tatuylonen commented 3 years ago

Properly capturing Chinese pronunciations is on my TODO list. Unfortunately I have a number of other issues on my TODO list as well, some of them urgent. I'm estimating it will be a few weeks but not months before I get to these. The work itself is probably a few days. I can try to prioritize it earlier if you need it urgently.

hdavidethan commented 3 years ago

Thanks for the reply! I just saw the item on the TODO list (oops). I understand that you have other things to prioritize. I don't really need it soon, so no worries if you can't prioritize it earlier. I'll just monitor this issue for updates.

longjiang commented 2 years ago

> Properly capturing Chinese pronunciations is on my TODO list. Unfortunately I have a number of other issues on my TODO list as well, some of them urgent. I'm estimating it will be a few weeks but not months before I get to these. The work itself is probably a few days. I can try to prioritize it earlier if you need it urgently.

Thanks so much. I've created an open-source language-learning project that helps people learn all languages of the world, including all varieties of Chinese. For example:

Currently these are still using modern Chinese pronunciations. But if I can get pronunciations for all varieties of Chinese, the dictionary will be a lot more useful. Without your project, zerotohero.ca would not have been possible. Thank you!!!!

kevinsung commented 2 years ago

@tatuylonen This issue is a high priority for me so I'm interested in tackling it myself. If you already have an idea of what needs to be done, any hints would be helpful.

tatuylonen commented 2 years ago

I'll try to get to this soon (I've been prioritizing inflection table extraction, but will soon be able to focus on other things again). This is next on my list.

yoskari commented 2 years ago

I have implemented a preliminary fix. I will improve the presentation of the data on the site in the next few days. Here is an example: https://kaikki.org/dictionary/All%20languages%20combined/meaning/%E5%8C%85/%E5%8C%85%E8%A2%B1/%E5%8C%85%E8%A2%B1.html