IPA dictionary for German

devio-at commented 1 year ago

I found that the German dictionary has some issues.

I therefore created a script to extract German IPA from Wiktionary, and would like to submit the generated data files.

Please see my repo for more information: https://github.com/devio-at/german-ipa-dict

dohliam commented 1 year ago

@devio-at Thank you for taking the time to examine the German data and write up your findings!

It is not surprising that you were able to find errors in the dictionary -- this is of course the downside of automatically-generated data, and also the reason that the Readme lists all dictionaries generated this way as "experimental" until they have been checked manually by a human.

I went back to check on the original repository for the German IPA generation script, but as you noticed, it seems to no longer exist. If you are interested in knowing how the script came to generate inaccurate output (and I certainly am!), I forked the repository a few years ago, so you can still examine the code and the logic that it was using.

Thanks also for generating the CSV files from Wiktionary. These look much better as a replacement for the current de dictionary. I think the next step would be to merge and de-duplicate the two files (de_dewikt.csv and de_enwikt.csv) so they can be added to the repo. They will need some cleanup for a few minor issues before they can be included, though.

Some things I noticed:

There are a large number of entries in the German Wiktionary output with blank pronunciation sections (//). A lot of these seem to be proper nouns, though there are also some longer phrases (e.g., alles fit im Schritt) and obscure or technical words (e.g., troglophil). I assume the best approach would be to just filter these out or delete them with a regex.
The data from the English Wiktionary has quite a few usage notes that would need to be stripped somehow (e.g., Opportunität,/ˌɔpɔʁtuniˈtɛːt/,"qualifier:standard; used naturally in western Germany and Switzerland"). Assuming these are all in the third column of the CSV file (I haven't checked to confirm this), it shouldn't be too difficult to do this.
There are many template inclusions that show up in the data (e.g., Urte,/ˈʊʁtəs/,"{{Pl.2}}"). Presumably most of these can be removed by searching text between curly braces.
Another thing to look out for is HTML entities that show up in the entries. These are a bit trickier because they don't seem to be restricted to the third column of the file (e.g., tosen,/ˈtoːzn̩/)
Some oddly-formatted entries -- these might need to be checked individually. Example:

Zubrötchen,"[ˈtsuːˌbr̺øːtçən]&lt;ref&gt;In Natal German the pronunciation of the consonantal /r/ is realized [[w:Apical consonant,apically]] ([r̺]) (cf. Hildegard Irma Stielau: ''Nataler Deutsch: Eine Dokumentation unter besonderer Berücksichtigung des englischen und afrikaansen Einflusses auf die deutsche Sprache in Natal'', Franz Steiner Verlag, Wiesbaden 1980 (Deutsche Sprache in Europa und Übersee. Berichte und Forschungen ; volume 7),","ISBN:3-515-02635-5, page 9.)&lt;/ref&gt;"
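A minimal sketch of how the cleanup steps listed above could be combined, assuming each row is `word,pronunciation,remarks` and that only the first two columns are kept. The function name `clean_rows` and both regexes are illustrative, not part of either repository, and the oddly-formatted entries would still need a manual check afterwards.

```python
import csv
import html
import re

# Wiki template inclusions such as {{Pl.2}}
TEMPLATE = re.compile(r"\{\{.*?\}\}")
# A <ref>...</ref> footnote, or an unterminated <ref> running to the end of the field
REF = re.compile(r"<ref>.*?</ref>|<ref>.*", re.DOTALL)


def clean_rows(lines):
    """Yield (word, ipa) pairs from CSV lines, skipping entries without a pronunciation.

    Assumes column 1 is the word and column 2 the pronunciation; any further
    columns (usage notes, templates) are ignored.
    """
    for row in csv.reader(lines):
        if len(row) < 2:
            continue
        # Decode HTML entities first so that &lt;ref&gt; becomes <ref>
        word = html.unescape(row[0]).strip()
        ipa = html.unescape(row[1])
        ipa = REF.sub("", ipa)
        ipa = TEMPLATE.sub("", ipa).strip()
        # Drop blank pronunciation sections such as //
        if not word or ipa in ("", "//", "[]"):
            continue
        yield word, ipa
```

Because the usage notes and most templates sit in the third column, simply not reading that column removes them. The regexes only matter for debris that leaks into the pronunciation itself.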

devio-at commented 1 year ago

@dohliam Thank you for your feedback, I'll have a look at the issues.

devio-at commented 1 year ago

@dohliam it took me a while, but I have now (hopefully!) succeeded in generating a "clean" de_dewikt.csv file.

Regarding your second issue, I created the 3rd column intentionally so as to include any remarks (regional, dialect, usage) given for a pronunciation (alternative). Do you consider a 3rd column problematic?

dohliam commented 1 year ago

@devio-at That's great news, thanks for doing this! :+1:

The de_dewikt.csv file is significantly larger than de_enwikt.csv now. I assume that means all of the unique entries in de_enwikt.csv have now been merged into de_dewikt.csv? If not, I can compare the two files and merge any additional entries in.

> Do you consider a 3rd column problematic?

Definitely not! The additional remarks sound useful, so it makes sense that you kept them. For the purposes of the ipa-dict project, though, the specific file format we are using is tab-separated with only two columns, with any additional pronunciations included as a comma-separated list in the second column. So only the first two columns will be extracted for use here.

(On the other hand, it would be really interesting to collect together any regional pronunciations into separate dictionaries for each region -- let me know if this sounds feasible!)

On a quick glance through the file, it looks like there are quite a few duplicate entries where graphemes with different pronunciations have been listed on different lines. In keeping with the format mentioned above, these will be merged so that each grapheme appears only once alongside all possible pronunciations.
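To illustrate the merge described above, here is a rough sketch that collapses duplicate graphemes into one line each, in the two-column tab-separated layout. The helper names `merge_duplicates` and `to_tsv`, and the exact `", "` separator, are assumptions for illustration, not the script actually used for the repo.

```python
def merge_duplicates(pairs):
    """Group (word, ipa) pairs so each word maps to its distinct pronunciations.

    Order of first appearance is preserved for both words and pronunciations.
    """
    merged = {}
    for word, ipa in pairs:
        readings = merged.setdefault(word, [])
        if ipa not in readings:
            readings.append(ipa)
    return merged


def to_tsv(merged):
    """Render the merged mapping as tab-separated lines: word<TAB>ipa1, ipa2."""
    return "\n".join(
        f"{word}\t{', '.join(readings)}" for word, readings in merged.items()
    )
```

Matching is case-sensitive on purpose here, so that entries such as a noun and a same-spelled lowercase word stay separate. Whether that is the right choice for the dictionary is a judgment call.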

dohliam commented 1 year ago

Ok, I managed to extract the data from de_dewikt.csv, and have updated the repo (with a credit to you for the data). This looks to be a big improvement, so thank you again for your work on this!

I ended up writing a script to remove entries with comments, selectively re-import them, and then merge all the duplicate entries.

For now, I have temporarily labelled the resulting file de.txt, even though it should more properly be named de_DE.txt, since it currently reflects the standard in Germany only. What I would like to do eventually is to rename the file to de_DE.txt, and then create several new files -- at a minimum de_AT.txt (for Austria), and de_CH.txt (for Switzerland), containing only pronunciations for those locales.

Part of this work can be done using the commented lines from your CSV file (for example, it should be trivial to extract all the lines labelled Österreich or österreichisch and dump them to de_AT.txt), but it would definitely benefit from some expert assistance to ensure that the resulting files are correct, and also to make a decision about how much of the "main" vocabulary to merge into the localized variants. For example, I assume that where readings are labelled österreichisch auch, they should be merged with the main pronunciation -- but as a second option, not the first.
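As a starting point for discussion, the extraction could look something like the sketch below. It assumes rows of `(word, ipa, remark)` taken from the CSV, treats any remark containing "österreich" (case-insensitive) as Austrian, and treats an empty remark as the main pronunciation. The name `build_at` and the sample data used to try it out are hypothetical, and the rule for how much main vocabulary to carry over is exactly the part that needs expert review.

```python
import re

# Matches both "Österreich" and "österreichisch"
AT_LABEL = re.compile(r"österreich", re.IGNORECASE)
# "österreichisch auch": an additional Austrian reading alongside the main one
AT_ALSO = re.compile(r"österreichisch\s+auch", re.IGNORECASE)


def build_at(rows):
    """Return {word: [readings]} for a de_AT file from (word, ipa, remark) rows."""
    standard = {}   # readings with no remark at all
    austrian = {}   # readings whose remark mentions Austria
    also = set()    # words with at least one "österreichisch auch" reading
    for word, ipa, remark in rows:
        if AT_LABEL.search(remark):
            austrian.setdefault(word, []).append(ipa)
            if AT_ALSO.search(remark):
                also.add(word)
        elif not remark:
            standard.setdefault(word, []).append(ipa)

    result = {}
    for word, readings in austrian.items():
        if word in also:
            # Main pronunciation first, Austrian reading as the second option
            combined = standard.get(word, []) + readings
        else:
            combined = readings
        # Remove duplicates while keeping order
        result[word] = list(dict.fromkeys(combined))
    return result
```

Words with no Austrian-labelled reading are left out entirely in this version, and a single "österreichisch auch" remark pulls in the main readings for the whole word. Both are simplifications that could be changed once the merge policy is decided.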

Another important task would be to create databases for the sub-national varieties, but that seems like it might be less straightforward and we might need a better (more thorough) data source.

Anyway, let me know what you think. I'll close this issue for now and we can open new issues to work on the other varieties. In the meantime the data should hopefully be more useful than the old experimental file!

open-dict-data / ipa-dict

IPA dictionary for German #37