open-dict-data / ipa-dict

Monolingual wordlists with pronunciation information in IPA
https://open-dict-data.github.io/ipa-lookup/
MIT License

Heterophones not in the list? #23

Open sancarn opened 3 years ago

sancarn commented 3 years ago

Heterophones like tear and record are not present in the IPA dictionary for English. Is there a specific reason for this? I understand these wouldn't fit neatly into the data structure. Or is it just an oversight?

dohliam commented 3 years ago

Thanks for raising this question, but I'm not sure I quite understand the issue. Heterophones are explicitly included in the data structure, as described in the Readme:

Where multiple possible pronunciations exist for a given entry, they should all be listed (separated by commas), even if they have different senses. For example, the word est has two different pronunciations in French (/ɛst/ and /ɛ/), depending on whether it is a noun or an (unrelated) verb, so the entry for est lists both of these pronunciations.

Furthermore, the entries for tear and record are included in the English dictionary as follows:

tear    /ˈtɛɹ/, /ˈtɪɹ/
record  /ˈɹɛkɝd/, /ɹəˈkɔɹd/, /ɹɪˈkɔɹd/

These seem to pretty unambiguously include the multiple possible pronunciations for each word.
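
As a rough illustration of that format, here is a minimal Python sketch that splits an entry into its word and its list of pronunciations, then picks out the entries with more than one. It assumes the word and the IPA field are separated by a tab; the function names and the `cat` line are made up for the example and are not part of the project.

```python
def parse_entry(line):
    """Split one dictionary line into (word, [pronunciations]).

    Assumes the word and the IPA field are separated by a tab.
    """
    word, _, ipa_field = line.rstrip("\n").partition("\t")
    # Pronunciations are comma-separated; drop surrounding whitespace.
    pronunciations = [p.strip() for p in ipa_field.split(",") if p.strip()]
    return word.strip(), pronunciations


def heterophones(lines):
    """Return a dict of the entries that list more than one pronunciation."""
    result = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, pronunciations = parse_entry(line)
        if len(pronunciations) > 1:
            result[word] = pronunciations
    return result


# The first two lines are the entries quoted above; "cat" is illustrative only.
sample = [
    "tear\t/ˈtɛɹ/, /ˈtɪɹ/",
    "record\t/ˈɹɛkɝd/, /ɹəˈkɔɹd/, /ɹɪˈkɔɹd/",
    "cat\t/ˈkæt/",
]

for word, pronunciations in heterophones(sample).items():
    print(word, pronunciations)
```

Run against a full dictionary file, the same loop would list every word that has more than one pronunciation recorded.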

If you are referring to the data in the en_UK list, this list derives from the ipacards project and is somewhat less complete than the cmudict-ipa based dictionary used by en_US. In that case, it is simply a matter of the en_UK dictionary needing further additions/contributions to make up for any missing words. Pull requests to update the dictionary are very welcome!

sancarn commented 3 years ago

Right, yes! Sorry, I was referring to the en_UK list. I'm British, so I naturally gravitated towards that list 😛

On the topic of heterophones, as this would affect my pull request: what would the opinion be about including different pronunciations from different accents? For instance, grass is pronounced /ɡrɑːs/ in most areas of Britain but /ɡraːs/ in other dialects. Would we include them all?

dohliam commented 3 years ago

@sancarn The more accents/dialects/speech varieties the better! The current approach is to separate these into different dictionaries so that the list for each language variant is, internally speaking, as phonemically consistent as possible. So, just as I'm hoping someone will eventually generate en_AU and en_IE (not to mention en_SG etc.) dictionaries, any and all regional variants from around the UK would also be very welcome as new dictionary lists. (This is all assuming that the distinction between two different pronunciations is indeed a matter of geographic region and not just a case of allophones within the same standard or community.)

In terms of expanding the en_UK dictionary, I took a look at the ipacards repo and if I recall correctly our version is taken from the pre-generated list there which contains about 65K entries.

Looking at the code that generated that list, it appears that heteronyms are intentionally stripped out based on a list of 972 heteronyms (which include both tear and record among others). So it should be possible to regenerate this list with the missing heteronyms by removing those three lines (or just run the script again using the heteronyms file as the main vocabulary source, which might be easier).
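
Before regenerating anything, it may help to check which heteronyms are actually absent from the current list. The following is only a sketch under assumptions: it treats the heteronyms file as one word per line and the dictionary as tab-separated entries, neither of which is confirmed here, and `missing_words` is a hypothetical helper rather than anything from the ipacards code.

```python
def missing_words(dictionary_lines, heteronym_words):
    """Return heteronyms that have no entry in the dictionary, sorted.

    Assumes each dictionary line is "word<TAB>pronunciations" and that
    heteronym_words is an iterable of plain words (both are assumptions).
    """
    present = {
        line.split("\t", 1)[0].strip().lower()
        for line in dictionary_lines
        if line.strip()
    }
    candidates = (word.strip().lower() for word in heteronym_words)
    return sorted({w for w in candidates if w and w not in present})


# Tiny made-up inputs, for illustration only.
dictionary = ["cat\t/kæt/", "dog\t/dɒɡ/"]
heteronyms = ["tear", "record", "cat", ""]

print(missing_words(dictionary, heteronyms))
```

With real files, the two arguments could be the open file objects for the en_UK list and the heteronyms list.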

If you'd like to give this a try, please go ahead, and I would be happy to accept the resulting PR. I would also be glad to look into this myself, but I likely won't have a chance to do so before early August.