techczech / phonicsengine

A phonics API for the English language.

Origins of the file "phonics_engine_dictionary_english-v7.csv"? #1

Open rbracco opened 4 years ago

rbracco commented 4 years ago

First off, thank you so much. You are the first person I've seen hosting a dictionary that maps graphemes to phonemes in common words. I have only been able to find word-level transcriptions, with no linkage between combinations of letters and phones.

I was wondering, since this is quite a hard problem, what the origins of this dictionary are and how it was generated. Thank you!

techczech commented 3 years ago

Hi Robert,

Sorry for the late reply. We had to create the dictionary manually because no such thing existed. See the associated paper on the phonics engine: https://www.researchgate.net/publication/280147388_Building_a_Phonics_Engine_for_Automated_Text_Guidance

The process was to start with a dictionary that had pronunciation at the word level and then create a set of equivalence rules to do the matching automatically. After that we had to manually review samples to see where additional rules were needed, which meant a lot of regexp work. We had plans to do more but didn't have funding.

Here are some examples of the rules we used: https://docs.google.com/document/d/1-DHwHyeaZwdo_ZjDSwWXe0TwE-GfdKJ7S2xyFsoyVkw/edit?usp=sharing
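
For readers who want a feel for what rule-based matching of this kind might look like, here is a minimal sketch. It is not the code or the rule set used for the phonics engine: the `RULES` table, the ARPAbet-style phoneme symbols, and the `align` function are all hypothetical, chosen only to illustrate the idea of starting from a word-level pronunciation and using grapheme/phoneme equivalence rules to split it up. Words for which no alignment is found come back as `None`, which roughly corresponds to the cases that would need manual review or an additional rule.

```python
# Hypothetical equivalence rules: grapheme -> phonemes it may spell.
# Illustrative only; these are NOT the rules from the phonics engine.
RULES = {
    "sh": {"SH"},
    "ch": {"CH", "K"},
    "th": {"TH", "DH"},
    "ck": {"K"},
    "ee": {"IY"},
    "s": {"S", "Z"},
    "h": {"HH"},
    "i": {"IH", "AY"},
    "p": {"P"},
    "c": {"K", "S"},
    "a": {"AE", "EY"},
    "t": {"T"},
    "e": {"EH", "IY"},
    "k": {"K"},
    "b": {"B"},
}

MAX_GRAPHEME_LEN = max(len(g) for g in RULES)


def align(word, phonemes):
    """Pair each phoneme with the letters that spell it.

    Returns a list of (grapheme, phoneme) tuples, or None when the
    rules cannot account for the whole word.
    """
    if not word and not phonemes:
        return []  # everything consumed: successful alignment
    if not word or not phonemes:
        return None  # letters or phonemes left over: dead end
    # Try longer graphemes first so "sh" is preferred over "s" + "h".
    for size in range(min(MAX_GRAPHEME_LEN, len(word)), 0, -1):
        grapheme = word[:size]
        if phonemes[0] in RULES.get(grapheme, ()):
            rest = align(word[size:], phonemes[1:])
            if rest is not None:  # backtrack if the remainder fails
                return [(grapheme, phonemes[0])] + rest
    return None


if __name__ == "__main__":
    print(align("ship", ["SH", "IH", "P"]))
    print(align("check", ["CH", "EH", "K"]))
    print(align("xyz", ["Z"]))  # no rule applies, so this prints None
```

This sketch deliberately ignores silent letters and graphemes that spell more than one phoneme; handling those is where extra rules and manual review would come in.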

Dominik


Dominik Lukes http://dominiklukes.net @techczech


rbracco commented 3 years ago

Thanks so much for sharing. I work in machine learning and may try to build a model that does the letter/phone correspondence so that we could expand the dictionary to any word. If I make any progress I'll be sure to let you know.

Best, Rob


techczech commented 3 years ago

Thanks, that would be great. Keep me posted.

Dominik
