Add support for detecting English phonemes

prosegrinder / python-prosegrinder

A relatively fast, functional prose text counter with readability scoring.

GNU General Public License v3.0


Add support for detecting English phonemes #12

Closed: yvlcmb closed this issue 2 years ago

yvlcmb commented 2 years ago

Adding a feature to prosegrinder to enable the detection of English phonemes might be really useful.

There are over 40 phonemes in English despite only 26 letters in its alphabet - here's one source: https://www.dyslexia-reading-well.com/44-phonemes-in-english.html.

I don't know exactly how this would work; perhaps it could be implemented as a new 'get_phoneme' method on the Dictionary class in the dictionary.py module. It would be nice if the feature could not only count the total number of phonemes but also output an iterable containing the specific phonemes that occur in a section of text/prose.
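To make the request concrete, here is a minimal sketch of the kind of interface described above: an iterable of phonemes plus a total count and per-phoneme tally. The names `TINY_LEXICON` and `iter_phones` are hypothetical and are not part of prosegrinder; the six-word lexicon is a hand-written stand-in in ARPAbet notation, where a real implementation would draw on a full pronouncing dictionary.

```python
import re
from collections import Counter

# Hypothetical stand-in lexicon in ARPAbet notation, stress digits omitted.
# A real lexicon would hold far more than six words.
TINY_LEXICON = {
    "all": ["AO", "L"],
    "that": ["DH", "AE", "T"],
    "glitters": ["G", "L", "IH", "T", "ER", "Z"],
    "is": ["IH", "Z"],
    "not": ["N", "AA", "T"],
    "gold": ["G", "OW", "L", "D"],
}


def iter_phones(text, lexicon=TINY_LEXICON):
    """Yield the phones of each known word in order; unknown words are skipped."""
    for word in re.findall(r"[a-z']+", text.lower()):
        yield from lexicon.get(word, [])


phones = list(iter_phones("All that glitters is not gold."))
phone_count = len(phones)          # total number of phones found
phone_frequency = Counter(phones)  # tally per distinct phone
print(phone_count, dict(phone_frequency))
```

Returning a generator keeps the "iterable of phonemes" part cheap for long texts, and the count and tally fall out of it directly.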

davidlday commented 2 years ago

This is doable but might take me a little while to get to it.

Here are some existing packages for phonemes (from a quick search for packages versioned 1.x or later):

phonemizer
gruut

I'll probably wind up incorporating one of these to do the underlying work, so if you test them out or have an opinion on which one works well for your use cases, let me know here.

yvlcmb commented 2 years ago

Thanks for the recommendations. I had never heard of either of those, but they both look pretty capable, so I'll try them out.

davidlday commented 2 years ago

Turns out CMUdict is still probably the best source of phones.

phonemizer has external (i.e. non-python) dependencies, which I want to avoid at all costs.
gruut is self-contained, but a quick query on its lexicon.db shows only 128870 entries. The CMUdict contains 135115 entries.

gruut does apparently have the ability to guess the pronunciation of words not in its lexicon, but I might be able to add something similar later. For now, I'm going to start with just cmudict and see how far that goes.
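As a rough illustration of the dictionary-only approach, the sketch below parses a few lines in an assumed CMUdict-style layout (a word, then space-separated ARPAbet phones with stress digits on vowels, and a `(2)`-style suffix for alternate pronunciations) and reports words it cannot find instead of guessing. `SAMPLE`, `parse_lexicon`, and `lookup` are hypothetical names for this example only and are not taken from prosegrinder or the phones branch.

```python
import re

# Sample lines in the assumed layout; a real file has over 100,000 entries.
SAMPLE = """\
gold G OW1 L D
crown K R AW1 N
that DH AE1 T
that(2) DH AH0 T
"""


def parse_lexicon(text):
    """Map each word to a list of pronunciations with stress digits removed."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        head, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", head)  # drop the alternate marker
        stripped = [p.rstrip("012") for p in phones]  # "OW1" -> "OW"
        lexicon.setdefault(word, []).append(stripped)
    return lexicon


def lookup(words, lexicon):
    """Return (phones, unknown), using the first pronunciation of each known word."""
    phones, unknown = [], []
    for word in words:
        if word in lexicon:
            phones.extend(lexicon[word][0])
        else:
            unknown.append(word)  # no guessing: just record the miss
    return phones, unknown


lexicon = parse_lexicon(SAMPLE)
phones, unknown = lookup(["that", "gold", "prosegrinder"], lexicon)
print(phones, unknown)
```

Collecting the misses separately leaves room to plug in a pronunciation guesser later without changing how known words are handled.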

davidlday commented 2 years ago

@slingload - I have a branch called phones that adds minimal support for this. If you get some time, would you test it out and let me know if it's close to what you need, please?

yvlcmb commented 2 years ago

I cloned phones and tested it out, and it seems to work nicely! Here's what I did; let me know if I was using it incorrectly:

>>> from prosegrinder import Prose
>>> quotes = [
...     "All that glitters is not gold.",
...     "Hell is empty and all the devils are here.",
...     "Uneasy lies the head that wears a crown.",
... ]
>>> text = ' '.join(quotes)
>>> p = Prose(text)
>>> p.phone_count
73
>>> p.phone_frequency
{'AO': 2,
 'L': 7,
 'DH': 4,
 'AE': 2,
 'T': 5,
 'G': 2,
 'IH': 3,
 'ER': 1,
 'Z': 7,
 'N': 4,
 'AA': 2,
 'OW': 1,
 'D': 4,
 'HH': 3,
 'EH': 5,
 'M': 1,
 'P': 1,
 'IY': 4,
 'AH': 6,
 'V': 1,
 'R': 4,
 'AY': 1,
 'W': 1,
 'K': 1,
 'AW': 1}

This is exactly the kind of easy interface I was hoping for, well done!
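For anyone repeating the test, one quick sanity check is that the per-phone tallies add up to the reported total. The sketch below uses only the values copied from the session above and does not import prosegrinder; the variable names `distinct` and `most_frequent` are just for this example.

```python
# Values copied from the session output above.
phone_count = 73
phone_frequency = {
    'AO': 2, 'L': 7, 'DH': 4, 'AE': 2, 'T': 5, 'G': 2, 'IH': 3, 'ER': 1,
    'Z': 7, 'N': 4, 'AA': 2, 'OW': 1, 'D': 4, 'HH': 3, 'EH': 5, 'M': 1,
    'P': 1, 'IY': 4, 'AH': 6, 'V': 1, 'R': 4, 'AY': 1, 'W': 1, 'K': 1,
    'AW': 1,
}

# The tallies should account for every phone counted.
total = sum(phone_frequency.values())
assert total == phone_count

# Number of distinct phones, and the phones tied for most frequent.
distinct = len(phone_frequency)
top = max(phone_frequency.values())
most_frequent = sorted(p for p, n in phone_frequency.items() if n == top)
print(total, distinct, most_frequent)
```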

davidlday commented 2 years ago

Closed by #14