mholtzscher / syllapy

Calculate syllable count for English words.
MIT License
34 stars 9 forks

syllable count seems not always correct #94

Open icezee opened 4 years ago

icezee commented 4 years ago

>>> import syllapy
>>> syllapy.count('feature')
2
>>> syllapy.count('features')
3

Hathaway2010 commented 3 years ago

I found a pretty good way of dealing with "es" and "ed" endings (and a couple other issues) using regular expressions! I'm extremely new to open-source, though — are you open to pull requests now?

(I'm thinking of using syllapy or something of the sort in a poetry analysis app!)

eyaler commented 2 years ago

@Hathaway2010 can you share your solution?

hszhai commented 2 years ago

I later switched to 'pronouncing' https://github.com/aparrish/gen-text-workshop/blob/master/cmu_pronouncing_dictionary_notes.md

I'll update the repo and add a license so it becomes more useful. Thanks for the feedback

Hathaway2010 commented 2 years ago

@eyaler belatedly: Here's a short and nasty one:

import re


def syllables(word):
    """Guess syllable count of word not in database.

    Parameters
    ----------
    word : str
        word not found in database

    Returns
    -------
    count : int
        estimated number of syllables

    See also
    --------
    tests/test_scan.py to clarify regular expressions
    """
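    # Each run of consecutive vowels starts out as one syllable.
    vowels_or_clusters = re.compile("[AEÉIOUaeéiouy]+")
    # Vowel sequences usually split across two syllables; each match adds one.
    # exceptions: Preus, Aida, poet, luau
    vowel_split = re.compile("[aiouy]é|ao|eo[^u]|ia[^n]|[^ct]ian|iet|io[^nu]|[^c]iu|[^gq]ua|[^gq]ue[lt]|[^q]uo|[aeiouy]ing|[aeiou]y[aiou]")
    # Word-final "e", treated as silent unless audible_final_e also matches.
    final_e = re.compile("e$")
    # "-ed" / "-es" endings where the "e" is usually silent.
    silent_final_ed_es = re.compile("[^aeiouydlrt]ed$|[^aeiouycghjlrsxz]es$|thes$|[aeiouylrw]led$|[aeiouylrw]les$|[aeiouyrw]res$|[aeiouyrw]red$")
    # Silent "e" before "-ly", as in "lonely".
    lonely = re.compile("[^aeiouy]ely$")
    # Patterns that keep a final "e" from being subtracted.
    audible_final_e = re.compile('[^aeiouylrw]le$|[^aeiouywr]re$|[aeioy]e|[^g]ue')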
    word_lower = word.lower()
    voc = re.findall(vowels_or_clusters, word_lower)
    count = len(voc)
    if final_e.search(word_lower) and not audible_final_e.search(word_lower):
        count -= 1
    if silent_final_ed_es.search(word_lower) or lonely.search(word_lower):
        count -= 1
    likely_splits = re.findall(vowel_split, word_lower)
    if likely_splits:
        count += len(likely_splits)
    if count == 0:
        count += 1
    return count

I wound up using this to guess any words not in Webster's Unabridged Dictionary from 1913, downloaded from Project Gutenberg and parsed into a database. Neither the dictionary nor this function is remotely infallible (the dictionary thinks the word "every" has three syllables, and the function doesn't know how to distinguish between "seneschal" -- three syllables -- and "sometimes" -- two), but I do think it's a refinement. I got the basic approach from syllapy and would be delighted to contribute this back to the repo :) If you want to see an expanded version that makes stronger efforts to be human readable, you can check out https://github.com/Hathaway2010/poetry-meter/blob/95d5fdbe7ffb8cde2191b4fd417010240060ea05/recurse_final.py#L89
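The lookup-then-guess arrangement described in this comment could be sketched roughly as below. This is only an illustration under stated assumptions: `count_syllables`, `known_counts`, and `crude_guess` are hypothetical names, the two dictionary entries are typed by hand rather than taken from the Webster's database, and `crude_guess` is a deliberately simple placeholder, not the regex function posted above.

```python
def count_syllables(word, known_counts, guess):
    """Return the stored count for `word`, or fall back to `guess(word)`."""
    key = word.lower()
    if key in known_counts:
        return known_counts[key]
    return guess(key)


# Hypothetical stand-in for the parsed dictionary; values typed by hand.
known_counts = {"seneschal": 3, "sometimes": 2}


def crude_guess(word):
    """Placeholder heuristic: count runs of consecutive vowels."""
    count = 0
    previous_was_vowel = False
    for ch in word:
        is_vowel = ch in "aeiouy"
        if is_vowel and not previous_was_vowel:
            count += 1
        previous_was_vowel = is_vowel
    # Every word is assumed to have at least one syllable.
    return max(count, 1)
```

With this shape, a word such as "sometimes" is answered from the table, and only unknown words reach the heuristic, so the heuristic's mistakes are limited to words the dictionary lacks.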

Hathaway2010 commented 2 years ago

> I later switched to 'pronouncing' https://github.com/aparrish/gen-text-workshop/blob/master/cmu_pronouncing_dictionary_notes.md
>
> I'll update the repo and add a license so it becomes more useful. Thanks for the feedback

"Pronouncing" looks splendid :) I should be using this too probably.

eyaler commented 2 years ago

i am using this table for some manual fixes: https://raw.githubusercontent.com/harrisj/nyt-haiku-python/master/nyt_haiku/data/syllable_counts.csv

in @mholtzscher's writeup for syllapy (https://mholtzscher.github.io/2018/05/29/syllables/), he mentions: "The closest thing I found was the CMU Pronouncing Dictionary. However, this database shows the phonemes for the words rather than syllables. In some cases the phonemes align with syllables but this is not always the case."

maybe @mholtzscher can advise regarding the issues you saw with CMU?

peterchinman commented 3 months ago

I know this is two years later, but I am curious about the phoneme/syllable misalignments @mholtzscher mentioned. I couldn't think of an example where counting the ARPAbet vowels from cmudict didn't give an accurate syllable count. (Though there are some instances where different pronunciations of the same word give competing syllable counts.)
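For readers unfamiliar with the approach, here is a minimal sketch of counting ARPAbet vowels. It relies on the cmudict convention that every vowel phone ends in a stress digit (0, 1 or 2). The `SAMPLE` table is hand-typed for illustration and is not loaded from the real dictionary file, and the function names are hypothetical.

```python
def syllables_from_phones(phones):
    """Count vowel phones in a space-separated ARPAbet string.

    Vowels carry a trailing stress digit, consonants do not, so counting
    the phones that end in a digit counts the vowels.
    """
    return sum(1 for phone in phones.split() if phone[-1].isdigit())


# Hand-typed sample entries in cmudict notation (illustrative only).
SAMPLE = {
    "feature": ["F IY1 CH ER0"],
    "features": ["F IY1 CH ER0 Z"],
    "every": ["EH1 V ER0 IY0", "EH1 V R IY0"],
}


def possible_counts(word):
    """Return the sorted distinct syllable counts over all listed pronunciations."""
    return sorted({syllables_from_phones(p) for p in SAMPLE[word.lower()]})
```

Returning a list rather than a single number keeps the "competing syllable counts" case visible: a word with two listed pronunciations can yield two counts, and the caller decides which to use.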

mholtzscher commented 3 months ago

Hi @peterchinman, I can't recall the exact issues I ran into with CMU, but if I remember correctly, CMU had more phonemes than syllables for some words. For the readability work I was doing, this would inflate the syllable count and so greatly affect the readability scores.
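To make the size of that effect concrete, here is a rough sketch. It rests on two assumptions the thread does not confirm: that every phoneme was counted where a syllable was expected, and that the score was Flesch Reading Ease, which is used here only as a familiar example. The ARPAbet string is hand-typed and the 10-word text is made up.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease from raw counts (higher means easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Hand-typed ARPAbet transcription of "feature" (illustrative only).
phones = "F IY1 CH ER0".split()
phoneme_count = len(phones)                          # counts every phone
vowel_count = sum(p[-1].isdigit() for p in phones)   # counts vowels only

# A made-up text: 10 words in 1 sentence, each word shaped like "feature".
score_by_vowels = flesch_reading_ease(10, 1, 10 * vowel_count)
score_by_phonemes = flesch_reading_ease(10, 1, 10 * phoneme_count)

print(f"vowel-based score:   {score_by_vowels:.3f}")
print(f"phoneme-based score: {score_by_phonemes:.3f}")
```

Because the syllables-per-word term has a large weight in this formula, doubling the per-word count pushes the score far below the vowel-based figure, which fits the inflation described above.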