icezee opened this issue 4 years ago
I found a pretty good way of dealing with "es" and "ed" endings (and a couple other issues) using regular expressions! I'm extremely new to open-source, though — are you open to pull requests now?
(I'm thinking of using syllapy or something of the sort in a poetry analysis app!)
@Hathaway2010 can you share your solution?
@eyaler belatedly: Here's a short and nasty one:
import re

def syllables(word):
    """Guess syllable count of a word not in the database.

    Parameters
    ----------
    word : str
        Word not found in the database.

    Returns
    -------
    count : int
        Estimated number of syllables.

    See also
    --------
    tests/test_scan.py to clarify the regular expressions.
    """
    vowels_or_clusters = re.compile("[AEÉIOUaeéiouy]+")
    # Vowel clusters that usually split into two syllables.
    # Known exceptions: Preus, Aida, poet, luau.
    vowel_split = re.compile(
        "[aiouy]é|ao|eo[^u]|ia[^n]|[^ct]ian|iet|io[^nu]|[^c]iu"
        "|[^gq]ua|[^gq]ue[lt]|[^q]uo|[aeiouy]ing|[aeiou]y[aiou]"
    )
    final_e = re.compile("e$")
    silent_final_ed_es = re.compile(
        "[^aeiouydlrt]ed$|[^aeiouycghjlrsxz]es$|thes$"
        "|[aeiouylrw]led$|[aeiouylrw]les$|[aeiouyrw]res$|[aeiouyrw]red$"
    )
    lonely = re.compile("[^aeiouy]ely$")
    audible_final_e = re.compile("[^aeiouylrw]le$|[^aeiouywr]re$|[aeioy]e|[^g]ue")

    word_lower = word.lower()
    # Start with one syllable per vowel or vowel cluster.
    count = len(vowels_or_clusters.findall(word_lower))
    # Subtract silent endings.
    if final_e.search(word_lower) and not audible_final_e.search(word_lower):
        count -= 1
    if silent_final_ed_es.search(word_lower) or lonely.search(word_lower):
        count -= 1
    # Add a syllable for each cluster that likely splits in two.
    count += len(vowel_split.findall(word_lower))
    # Every word has at least one syllable.
    if count == 0:
        count = 1
    return count
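A couple of spot checks on the "ed"/"es" handling (these are the counts I get from tracing the regexes by hand, so worth verifying against the tests):

    >>> syllables("asked")     # silent "ed": two vowel groups, minus one
    1
    >>> syllables("syllable")  # audible final "le" after a consonant, so no subtraction
    3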
I wound up using this to guess the syllable count of any word not in Webster's Unabridged Dictionary from 1913, which I downloaded from Project Gutenberg and parsed into a database. Neither the dictionary nor this function is remotely infallible (the dictionary thinks "every" has three syllables, and the function can't distinguish "seneschal", three syllables, from "sometimes", two), but I do think it's a refinement. I got the basic approach from syllapy and would be delighted to contribute this back to the repo :) If you want an expanded version that makes a stronger effort to be human-readable, check out https://github.com/Hathaway2010/poetry-meter/blob/95d5fdbe7ffb8cde2191b4fd417010240060ea05/recurse_final.py#L89
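The overall flow is just dictionary-first with the function above as the fallback, roughly like this (a toy sketch; WEBSTER_SYLLABLES stands in for my parsed database, which is really a full lookup table built from the Gutenberg text):

    # Toy stand-in for the parsed 1913 Webster's database; the real thing
    # maps every headword to a syllable count.
    WEBSTER_SYLLABLES = {"every": 3}  # yes, the dictionary really says 3

    def syllable_count(word):
        """Prefer the dictionary's count; fall back to the regex guess."""
        return WEBSTER_SYLLABLES.get(word.lower(), syllables(word))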
I later switched to 'pronouncing' https://github.com/aparrish/gen-text-workshop/blob/master/cmu_pronouncing_dictionary_notes.md
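Basic usage looks something like this (a sketch from memory; as I recall, phones_for_word returns the CMUdict pronunciations and syllable_count counts their vowel phones, but double-check pronouncing's docs):

    import pronouncing

    def count_syllables(word):
        """Syllable count of the first CMUdict pronunciation, or None."""
        phones = pronouncing.phones_for_word(word.lower())
        if not phones:
            return None  # word not in CMUdict
        return pronouncing.syllable_count(phones[0])

    print(count_syllables("poetry"))  # 3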
I'll update the repo and add a license so it becomes more useful. Thanks for the feedback.
"Pronouncing" looks splendid :) I should be using this too probably.
I am using this table for some manual fixes: https://raw.githubusercontent.com/harrisj/nyt-haiku-python/master/nyt_haiku/data/syllable_counts.csv
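Loading it as an override table is something like this (a sketch; I'm assuming a simple word,count column layout with a header row, so adjust to the actual file):

    import csv

    # Read the manual-fix table into a dict of per-word overrides.
    overrides = {}
    with open("syllable_counts.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row, if the file has one
        for word, count in reader:
            overrides[word.lower()] = int(count)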
In @mholtzscher's writeup for syllapy (https://mholtzscher.github.io/2018/05/29/syllables/), he mentions: "The closest thing I found was the CMU Pronouncing Dictionary. However, this database shows the phonemes for the words rather than syllables. In some cases the phonemes align with syllables but this is not always the case."
Maybe @mholtzscher can advise regarding the issues you saw with CMU?
I know this is two years later, but I am curious about @mholtzscher's phoneme/syllable misalignments. I couldn't think of an example where counting the ARPAbet vowels from the cmudict didn't give an accurate syllable count. (Though some words do have competing syllable counts across different pronunciations.)
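For concreteness, here's roughly what I mean (using the cmudict package; treat the exact API as from memory):

    # Count ARPAbet vowel phones, which carry a stress digit (0, 1, or 2),
    # to estimate syllables. Assumes the `cmudict` PyPI package.
    import cmudict

    CMU = cmudict.dict()  # word -> list of pronunciations (lists of phones)

    def arpabet_syllables(word):
        """Return one syllable count per pronunciation of `word`."""
        return [
            sum(phone[-1].isdigit() for phone in phones)
            for phones in CMU.get(word.lower(), [])
        ]

    print(arpabet_syllables("tomato"))  # e.g. [3, 3] for both pronunciations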
Hi @peterchinman, I can't recall the exact issues I ran into with CMU, but if I remember correctly, it usually had more phonemes than syllables for some words. For the readability work I was doing, this inflated the syllable counts and greatly skewed the readability scores.
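For example, counting every phone overshoots badly; only the vowel phones line up with syllables (a sketch with pronouncing, same caveats as above):

    import pronouncing

    phones = pronouncing.phones_for_word("strengths")[0]
    print(phones)                              # e.g. 'S T R EH1 NG K TH S'
    print(len(phones.split()))                 # many phonemes...
    print(pronouncing.syllable_count(phones))  # ...but 1 syllable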