Closed MarvinLvn closed 5 years ago
The next step is to create a phonological version to count syllables. To do so, we could rely on the phonemizer package: https://github.com/bootphon/phonemizer
For English, the FESTIVAL option will return syllable boundaries as ";esyll":
$ echo "hello world" | phonemize -l en-us-festival
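Once the output carries ";esyll" markers, counting syllables comes down to counting those markers. A minimal sketch, assuming one marker per syllable; the sample string is made up for illustration and is not captured phonemizer output, and `count_esyll` is a hypothetical helper name:

```python
def count_esyll(phonemized: str, boundary: str = ";esyll") -> int:
    """Count syllables as the number of boundary markers in the string.

    Assumes every syllable is terminated by exactly one marker.
    """
    return phonemized.count(boundary)


# Made-up input for illustration only (not real phonemizer output).
sample = "hh ax ;esyll l ow ;esyll w er l d ;esyll"
print(count_esyll(sample))
```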
An alternative for languages where only vowels are nuclei is to use the following code: https://github.com/laiafr/SegCatSpa/blob/master/Analysis_Pipeline/CatalanSpanish_recipe/catspa-syllabify-corpus.pl
It requires a list of vowels, which needs to be provided for each language, but my guess is that for both Tseltal and Spanish this list will do: aeiou
It also requires a list of "permissible onsets". The way to generate this is by:
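Independently of how the onset list is built, here is a rough sketch of how a vowel list and an onset list can be combined to syllabify a word, using the maximal onset principle. This is not the Perl script linked above; `syllabify` and the small onset set in the usage line are hypothetical, and adjacent vowels are treated as separate nuclei (diphthongs are not handled):

```python
def syllabify(word, vowels="aeiou", onsets=frozenset()):
    """Split a word into syllables, with one vowel nucleus per syllable.

    Between two nuclei, the longest consonant cluster suffix found in
    `onsets` goes to the following syllable; the rest stays as coda.
    """
    nuclei = [i for i, ch in enumerate(word) if ch in vowels]
    if not nuclei:
        return [word] if word else []

    boundaries = []
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = word[prev + 1:nxt]
        # Try the longest candidate onset first, then shorter ones.
        for k in range(len(cluster) + 1):
            if cluster[k:] == "" or cluster[k:] in onsets:
                boundaries.append(prev + 1 + k)
                break

    starts = [0] + boundaries
    ends = boundaries + [len(word)]
    return [word[s:e] for s, e in zip(starts, ends)]


# Hypothetical onset list, just large enough for this example.
print(syllabify("entrada", onsets={"t", "tr", "d"}))
```

The syllable count for a word is then simply `len(syllabify(word, ...))`.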
Update : Done
Example of conversion, from :
37461442 37462869 okay, I'm gonna refill mine then. MA1
to :
37461442 37462869 MA1 okay I'm gonna refill mine then 6 owk-ey- aym- gaan-ax- riyf-ihl- mayn- dhehn- 9
where 6 indicates the number of words, and 9 the number of syllables. It should handle human-made errors, and it should remove all the non-relevant information. Let me know if you notice any bugs (like things that shouldn't be removed, or things that should be removed from the original eaf transcription).
The script which converts a folder containing eaf files to their enriched txt version is tools/eaf2enriched_txt.sh. It should be easy to use (let me know if there is something unclear!).
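To make the two counts in the example above concrete, here is a small sketch of how they can be recomputed. It is not the code in tools/eaf2enriched_txt.sh; it assumes the input layout `onset offset text speaker`, that every syllable in the syllabified string ends with "-", and the helper names are hypothetical:

```python
import re


def clean_line(raw):
    """Parse 'onset offset text... speaker' and strip punctuation from words."""
    onset, offset, *tokens = raw.split()
    speaker = tokens.pop()  # assumption: the speaker code is the last field
    # Keep letters, digits and apostrophes; drop other punctuation.
    words = [re.sub(r"[^\w']", "", t) for t in tokens]
    words = [w for w in words if w]
    return int(onset), int(offset), speaker, words


def count_hyphen_syllables(syllabified):
    """Count syllables, assuming each one is terminated by a '-'."""
    return syllabified.count("-")


raw = "37461442 37462869 okay, I'm gonna refill mine then. MA1"
onset, offset, speaker, words = clean_line(raw)
print(speaker, " ".join(words), len(words))
print(count_hyphen_syllables("owk-ey- aym- gaan-ax- riyf-ihl- mayn- dhehn-"))
```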
Next step :
Implementing the following pipeline :
[.eaf] -> [.csv with onset, offset, ortho from .eaf, speaker tier] -> [cleanup for .csv] -> [word counts + syllable counts]
Okko’s MATLAB scripts can be used for the current cleanup step.
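The first stage of that pipeline ([.eaf] -> [.csv]) could be sketched as follows. This is only an illustration under assumptions: it reads time-aligned annotations (`ALIGNABLE_ANNOTATION`) and skips reference annotations, it treats the tier ID as the speaker tier, and the function names and CSV column names are hypothetical:

```python
import csv
import io
import xml.etree.ElementTree as ET


def eaf_to_rows(eaf_xml):
    """Return (onset, offset, ortho, speaker_tier) tuples from EAF XML text."""
    root = ET.fromstring(eaf_xml)
    # Map time slot IDs to millisecond values; unaligned slots are skipped.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            onset = slots.get(ann.get("TIME_SLOT_REF1"))
            offset = slots.get(ann.get("TIME_SLOT_REF2"))
            if onset is None or offset is None:
                continue
            text = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            rows.append((onset, offset, text, tier.get("TIER_ID")))
    return sorted(rows)


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["onset", "offset", "ortho", "speaker_tier"])
    writer.writerows(rows)
    return buf.getvalue()


# Tiny hand-written EAF fragment, reusing the example utterance above.
SAMPLE = """<ANNOTATION_DOCUMENT>
<TIME_ORDER>
<TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="37461442"/>
<TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="37462869"/>
</TIME_ORDER>
<TIER TIER_ID="MA1">
<ANNOTATION>
<ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
<ANNOTATION_VALUE>okay, I'm gonna refill mine then.</ANNOTATION_VALUE>
</ALIGNABLE_ANNOTATION>
</ANNOTATION>
</TIER>
</ANNOTATION_DOCUMENT>"""

print(rows_to_csv(eaf_to_rows(SAMPLE)))
```

The cleanup and counting stages would then operate on the rows of this CSV rather than on the .eaf file directly.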