srvk / DiViMe

ACLEW Diarization Virtual Machine
Apache License 2.0

eaf to okko's format #63

Closed MarvinLvn closed 5 years ago

MarvinLvn commented 6 years ago

Implementing the following pipeline:

[.eaf] -> [.csv with onset, offset, ortho from .eaf, speaker tier] -> [cleanup of the .csv] -> [word counts + syllable counts]
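
For illustration, the first stage of this pipeline could be sketched as below. This is not the script from the repository: the function name `eaf_to_rows` and the tiny sample document are made up for the example, and the sketch assumes the usual ELAN element names (`TIME_SLOT`, `TIER`, `ALIGNABLE_ANNOTATION`, `ANNOTATION_VALUE`) with time values in milliseconds.

```python
import xml.etree.ElementTree as ET

# Hand-written, minimal .eaf-like document (hypothetical, for illustration only).
SAMPLE_EAF = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="37461442"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="37462869"/>
  </TIME_ORDER>
  <TIER TIER_ID="MA1">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>okay, I'm gonna refill mine then.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def eaf_to_rows(eaf_xml):
    """Return (onset, offset, ortho, tier) tuples, sorted by onset then offset."""
    root = ET.fromstring(eaf_xml)

    # Map each time slot id to its time value; unaligned slots have no value.
    times = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        if value is not None:
            times[slot.get("TIME_SLOT_ID")] = int(value)

    rows = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            onset = times.get(ann.get("TIME_SLOT_REF1"))
            offset = times.get(ann.get("TIME_SLOT_REF2"))
            if onset is None or offset is None:
                continue  # skip annotations without usable timestamps
            ortho = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            rows.append((onset, offset, ortho, tier_id))

    rows.sort(key=lambda row: (row[0], row[1]))
    return rows


for row in eaf_to_rows(SAMPLE_EAF):
    print(*row, sep="\t")
```

Each tuple can then be written out as one row of the .csv before the cleanup step.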

Okko’s MATLAB scripts can be used for the current cleanup step.

alecristia commented 5 years ago

The next step is to create a phonological version in order to count syllables. To do so, we could rely on the phonemizer package (https://github.com/bootphon/phonemizer). For English, the FESTIVAL option will return syllable boundaries marked as ";esyll":

$ echo "hello world" | phonemize -l en-us-festival
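
Once the phonemized string is available, counting syllables could be as simple as counting the ";esyll" markers. The sketch below assumes only that each syllable is closed by one such marker; the sample string is hand-written for the example and is not actual phonemizer output, and `count_esyll` is a made-up name.

```python
def count_esyll(phonemized, syllable_marker=";esyll"):
    """Count syllables as the number of syllable-boundary markers."""
    return phonemized.count(syllable_marker)


# Hypothetical phonemized string (not copied from a real phonemizer run).
sample = "hh ax ;esyll l ow ;esyll w er l d ;esyll"
print(count_esyll(sample))
```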

An alternative for languages where only vowels are nuclei is to use the following code: https://github.com/laiafr/SegCatSpa/blob/master/Analysis_Pipeline/CatalanSpanish_recipe/catspa-syllabify-corpus.pl

It requires a list of vowels, which needs to be provided for each language, but my guess is that for both Tseltal and Spanish this list will do: aeiou

It also requires a list of "permissible onsets". The way to generate this is by:

  1. get all the words in the corpus (e.g. tr ' ' '\n' | sort | uniq)
  2. remove everything following the first vowel in the word (e.g. sed 's/[aeiou].*//g')
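
The two steps above, plus a simple use of the resulting onset list, can be sketched in Python as follows. This is not a port of the Perl script linked above: `permissible_onsets` and `syllabify` are made-up names, the toy corpus is invented, and the syllabifier makes the simplifying assumptions that every vowel letter is its own nucleus (so diphthongs are split) and that the longest permissible onset is attached to the following vowel.

```python
import re

VOWELS = "aeiou"  # the list guessed in the thread for Tseltal and Spanish


def permissible_onsets(corpus_text, vowels=VOWELS):
    """Steps 1 and 2: unique words, then drop the first vowel and what follows."""
    first_vowel_onward = re.compile("[" + re.escape(vowels) + "].*")
    words = set(corpus_text.split())  # unique whitespace-separated words
    # As with the sed command, a word without any vowel is kept whole.
    return {first_vowel_onward.sub("", word) for word in words}


def syllabify(word, onsets, vowels=VOWELS):
    """Split a word so each vowel takes the longest permissible onset before it."""
    nuclei = [i for i, ch in enumerate(word) if ch in vowels]
    if not nuclei:
        return [word]

    boundaries = []
    for prev, nxt in zip(nuclei, nuclei[1:]):
        split = nxt  # default: the next syllable starts at its vowel
        # Try the longest consonant cluster first, then shorter ones.
        for start in range(prev + 1, nxt):
            if word[start:nxt] in onsets:
                split = start
                break
        boundaries.append(split)

    syllables = []
    start = 0
    for boundary in boundaries:
        syllables.append(word[start:boundary])
        start = boundary
    syllables.append(word[start:])
    return syllables


# Invented toy corpus, for illustration only.
onsets = permissible_onsets("casa sol tres drama padre casa")
print(sorted(onsets))
print(syllabify("padre", onsets))
print(len(syllabify("estres", onsets)))
```

The syllable count of a word is then just the length of the list returned by `syllabify`.
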

MarvinLvn commented 5 years ago

Update: done.

Example of conversion, from:

37461442 37462869 okay, I'm gonna refill mine then. MA1

to:

37461442 37462869 MA1 okay I'm gonna refill mine then 6 owk-ey- aym- gaan-ax- riyf-ihl- mayn- dhehn- 9

Here 6 is the number of words and 9 is the number of syllables. The script should handle human-made annotation errors and remove all irrelevant information. Let me know if you notice any bugs (such as material that is removed but shouldn't be, or material from the original eaf transcription that should be removed but isn't).
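
To make the two counts concrete, here is a rough sketch that reproduces the 6 and the 9 from the example line. It is not the logic of the actual script: the cleanup rule (drop punctuation other than apostrophes) is an assumption for the example, the function names are made up, and the syllable count simply relies on each syllable ending in "-" as in the output shown above.

```python
import re


def clean_transcription(ortho):
    """Assumed cleanup: replace punctuation (except apostrophes) with spaces."""
    cleaned = re.sub(r"[^\w\s'’]", " ", ortho)
    return " ".join(cleaned.split())  # collapse repeated whitespace


def count_words(cleaned):
    """Count whitespace-separated words."""
    return len(cleaned.split())


def count_syllable_marks(syllabified):
    """Count syllables, assuming each one is terminated by '-'."""
    return syllabified.count("-")


ortho = "okay, I'm gonna refill mine then."
phono = "owk-ey- aym- gaan-ax- riyf-ihl- mayn- dhehn-"

cleaned = clean_transcription(ortho)
print(cleaned, count_words(cleaned), phono, count_syllable_marks(phono))
```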

The script that converts a folder of eaf files to their enriched txt versions is tools/eaf2enriched_txt.sh. It should be easy to use (let me know if anything is unclear!).

Next step :