srvk / DiViMe

ACLEW Diarization Virtual Machine
Apache License 2.0

Request for general eaf2txt #120

Closed marisacasillas closed 5 years ago

marisacasillas commented 5 years ago

In its current form, eaf2txt.py is specific to Okko's needs. It'd be useful to have a more general one that mimics the native ELAN tab-delimited output. In a nutshell, the current converter script gives the following columns for each non-CHI utterance:

onset  offset  xds  transcription  speaker
0      20      A    hello.         MA1
45     80      B    hey.           FA1
45     78      T    baby.          MA1

Instead what I'd like to see for a more general use case is one of the following two options:

Option 1: a row for each utterance...

speaker  onset  offset  xds  vcm  lex  mwu  transcription
MA1      0      20      A    NA   NA   NA   hello.
FA1      45     80      B    NA   NA   NA   hey.
MA1      45     75      T    NA   NA   NA   baby.
CHI      100    160     NA   C    W    M    dis [: what's this] Daddy?

(where there are NAs if there is no value of that type)
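To make the Option 1 layout concrete, here is a minimal sketch of how per-annotation rows could be collapsed into one row per utterance, with NA filling the gaps. It is not the code in eaf2txt.py. The function name `pivot_to_utterances` and the `DEPENDENT_COLUMNS` list are hypothetical, and the sketch assumes that a dependent annotation (e.g. on `xds@MA1`) shares the exact onset and offset of the utterance it belongs to, as in the examples above.

```python
# Dependent tier prefixes that become columns in the one-row-per-utterance
# layout (hypothetical list, taken from the column names in this issue).
DEPENDENT_COLUMNS = ["xds", "vcm", "lex", "mwu"]


def pivot_to_utterances(annotation_rows):
    """Collapse (speaker, tier, onset, offset, value) rows into Option 1 rows.

    Assumes a dependent annotation has the same onset and offset as the
    utterance it annotates, so (speaker, onset, offset) identifies an utterance.
    """
    utterances = {}  # dicts keep insertion order, so output follows input order
    for speaker, tier, onset, offset, value in annotation_rows:
        key = (speaker, onset, offset)
        row = utterances.setdefault(
            key, {"transcription": "NA", **{c: "NA" for c in DEPENDENT_COLUMNS}}
        )
        if tier == speaker:
            # The speaker's own tier holds the transcription.
            row["transcription"] = value
        else:
            # Tiers such as "xds@MA1" name the column before the "@".
            prefix = tier.split("@", 1)[0]
            if prefix in DEPENDENT_COLUMNS and value != "":
                row[prefix] = value
    return [
        [speaker, onset, offset]
        + [row[c] for c in DEPENDENT_COLUMNS]
        + [row["transcription"]]
        for (speaker, onset, offset), row in utterances.items()
    ]


def to_tsv_line(row):
    """Join one output row with tabs, as in ELAN's tab-delimited export."""
    return "\t".join(str(cell) for cell in row)
```

Each returned row follows the column order speaker, onset, offset, xds, vcm, lex, mwu, transcription, and `to_tsv_line` turns it into one tab-delimited line.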

Option 2: a row for each annotation... (this is the actual ELAN mimic case)

speaker  tier     onset  offset  value
MA1      MA1      0      20      hello.
MA1      xds@MA1  0      20      A
FA1      FA1      45     80      hey.
FA1      xds@FA1  45     80      B
MA1      MA1      45     75      baby.
MA1      xds@MA1  45     75      T
CHI      CHI      100    160     dis [: what's this] Daddy?
CHI      vcm@CHI  100    160     C
CHI      lex@CHI  100    160     W
CHI      mwu@CHI  100    160     M

(in this case, value gets the empty string when there's nothing there)
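For illustration, the Option 2 layout could be produced by reading the .eaf XML directly with Python's standard library, as in the rough sketch below. This is not how eaf2txt.py works. The function name `eaf_to_annotation_rows` is hypothetical, and the sketch assumes that the speaker column should come from each tier's PARTICIPANT attribute and that dependent annotations inherit their times from the annotation they reference.

```python
import xml.etree.ElementTree as ET


def eaf_to_annotation_rows(eaf_source):
    """Return (speaker, tier, onset_ms, offset_ms, value) tuples, one per annotation.

    eaf_source is a path or a binary file object containing EAF XML.
    Assumption: the speaker is taken from the tier's PARTICIPANT attribute.
    """
    root = ET.parse(eaf_source).getroot()

    # Time slots map an ID such as "ts1" to a time in milliseconds.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }

    # Time-aligned annotations carry their own slot references.
    times = {
        ann.get("ANNOTATION_ID"): (
            slots.get(ann.get("TIME_SLOT_REF1")),
            slots.get(ann.get("TIME_SLOT_REF2")),
        )
        for ann in root.iter("ALIGNABLE_ANNOTATION")
    }
    # Reference annotations point at a parent annotation instead.
    refs = {
        ann.get("ANNOTATION_ID"): ann.get("ANNOTATION_REF")
        for ann in root.iter("REF_ANNOTATION")
    }

    def resolve(ann_id):
        # Follow the reference chain up to a time-aligned ancestor.
        while ann_id in refs:
            ann_id = refs[ann_id]
        return times.get(ann_id, (None, None))

    rows = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        speaker = tier.get("PARTICIPANT", "")
        for wrapper in tier.findall("ANNOTATION"):
            if len(wrapper) == 0:
                continue
            ann = wrapper[0]  # ALIGNABLE_ANNOTATION or REF_ANNOTATION
            onset, offset = resolve(ann.get("ANNOTATION_ID"))
            # An annotation with no text gets the empty string.
            value = ann.findtext("ANNOTATION_VALUE") or ""
            rows.append((speaker, tier_id, onset, offset, value))
    return rows
```

Rows come out in tier order rather than interleaved by utterance, so a caller wanting the ordering shown above would need to sort them afterwards.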

fmetze commented 5 years ago

Do you have an example .eaf file with the desired output?

(In reply to Marisa Casillas's post of Apr 30, 2019, 6:17 AM.)

marisacasillas commented 5 years ago

Example .eaf (i.e., input file): https://www.dropbox.com/s/ioayaw8p1d35ea5/2337-0GS0.eaf?dl=0
Example .txt (i.e., output file): https://www.dropbox.com/s/jt0j14fwygej9kc/2337-0GS0.txt?dl=0

This is an example of Option 2, as described above, where each annotation (not each utterance!) gets a row in the output file. It is parallel to what you get if you manually export a tab-delimited text from ELAN and so should minimize confusion from people using both techniques to convert within a single lab.

fmetze commented 5 years ago

Pull the latest version and try it. I added code that can produce output of the first kind. You can run it with:

python utils/eaf2txt.py -i 2337-0GS0.eaf -f marisa

Let me know what you want the format switch to be called. Right now the options are "-f marisa" and "-f okko" (the default); there are probably better names.

The example that you gave me seems to be file format v3.0, which the pympi library cannot process. I manually edited 2337-0GS0.eaf to say "version 2.8", and it seems to work, but that is probably not really sufficient. We may need to do more testing.

fmetze commented 5 years ago

No more activity; closing for now.