Request for general eaf2txt

marisacasillas commented 5 years ago

In its current form, eaf2txt.py is specific to Okko's needs. It'd be useful to have a more general one that mimics the native ELAN tab-delimited output. In a nutshell, the current converter script gives the following columns for each non-CHI utterance:

onset	offset	xds	transcription	speaker
0	20	A	hello.	MA1
45	80	B	hey.	FA1
45	78	T	baby.	MA1

Instead, what I'd like to see for a more general use case is one of the following two options:

Option 1: a row for each utterance...

speaker	onset	offset	xds	vcm	lex	mwu	transcription
MA1	0	20	A	NA	NA	NA	hello.
FA1	45	80	B	NA	NA	NA	hey.
MA1	45	75	T	NA	NA	NA	baby.
CHI	100	160	NA	C	W	M	dis [: what's this] Daddy?

(where there are NAs if there is no value of that type)
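
A minimal sketch of how the Option 1 rows could be assembled, assuming the converter already has per-annotation records and that dependent tiers are named `type@speaker` as in the examples in this issue. The function name `to_utterance_rows` and the `(tier, onset, offset, value)` tuple layout are hypothetical and are not taken from eaf2txt.py.

```python
from collections import OrderedDict

# Dependent-tier types that become columns (taken from the example header).
COLUMNS = ["xds", "vcm", "lex", "mwu"]


def to_utterance_rows(annotations, columns=COLUMNS, missing="NA"):
    """Pivot (tier, onset, offset, value) records into one row per utterance.

    Assumption for this sketch: a dependent tier is named '<type>@<speaker>'
    and shares the onset and offset of the utterance it annotates.
    """
    utterances = OrderedDict()
    for tier, onset, offset, value in annotations:
        kind, sep, speaker = tier.partition("@")
        if not sep:
            # No '@' in the tier name: this is the speaker's transcription tier.
            speaker, kind = tier, "transcription"
        utterances.setdefault((speaker, onset, offset), {})[kind] = value

    rows = []
    for (speaker, onset, offset), cells in utterances.items():
        row = [speaker, str(onset), str(offset)]
        # Missing or empty values are written as NA.
        row += [cells.get(col) or missing for col in columns]
        row.append(cells.get("transcription") or missing)
        rows.append("\t".join(row))
    return rows
```

Rows come out in the order in which each utterance is first seen; any sorting by onset would be a separate step.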

Option 2: a row for each annotation... (this is the actual ELAN mimic case)

speaker	tier	onset	offset	value
MA1	MA1	0	20	hello.
MA1	xds@MA1	0	20	A
FA1	FA1	45	80	hey.
FA1	xds@FA1	45	80	B
MA1	MA1	45	75	baby.
MA1	xds@MA1	45	75	T
CHI	CHI	100	160	dis [: what's this] Daddy?
CHI	vcm@CHI	100	160	C
CHI	lex@CHI	100	160	W
CHI	mwu@CHI	100	160	M

(in this case, value gets the empty string when there's nothing there)
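
As a rough illustration of the per-annotation case, the sketch below walks the .eaf XML directly with the Python standard library rather than through pympi. It relies on the element and attribute names of the EAF format (`TIME_SLOT`, `TIER`, `ALIGNABLE_ANNOTATION`, `REF_ANNOTATION`, `ANNOTATION_VALUE`). The function name `annotation_rows`, the fallback that derives the speaker from the text after `@` when a tier has no `PARTICIPANT` attribute, and the choice to keep tiers in document order are assumptions of this sketch, not a description of how eaf2txt.py or ELAN's own export behaves.

```python
import xml.etree.ElementTree as ET


def annotation_rows(root):
    """Return (speaker, tier, onset, offset, value) tuples, one per annotation."""
    # Map time-slot IDs to millisecond values (None for unaligned slots).
    slots = {}
    for ts in root.iter("TIME_SLOT"):
        value = ts.get("TIME_VALUE")
        slots[ts.get("TIME_SLOT_ID")] = int(value) if value is not None else None

    times = {}    # annotation ID -> (onset, offset) for time-aligned annotations
    parents = {}  # annotation ID -> ID of the annotation it refers to
    found = []    # (speaker, tier ID, annotation ID, value) in document order
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        # Assumption: without PARTICIPANT, use the text after '@' in the tier ID.
        speaker = tier.get("PARTICIPANT") or tier_id.split("@")[-1]
        for ann in tier.iter("ANNOTATION"):
            node = ann[0]  # ALIGNABLE_ANNOTATION or REF_ANNOTATION
            ann_id = node.get("ANNOTATION_ID")
            if node.tag == "ALIGNABLE_ANNOTATION":
                times[ann_id] = (slots.get(node.get("TIME_SLOT_REF1")),
                                 slots.get(node.get("TIME_SLOT_REF2")))
            else:
                parents[ann_id] = node.get("ANNOTATION_REF")
            # Empty annotations get the empty string as their value.
            found.append((speaker, tier_id, ann_id,
                          node.findtext("ANNOTATION_VALUE") or ""))

    rows = []
    for speaker, tier_id, ann_id, value in found:
        # Dependent annotations inherit their times from the aligned ancestor.
        while ann_id in parents:
            ann_id = parents[ann_id]
        onset, offset = times.get(ann_id, (None, None))
        rows.append((speaker, tier_id, onset, offset, value))
    return rows
```

For a file on disk, the root would come from `ET.parse(path).getroot()`; each tuple can then be joined with tabs to give the `speaker tier onset offset value` layout shown above.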

fmetze commented 5 years ago

Do you have an example .eaf file with the desired output?

marisacasillas commented 5 years ago

Example .eaf (i.e., input file): https://www.dropbox.com/s/ioayaw8p1d35ea5/2337-0GS0.eaf?dl=0
Example .txt (i.e., output file): https://www.dropbox.com/s/jt0j14fwygej9kc/2337-0GS0.txt?dl=0

This is an example of Option 2, as described above, where each annotation (not each utterance!) gets a row in the output file. It is parallel to what you get if you manually export a tab-delimited text from ELAN and so should minimize confusion from people using both techniques to convert within a single lab.

fmetze commented 5 years ago

Pull the latest version and try it; I added code that can produce output of the first kind. You can run it with:

python utils/eaf2txt.py -i 2337-0GS0.eaf -f marisa

Let me know what you want the name of the format switch to be. Right now the options are "-f marisa" and "-f okko" (the default); there are probably better names.

The example that you gave me seems to be file format v3.0, which the pympi library cannot process. I manually edited 2337-0GS0.eaf to say "version 2.8", and it seems to work, but that is probably not really sufficient. We may need to do more testing.
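
The manual edit described above could be scripted roughly as follows. This is only a sketch of the workaround and not a real conversion between format versions: it rewrites the `VERSION` attribute on the `ANNOTATION_DOCUMENT` root element and leaves everything else untouched. Which attribute was actually edited by hand is not stated in this thread, so targeting `VERSION` is an assumption, and the function name `downgrade_version` is hypothetical.

```python
import re

# Matches the VERSION attribute inside the opening ANNOTATION_DOCUMENT tag.
_VERSION_RE = re.compile(r'(<ANNOTATION_DOCUMENT\b[^>]*?\bVERSION=")[^"]*(")')


def downgrade_version(eaf_text, version="2.8"):
    """Return the .eaf text with the root VERSION attribute set to `version`.

    Only the attribute value changes; any 3.0-specific content in the file
    is left as it is, so this may not be sufficient for every file.
    """
    return _VERSION_RE.sub(r"\g<1>" + version + r"\g<2>", eaf_text, count=1)
```

Writing the result to a copy of the file, rather than overwriting the original, keeps the unmodified 3.0 version available for ELAN itself.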

fmetze commented 5 years ago

no more activity, closing for now

srvk / DiViMe

Request for general eaf2txt #120