Closed marisacasillas closed 5 years ago
Do you have an example .eaf file with the desired output?
On Apr 30, 2019, at 6:17 AM, Marisa Casillas notifications@github.com wrote:
In its current form, eaf2txt.py is specific to Okko's needs. It'd be useful to have a more general one that mimics the native ELAN tab-delimited output. In a nutshell, the current converter script gives the following columns for each non-CHI utterance:

    onset  offset  xds  transcription  speaker
    0      20      A    hello.         MA1
    45     80      B    hey.           FA1
    45     78      T    baby.          MA1

Instead what I'd like to see for a more general use case is one of the following two options:
Option 1: a row for each utterance...

    speaker  onset  offset  xds  vcm  lex  mwu  transcription
    MA1      0      20      A    NA   NA   NA   hello.
    FA1      45     80      B    NA   NA   NA   hey.
    MA1      45     75      T    NA   NA   NA   baby.
    CHI      100    160     NA   C    W    M    dis [: what's this] Daddy?

(where there are NAs if there is no value of that type)
Option 2: a row for each annotation... (this is the actual ELAN mimic case)

    speaker  tier     onset  offset  value
    MA1      MA1      0      20      hello.
    MA1      xds@MA1  0      20      A
    FA1      FA1      45     80      hey.
    FA1      xds@FA1  45     80      B
    MA1      MA1      45     75      baby.
    MA1      xds@MA1  45     75      T
    CHI      CHI      100    160     dis [: what's this] Daddy?
    CHI      vcm@CHI  100    160     C
    CHI      lex@CHI  100    160     W
    CHI      mwu@CHI  100    160     M

(in this case, value gets the empty string when there's nothing there)
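To make the relationship between the two options concrete, here is a minimal sketch that pivots Option 2 rows (one per annotation) into Option 1 rows (one per utterance). It is not the code in eaf2txt.py; the function name `to_utterance_rows`, the `DEPENDENT` tuple, and the sample data are hypothetical, and it assumes that a dependent annotation such as `xds@MA1` always shares the onset and offset of its parent utterance, as in the example above.

```python
# Hypothetical sample in the Option 2 layout:
# (speaker, tier, onset, offset, value)
ANNOTATIONS = [
    ("MA1", "MA1", 0, 20, "hello."),
    ("MA1", "xds@MA1", 0, 20, "A"),
    ("CHI", "CHI", 100, 160, "dis [: what's this] Daddy?"),
    ("CHI", "vcm@CHI", 100, 160, "C"),
    ("CHI", "lex@CHI", 100, 160, "W"),
    ("CHI", "mwu@CHI", 100, 160, "M"),
]

# Dependent tier types that become columns in Option 1.
DEPENDENT = ("xds", "vcm", "lex", "mwu")


def to_utterance_rows(annotations):
    """Group annotations by (speaker, onset, offset) into one row each."""
    rows = {}
    for speaker, tier, onset, offset, value in annotations:
        key = (speaker, onset, offset)
        if key not in rows:
            # Start every column at "NA" so missing values stay visible.
            row = {"speaker": speaker, "onset": onset, "offset": offset}
            row.update({name: "NA" for name in DEPENDENT})
            row["transcription"] = "NA"
            rows[key] = row
        if tier == speaker:
            # The tier named after the speaker holds the transcription.
            rows[key]["transcription"] = value
        else:
            # Tiers like "xds@MA1" carry the tier type before the "@".
            kind = tier.split("@", 1)[0]
            if kind in DEPENDENT:
                rows[key][kind] = value
    return sorted(rows.values(), key=lambda r: (r["onset"], r["offset"]))


for row in to_utterance_rows(ANNOTATIONS):
    print("\t".join(str(v) for v in row.values()))
```

If dependent annotations can have their own time boundaries, the grouping key would need to follow the parent reference instead of the timestamps.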
Example .eaf (i.e., input file): https://www.dropbox.com/s/ioayaw8p1d35ea5/2337-0GS0.eaf?dl=0 Example .txt (i.e., output file): https://www.dropbox.com/s/jt0j14fwygej9kc/2337-0GS0.txt?dl=0
This is an example of Option 2, as described above, where each annotation (not each utterance!) gets a row in the output file. It is parallel to what you get if you manually export a tab-delimited text from ELAN and so should minimize confusion from people using both techniques to convert within a single lab.
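As a rough illustration of the one-row-per-annotation layout, the sketch below writes annotation tuples as tab-delimited text. The function name `write_annotation_rows`, the `header` switch, and the sample tuples are hypothetical, and the column order follows the table in this issue rather than anything checked against ELAN's own exporter. A missing value is written as the empty string.

```python
import io

COLUMNS = ("speaker", "tier", "onset", "offset", "value")


def write_annotation_rows(annotations, handle, header=True):
    """Write one tab-delimited line per annotation to an open text handle."""
    if header:
        handle.write("\t".join(COLUMNS) + "\n")
    for speaker, tier, onset, offset, value in annotations:
        # An absent value (None or "") becomes the empty string.
        fields = (speaker, tier, str(onset), str(offset), value or "")
        handle.write("\t".join(fields) + "\n")


# Hypothetical sample: the second annotation has no value.
sample = [
    ("FA1", "FA1", 45, 80, "hey."),
    ("FA1", "xds@FA1", 45, 80, None),
]
buffer = io.StringIO()
write_annotation_rows(sample, buffer)
print(buffer.getvalue(), end="")
```

Passing a file opened with `open(path, "w", encoding="utf-8")` in place of the `StringIO` buffer would produce the .txt file directly.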
Pull the latest version and try it; I added code that can produce output of the first kind. You can run it with:
python utils/eaf2txt.py -i 2337-0GS0.eaf -f marisa
Let me know what you want the format switch to be called. Right now the options are "-f marisa" and "-f okko" (the default); there are probably better names.
The example that you gave me seems to be file format v3.0, which the pympi library cannot process. I manually edited 2337-0GS0.eaf to say "version 2.8", and it seems to work, but that is probably not really sufficient. We may need to do more testing.
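For anyone repeating that manual edit, here is a small sketch of the same relabelling done programmatically. It assumes the version is stored in `FORMAT` and `VERSION` attributes on the root `ANNOTATION_DOCUMENT` element; the function name `relabel_eaf_version` and the tiny sample document are hypothetical. It only changes the label and does not convert anything that actually differs between format versions, so the same caveat about testing applies.

```python
import xml.etree.ElementTree as ET


def relabel_eaf_version(eaf_xml, version="2.8"):
    """Return the EAF XML with its declared format version rewritten."""
    root = ET.fromstring(eaf_xml)
    # Assumption: the version lives in these root attributes.
    for attr in ("FORMAT", "VERSION"):
        if attr in root.attrib:
            root.set(attr, version)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed stand-in for a real .eaf file.
sample = '<ANNOTATION_DOCUMENT FORMAT="3.0" VERSION="3.0"><HEADER/></ANNOTATION_DOCUMENT>'
print(relabel_eaf_version(sample))
```

Working on a copy of the file rather than the original keeps the v3.0 annotations intact in case the relabelled file turns out to be misread.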
no more activity, closing for now