johnwdubois / rezonator

Rezonator: Dynamics of human engagement

Elan tab-delimited import #375

Open johnwdubois opened 5 years ago

johnwdubois commented 5 years ago

Background

Rezonator users may want to import data produced in popular software such as Elan. Elan is widely used by linguists, anthropologists, and others, especially for transcribing audio and video recordings of conversation. A useful workflow is to transcribe in Elan, then import the transcription into Rezonator.

The solution you'd like

To import data from Elan, do the data exchange in two steps:

  1. First, use Elan to export the file in a commonly used format, such as a tab-delimited file.
    • Open the Elan transcription file (.eaf file).
    • From the menu, select "File/Export as/Tab-delimited text".
    • Select the appropriate export options, checking the boxes as shown in the screenshot below.
  2. Second, use Rezonator to import the tab-delimited text. (See next section for details.)

Screenshot

The following shows which options to select in Elan's "Export as tab-delimited text" dialog:

[Image: Export from Elan]

Import into Rezonator

  1. Now use Rezonator to import the tab-delimited file.
  2. Rezonator then creates its own internal version of the file, which closely mirrors the structure of the tab-delimited file (and so inherits most aspects of its data structure from the original Elan file).
  3. Within Rezonator, the fields of data commonly encoded in Elan must be handled correctly: each Elan field should be assigned to the corresponding field in the Rezonator node map. This requires mapping the fields labeled according to Elan conventions onto Rezonator fields, such as:
    • timestamps (e.g. begin time, end time, total length)
    • participants (speaker labels)
  4. The imported data needs to be tokenized in the usual way.
  5. The text field should contain all the tokens (e.g. morphemes, words, pauses, vocalisms, etc.).
  6. By default, Elan groups utterances by the participant who produced them, not by conversational sequence. For a good result, Rezonator should use the timestamp information to sort the utterances back into the original conversational sequence. (Sort by unitStartTime, then by unitEndTime.)
  7. For more complex Elan transcriptions, the mapping may involve additional fields such as text, transcription, gloss, translation, etc. The Rezonator import screen should allow users to specify the mapping between Elan field names and the corresponding Rezonator field names.
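
To make the import steps above concrete, here is a minimal Python sketch (not Rezonator code) of reading a tab-delimited export, mapping its columns onto Rezonator-style field names, tokenizing the text, and sorting into conversational order. Everything except `unitStartTime` and `unitEndTime` is an assumption for illustration: the column order in `ELAN_COLUMNS`, the names in `FIELD_MAP`, the whitespace tokenizer, and the sample rows are all invented, and the real column layout depends on which export options are ticked in Elan.

```python
import csv
import io

# ASSUMPTION: hypothetical column order. The real order depends on the
# options selected in Elan's export dialog, so adjust to match the file.
ELAN_COLUMNS = ["tier", "participant", "begin_ms", "end_ms", "text"]

# ASSUMPTION: hypothetical mapping from Elan-side columns to Rezonator-side
# fields. Only unitStartTime and unitEndTime are named in this issue.
FIELD_MAP = {
    "participant": "participant",
    "begin_ms": "unitStartTime",
    "end_ms": "unitEndTime",
    "text": "text",
}


def read_units(stream):
    """Read tab-delimited rows and return units in conversational order."""
    units = []
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < len(ELAN_COLUMNS):
            continue  # skip blank or incomplete lines
        record = dict(zip(ELAN_COLUMNS, row))
        unit = {target: record[source] for source, target in FIELD_MAP.items()}
        unit["unitStartTime"] = int(unit["unitStartTime"])
        unit["unitEndTime"] = int(unit["unitEndTime"])
        # Whitespace splitting stands in for Rezonator's own tokenizer.
        unit["tokens"] = unit["text"].split()
        units.append(unit)
    # Elan groups rows by participant; sort by start time, then end time.
    units.sort(key=lambda u: (u["unitStartTime"], u["unitEndTime"]))
    return units


# Invented sample rows, grouped by participant the way an export might be.
SAMPLE = (
    "ALICE\tALICE\t0\t1200\thello there\n"
    "ALICE\tALICE\t2500\t3100\tfine thanks\n"
    "BOB\tBOB\t1300\t2400\thow are you\n"
)

units = read_units(io.StringIO(SAMPLE))
for u in units:
    print(u["unitStartTime"], u["unitEndTime"], u["participant"], u["tokens"])
```

Keeping the mapping in a plain dictionary is one way to let an import screen fill it in from user choices rather than hard-coding Elan tier names.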

Documenting the Elan export

  1. One goal is to simply document the process of exporting from Elan. Even if the Elan documentation already describes how to export a tab-delimited file, Rezonator users will benefit from us documenting the simplest way possible to export from Elan, and import into Rezonator.
  2. For details on how to export a tab-delimited file (to be used as our data exchange format), see the Elan documentation on:
    Exporting a document as a tab-delimited text file.
  3. For general information on Elan and the .eaf format, see the Elan documentation.

Alternatives you've considered

It may be possible for Rezonator to import an Elan file (.eaf) directly. This would require a schema to interpret and process the .eaf format files used by Elan. The question is whether this would be cost-effective.

  1. Evaluate whether it makes sense to:
    • use the existing Elan export functions to create an exchange file format (as described above), or
    • import an Elan file (.eaf) directly, using a schema to interpret and process the .eaf format files used by Elan
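
To help weigh the second option, here is a rough Python sketch of what direct .eaf reading could involve. It assumes the .eaf file is XML in which `TIME_SLOT` elements carry millisecond values and each `TIER` holds `ALIGNABLE_ANNOTATION` elements that refer to two time slots; that is my understanding of the format and should be checked against the official Elan schema. The sample document is invented and far simpler than a real file (no reference annotations, no unaligned time slots, no linguistic types).

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified .eaf-like document for illustration only.
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="1300"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="2400"/>
  </TIME_ORDER>
  <TIER TIER_ID="BOB" PARTICIPANT="BOB">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>how are you</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="ALICE" PARTICIPANT="ALICE">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello there</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_eaf(xml_text):
    """Return (start_ms, end_ms, speaker, text) tuples sorted by time."""
    root = ET.fromstring(xml_text)
    # ASSUMPTION: every time slot has a TIME_VALUE; real files may not.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
    }
    rows = []
    for tier in root.iter("TIER"):
        # Fall back to the tier name when no participant is recorded.
        speaker = tier.get("PARTICIPANT") or tier.get("TIER_ID")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((start, end, speaker, text))
    # Same ordering as the tab-delimited route: start time, then end time.
    rows.sort(key=lambda r: (r[0], r[1]))
    return rows


rows = read_eaf(EAF_SAMPLE)
for row in rows:
    print(row)
```

Even in this toy form, the direct route has to resolve time-slot references itself, which the tab-delimited export already does; a real importer would also need to handle the parts of the format omitted here.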

johnwdubois commented 4 years ago

Here's a link to the file to use for testing (based on SBC001, as exported from Elan):
https://ucsb.box.com/s/16c7w3xghpa01dpbi2mj6rsxldbe3ndc

gtroiani commented 4 years ago

Rezonator