johnwdubois / rezonator

Rezonator: Dynamics of human engagement

Importing EXMARaLDA data #488

Open kayaulai opened 4 years ago

kayaulai commented 4 years ago

Is your feature request related to a problem? Please describe.

EXMARaLDA is a data format for the transcription and annotation of spoken data. It is currently handled by the EXMARaLDA software suite, which includes tools for transcription, corpus generation and searching. The INEL Selkup Corpus is based on EXMARaLDA.

Describe the solution you'd like

The basic ideas behind the EXMARaLDA transcription and annotation format are the following:

The timeline is the main difference from ELAN. EXMARaLDA has a series of time points, usually tied to specific moments in the audio, which act as start and end points for events. Rezonator will have to identify one or more tiers that correspond to the Unit level, one or more tiers that correspond to the Token level, and the speaker labels, which are built into the EXMARaLDA tiers.
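To make the timeline idea concrete, here is a minimal Python sketch of reading time points and tier events. It assumes the XML layout of an EXMARaLDA basic transcription (`common-timeline`, `tli`, `tier`, `event`, with a `speaker` attribute on each tier); the sample document is hand-written for illustration, and the function names `read_timeline` and `read_tiers` are hypothetical, not part of Rezonator or EXMARaLDA.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the style of an EXMARaLDA basic transcription.
# Illustrative only; not taken from any real corpus.
SAMPLE_EXB = """<basic-transcription>
  <basic-body>
    <common-timeline>
      <tli id="T0" time="0.0"/>
      <tli id="T1" time="0.8"/>
      <tli id="T2" time="1.5"/>
    </common-timeline>
    <tier id="TIE0" speaker="SPK0" category="tx" type="t">
      <event start="T0" end="T1">hello</event>
      <event start="T1" end="T2">world</event>
    </tier>
  </basic-body>
</basic-transcription>"""


def read_timeline(root):
    """Map each time point id to its time in seconds (None if it has no time)."""
    timeline = {}
    for tli in root.iter("tli"):
        time = tli.get("time")
        timeline[tli.get("id")] = float(time) if time is not None else None
    return timeline


def read_tiers(root, timeline):
    """Collect each tier with its speaker label and time-resolved events."""
    tiers = []
    for tier in root.iter("tier"):
        events = [
            {
                "start": timeline.get(ev.get("start")),
                "end": timeline.get(ev.get("end")),
                "text": ev.text or "",
            }
            for ev in tier.iter("event")
        ]
        tiers.append(
            {
                "id": tier.get("id"),
                "speaker": tier.get("speaker"),
                "category": tier.get("category"),
                "events": events,
            }
        )
    return tiers


root = ET.fromstring(SAMPLE_EXB)
timeline = read_timeline(root)
for tier in read_tiers(root, timeline):
    print(tier["speaker"], tier["category"], tier["events"])
```

Events refer to time points by id rather than storing times directly, so the importer resolves ids against the timeline first. Time points without a time value are kept as `None` here, since not every point is necessarily anchored to the audio.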

As EXMARaLDA is very high-level and its specific implementation will vary among corpora, it may be useful to focus first on the approach taken by the INEL Selkup Corpus. The INEL Selkup Corpus is divided into two main units: the sentence and the word. There are also morph-level and group-of-words-level annotations. Thus the sentence is the only item that can correspond to the Unit in Rezonator, and the word maps onto the Token in Rezonator terms.

There are tiers for transcriptions and annotations of sentence-level material, as well as tiers for transcriptions and annotations of word-level material. The ref tier can be used to determine how to split the text into sentences, and the tx tier can be used to provide the transcriptions of words. Otherwise, the contents of events in unit-level tiers should be stored as unit-level annotations, and the contents of events in word- and morph-level tiers should be stored as word-level annotations. If group-of-words-level annotations are present, they should be stored as chunk-level annotations. They may not be present: GRAID annotations are optional and may be word-level or group-of-words-level, and code-switching annotation, which is group-level, is also optional.
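The mapping described above could be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: events are plain dicts that reference time point ids, the sample data is invented, and the rule of classifying an annotation by the span it covers (rather than by a per-tier setting) is a simplification of my own. The names `build_units`, `annotation_level` and `covered_words` are hypothetical.

```python
# Invented sample data: time point ids in timeline order, sentence events
# from the ref tier, and word events from the tx tier.
ORDER = ["T0", "T1", "T2", "T3", "T4", "T5"]
POS = {tli: i for i, tli in enumerate(ORDER)}

REF = [
    {"start": "T0", "end": "T2", "text": "S1"},
    {"start": "T2", "end": "T5", "text": "S2"},
]
TX = [
    {"start": "T0", "end": "T1", "text": "w1"},
    {"start": "T1", "end": "T2", "text": "w2"},
    {"start": "T2", "end": "T3", "text": "w3"},
    {"start": "T3", "end": "T4", "text": "w4"},
    {"start": "T4", "end": "T5", "text": "w5"},
]


def covered_words(span, tx_events, pos):
    """Return the word events that lie entirely inside the given span."""
    lo, hi = pos[span["start"]], pos[span["end"]]
    return [w for w in tx_events if lo <= pos[w["start"]] and pos[w["end"]] <= hi]


def build_units(ref_events, tx_events, pos):
    """Each ref event becomes a Unit; the tx words inside it become its Tokens."""
    return [
        {
            "unit_id": ref["text"],
            "tokens": [w["text"] for w in covered_words(ref, tx_events, pos)],
        }
        for ref in ref_events
    ]


def annotation_level(event, ref_events, tx_events, pos):
    """Assumed heuristic: classify an annotation event by the span it covers."""
    # Same span as a sentence: unit-level annotation.
    if any(r["start"] == event["start"] and r["end"] == event["end"] for r in ref_events):
        return "unit"
    # More than one whole word: chunk-level. One word, or part of a word
    # (morph-level): stored at the word (token) level.
    return "chunk" if len(covered_words(event, tx_events, pos)) > 1 else "token"


print(build_units(REF, TX, POS))
print(annotation_level({"start": "T3", "end": "T5"}, REF, TX, POS))
```

One limitation of the span-based heuristic: a group-of-words annotation that happens to cover an entire sentence would be classified as unit-level. A per-tier setting chosen by the user at import time would avoid that ambiguity.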

Additional context

As EXMARaLDA is similar to ELAN in its format, code from the ELAN import may be reusable.

Schmidt, Thomas & Kai Wörner. 2009. EXMARaLDA–Creating, analysing and sharing spoken language corpora for pragmatic research. Pragmatics. John Benjamins 19(4). 565–582.

Brykina, Maria; Orlova, Svetlana; Wagner-Nagy, Beáta. 2018. INEL Selkup Corpus. Version 0.1. Publication date 2018-12-31. Archived in Hamburger Zentrum für Sprachkorpora. http://hdl.handle.net/11022/0000-0007-CAE5-3. In: Wagner-Nagy, Beáta; Arkhipov, Alexandre; Ferger, Anne; Jettka, Daniel; Lehmberg, Timm (eds.). 2018. The INEL corpora of indigenous Northern Eurasian languages.

kayaulai commented 4 years ago

Some other good corpora in EXMARaLDA: