johnwdubois / rezonator

Rezonator: Dynamics of human engagement

Corpus web pages for languages #599

Open johnwdubois opened 4 years ago

johnwdubois commented 4 years ago

Background

Rezonator users can benefit from having a wide variety of samples of corpus data, for at least 2 reasons:

What to do

Create web pages on Rezonator.com for corpus data from various languages, providing data suitable for use with Rezonator.

  1. On the main Rezonator.com pages, create a top-level page called "Corpus".
  2. Use rezonator.com/corpus to host a separate page for each corpus.
  3. Organize the corpus pages according to the following hierarchy of categories, most inclusive first:
    • language (use standard ISO codes, same as for localization)
    • corpus name
    • data type
  4. For example, the corpus pages would include:
    • rezonator.com/corpus/en/santabarbaracorpus/transcript [original csv files]
    • rezonator.com/corpus/en/santabarbaracorpus/rez
    • rezonator.com/corpus/en/santabarbaracorpus/audio/wav
    • rezonator.com/corpus/en/santabarbaracorpus/audio/ogg
    • rezonator.com/corpus/en/santabarbaracorpus/metadata
    • rezonator.com/corpus/zh/spokentaiwanmandarin/transcript
    • rezonator.com/corpus/zh/spokentaiwanmandarin/rez
    • rezonator.com/corpus/zh/spokentaiwanmandarin/audio
    • rezonator.com/corpus/zh/pearfilm/transcript
    • rezonator.com/corpus/zh/pearfilm/rez
    • rezonator.com/corpus/zh/pearfilm/audio
    • rezonator.com/corpus/en/gum
    • rezonator.com/corpus/it/kiparla
    • rezonator.com/corpus/he/
    • rezonator.com/corpus/ru/
    • rezonator.com/corpus/kk/
    • rezonator.com/corpus/es/
    • etc.
  5. Use ISO 639-3 language codes when possible (see #639).
  6. The rezonator.com/corpus page will essentially be a table of contents, with links that take the user to a separate page for each specific language.
  7. Make sure licensing rights are handled accurately, legally, and ethically.
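
The URL hierarchy described in steps 3 and 4 (language, then corpus name, then data type) could be sketched roughly as follows. This is only an illustration, not part of Rezonator: the `CorpusPage` class, its field names, and the `BASE_URL` constant are hypothetical, and the trailing slash on language-only pages simply mirrors the examples listed in step 4.

```python
from dataclasses import dataclass
from typing import Optional

# Base path taken from the issue text; everything else here is a hypothetical sketch.
BASE_URL = "rezonator.com/corpus"


@dataclass(frozen=True)
class CorpusPage:
    """One page in the proposed hierarchy: language > corpus name > data type."""

    language: str                    # language code, e.g. "en", "zh"
    corpus: Optional[str] = None     # e.g. "santabarbaracorpus"
    data_type: Optional[str] = None  # e.g. "transcript", "rez", "audio/wav"

    def url(self) -> str:
        # A data type only makes sense underneath a named corpus.
        if self.data_type and not self.corpus:
            raise ValueError("a data type requires a corpus name")
        parts = [BASE_URL, self.language]
        if self.corpus:
            parts.append(self.corpus)
        if self.data_type:
            parts.append(self.data_type)
        path = "/".join(parts)
        # Language-only pages end in "/" in the issue's examples.
        return path if self.corpus else path + "/"


if __name__ == "__main__":
    print(CorpusPage("en", "santabarbaracorpus", "audio/wav").url())
    print(CorpusPage("it", "kiparla").url())
    print(CorpusPage("he").url())
```

Keeping the three levels as separate fields, rather than as one path string, would make it straightforward to generate the table of contents in step 6 by grouping pages on `language`.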

Future development

  1. Include both the original data file (to validate, and practice with, the import process) and the .rez file that results.
  2. Include media (audio) as well as text files.
  3. For each language, try to include a wide variety of data types: songs, verse, interlinear glossed text, one word per line, CoNLL-U, etc.
  4. Include a link in the Rezonator software that takes the user to the main landing page (rezonator.com/corpus).
  5. Use analytics to keep track of how many people are accessing these pages and downloading corpus data from them.
  6. It is important to remove corpus data from both the Rezonator tool itself and from the Rezonator GitHub site, in order to:
    • slim down Rezonator
    • make sure that licensing issues for each corpus are addressed (see above)

Alternatives you have considered

Perhaps use corpus.rezonator.com instead of rezonator.com/corpus? Probably not.

See also

#979

kayaulai commented 3 years ago

For the languages of the corpora used, would glottocode be more appropriate? I think many linguists would be opposed to using the ISO code for the language they work with, especially if they don't like the ISO classification, whereas Glottolog is much more responsive to linguist feedback.