levmichael / saphon

South American Phonological Inventory

SAPhon

This repository holds the data that underlies the South American Phonological Inventory Database. It also contains Python routines for reading, checking, and writing the data files.

Data file format

The langs/ directory contains one file for each language in the database. The files are YAML files that are compatible with the JSON schema of YAML version 1.2. All files are encoded in UTF-8.

Each file comprises one or more documents. Two types of document may be present: 1) a synthesis document that describes the phonological inventory of the language; 2) zero or more ref documents, each containing information gathered from one of the reference materials. There must be exactly one synthesis document per YAML file, and by convention it should be the first document in the file. Normally there should be at least one ref document per language, and one per bibliographic reference.
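The composition rules above can be expressed as a small check. This is a minimal sketch and not the repository's own checking routine. It assumes that each document has already been loaded into a Python dict, and that the doctype field holds the string synthesis or ref (the field name comes from the format description; the values are an assumption). The function name check_documents is hypothetical.

```python
def check_documents(docs):
    """Return a list of problems found in the documents of one language file.

    `docs` is a list of dicts, one per YAML document, in file order.
    Assumption: doctype values are 'synthesis' and 'ref'.
    """
    problems = []
    doctypes = [doc.get("doctype") for doc in docs]

    n_synthesis = doctypes.count("synthesis")
    if n_synthesis != 1:
        problems.append(f"expected exactly 1 synthesis document, found {n_synthesis}")
    elif doctypes[0] != "synthesis":
        # Placement is a convention, so this could be treated as a warning.
        problems.append("synthesis document is not first (convention)")

    if doctypes.count("ref") == 0:
        problems.append("no ref document (normally at least one is expected)")

    return problems


docs = [
    {"doctype": "synthesis", "countries": ["Brazil", "Guyana"]},
    {"doctype": "ref"},
]
print(check_documents(docs))  # []
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in a file at once.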

Each document is terminated either by a line consisting only of --- or by the end of the file.
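To illustrate the termination rule only, the following sketch splits the text of a file into per-document strings. It is not a YAML parser: a real parser also handles cases this ignores, such as a --- line inside a block scalar. The function name split_documents and the sample text are hypothetical.

```python
def split_documents(text):
    """Split the text of a language file into per-document strings.

    A line consisting only of '---' (trailing whitespace tolerated) ends the
    current document; end-of-file ends the last one. Empty chunks, such as
    the one before a leading '---', are dropped.
    """
    chunks, current = [], []
    for line in text.splitlines():
        if line.rstrip() == "---":
            chunks.append("\n".join(current))
            current = []
        else:
            current.append(line)
    chunks.append("\n".join(current))  # end-of-file terminates the last document
    return [chunk for chunk in chunks if chunk.strip()]


# Placeholder content; the doctype values are assumed.
sample = "doctype: synthesis\nname: Example\n---\ndoctype: ref\n"
for number, document in enumerate(split_documents(sample), start=1):
    print(number, repr(document))
```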

Data consists of scalars, sequences, and mappings in YAML parlance, which correspond to Python scalars, lists, and dicts.
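As an illustration of this correspondence, the sketch below writes out by hand the Python value that a small ref document would correspond to, and classifies each part. The field names are taken from the field list for ref documents; the doctype value ref is an assumption, and the helper name kind is hypothetical.

```python
# Written as Python literals; a YAML loader would produce the same shapes.
ref_document = {                      # mapping  -> dict
    "doctype": "ref",                 # scalar   -> str (value assumed)
    "ref_allophones": [               # sequence -> list (here, of mappings)
        {"grapheme_allophone": "b", "grapheme_phoneme": "b"},
        {"grapheme_allophone": "mb", "grapheme_phoneme": "b"},
    ],
}


def kind(value):
    """Name the YAML kind that a loaded Python value corresponds to."""
    if isinstance(value, dict):
        return "mapping"
    if isinstance(value, list):
        return "sequence"
    return "scalar"


print(kind(ref_document))                       # mapping
print(kind(ref_document["ref_allophones"]))     # sequence
print(kind(ref_document["doctype"]))            # scalar
```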

The synthesis document

The synthesis document contains language metadata and describes the phonological inventory of the language as synthesized by the SAPhon project from the reference materials. There must be exactly one synthesis document per language file.

The fields of the synthesis document fall into three groups: top-level scalar fields, eight fields that contain simple sequences (lists) of scalar values, and two fields that contain sequences (lists) of mappings (dicts).

ref documents

A ref document contains information summarized from a bibliographic reference. There should be one ref document for each reference.

The top-level scalar fields for ref documents are:

Data entry conventions

YAML is a flexible format that allows for multiple styles of representing identical values. The preceding description of the data file format covers the semantics of the values, and this section describes in more detail the syntactic choices that should be followed when creating or editing language files. In most cases a different syntactic choice could have been made without altering the meaning of the data file, and the guidance in this section is to encourage consistency across language files.

Fields of the synthesis document (indented entries are sub-fields of the entry above them):

  1. doctype
  2. name
  3. short_name
  4. alternate_names
  5. iso_codes
  6. family
  7. countries
  8. coordinates
    1. latitude
    2. longitude
    3. elevation_meters
  9. phonemes
  10. allophones
    1. allophone
    2. phoneme
  11. nasal_harmony
  12. tone
  13. laryngeal_harmony
  14. notes

Fields of a ref document (indented entries are sub-fields of the entry above them):

  1. doctype
  2. citation
  3. graphemes2phonemes
    1. grapheme
    2. phoneme
  4. ref_allophones
    1. grapheme_allophone
    2. grapheme_phoneme
  5. ref_notes
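The field lists above lend themselves to a simple check for misspelled field names. The following is a sketch under stated assumptions, not the repository's checker: the README does not say which fields are required, so the code only reports top-level keys that are not in the lists. The names SYNTHESIS_FIELDS, REF_FIELDS, and unknown_fields are hypothetical, and the sample document is a placeholder.

```python
# Field names copied from the field lists above.
SYNTHESIS_FIELDS = [
    "doctype", "name", "short_name", "alternate_names", "iso_codes", "family",
    "countries", "coordinates", "phonemes", "allophones", "nasal_harmony",
    "tone", "laryngeal_harmony", "notes",
]
REF_FIELDS = ["doctype", "citation", "graphemes2phonemes", "ref_allophones", "ref_notes"]


def unknown_fields(doc, allowed):
    """Return the top-level keys of `doc` that are not in `allowed`, sorted."""
    return sorted(set(doc) - set(allowed))


# 'ref_note' is a deliberate misspelling of 'ref_notes'.
ref_doc = {"doctype": "ref", "citation": "(citation text)", "ref_note": []}
print(unknown_fields(ref_doc, REF_FIELDS))  # ['ref_note']
```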
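
Sample values from language files.

A simple sequence of scalars written in flow style:

    phonemes: [p, b, t, d, ɖ, tʃ, k, ɡ, ʔ, m, n, ɲ, s, ʐ, ʃ, w, j, ɽ, i, a, u, ɨ]

A simple sequence of scalars written in block style:

    countries:
    - Brazil
    - Guyana

A sequence of mappings written in block style:

    ref_allophones:
    - grapheme_allophone: b
      grapheme_phoneme: b
    - grapheme_allophone: mb
      grapheme_phoneme: b
    - grapheme_allophone: m
      grapheme_phoneme: m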