This repository holds the data that underlies the South American Phonological Inventory Database. It also contains Python routines for reading, checking, and writing the data files.
The langs/ directory contains a file for each language in the database. The files are YAML documents and are compatible with the version 1.2 JSON schema. The character encoding of all files is UTF-8.
Each file comprises one or more documents. Two types of document may be present: 1) a synthesis document that describes the phonological inventory of the language; 2) zero or more ref documents that contain information gathered from each of the reference materials. There must be exactly one synthesis document per YAML file, and by convention it should be the first document in the file. Normally there should be at least one ref document per language and one per bibliographic reference.
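The document-level rules above can be checked mechanically once a file has been parsed. The following is a minimal sketch, assuming the documents of one file are already loaded into a list of Python dicts in file order. The function name check_doctypes is hypothetical and is not one of the repository's routines.

```python
def check_doctypes(docs):
    """Return a list of problems in the doctype layout of one language file.

    `docs` is assumed to be the parsed documents (dicts) in file order.
    """
    problems = []
    doctypes = [doc.get("doctype") for doc in docs]

    # Exactly one synthesis document is required per file.
    n_synthesis = doctypes.count("synthesis")
    if n_synthesis != 1:
        problems.append(
            f"expected exactly 1 synthesis document, found {n_synthesis}"
        )
    elif doctypes[0] != "synthesis":
        # Only a convention, so it is reported as a warning.
        problems.append("warning: synthesis document is not the first document")

    # Only the two document types described above are expected.
    for doctype in doctypes:
        if doctype not in ("synthesis", "ref"):
            problems.append(f"unknown doctype: {doctype!r}")

    # A ref document is normal but not strictly required.
    if "ref" not in doctypes:
        problems.append("warning: no ref document")
    return problems
```

An empty list means the file follows both the hard rule and the conventions; entries prefixed with "warning:" flag departures from convention only.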
Each document is terminated by a line consisting only of --- or by end-of-file.
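To illustrate the termination rule, here is a naive line-based splitter. It is only a sketch and not a YAML parser: in practice a YAML library's multi-document loader would do this job. The function name split_documents and the sample text are made up for the example.

```python
def split_documents(text):
    """Split the text of a language file into per-document chunks.

    A chunk ends at a line consisting only of '---' or at end-of-file.
    Empty chunks (for example after a trailing '---') are dropped.
    """
    chunks, current = [], []
    for line in text.splitlines():
        if line.rstrip() == "---":
            chunks.append("\n".join(current))
            current = []
        else:
            current.append(line)
    chunks.append("\n".join(current))  # end-of-file closes the last document
    return [chunk for chunk in chunks if chunk.strip()]


# Made-up sample content, not taken from a real language file.
sample = """doctype: synthesis
name: Example
---
doctype: ref
citation: Hypothetical source
---
"""
documents = split_documents(sample)
```

Here documents holds two chunks, one per document, each of which could then be handed to a YAML parser.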
Data consists of scalars, sequences, and mappings in YAML parlance, which correspond to Python scalars, lists, and dicts.
synthesis document

The synthesis document contains language metadata and describes the phonological inventory of the language as synthesized by the SAPhon project from the reference materials. There must be exactly one synthesis document per language file.
The top-level scalar fields of the synthesis document are described first:
doctype
: The document type. Must be synthesis.
name
: The preferred citation form of the name of the language, in an orthographic form suited to academic publications. It may contain spaces, hyphens, diacritics, and non-Latin glyphs that would occur in the preferred orthographic representation, e.g. Arára do Mato Grosso, Aʔɨwa, Ashéninka (Apurucayali dialect).
short_name
: The language name abbreviated to around 12 characters or fewer, for use in tables and plots where space is tight. Spaces, hyphens, diacritics, and non-Latin glyphs are all permitted.
family
: The linguistic family of the language, or Isolate for linguistic isolates.
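A check of these scalar fields might look like the sketch below. It assumes all four fields are required strings, which the description implies but does not state outright, and it treats the 12-character guideline for short_name as a warning rather than an error. The function name and the default limit parameter are hypothetical.

```python
SCALAR_FIELDS = ("doctype", "name", "short_name", "family")


def check_synthesis_scalars(doc, max_short_name=12):
    """Return a list of problems with the scalar fields of a synthesis dict."""
    problems = []
    for field in SCALAR_FIELDS:
        # Assumption: every scalar field must be present and be a string.
        if not isinstance(doc.get(field), str):
            problems.append(f"{field}: missing or not a string")
    if doc.get("doctype") != "synthesis":
        problems.append("doctype: must be 'synthesis'")

    # len() counts Unicode code points, so combining diacritics count too.
    short_name = doc.get("short_name")
    if isinstance(short_name, str) and len(short_name) > max_short_name:
        problems.append(
            f"warning: short_name is {len(short_name)} characters long"
        )
    return problems
```

Because the length guideline is approximate ("around 12"), a caller may prefer to raise max_short_name slightly instead of shortening a name that is already in use.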
Eight fields follow. The first five contain simple sequences (lists) of scalar values, and the last three hold booleans:
phonemes
: A list of the phonemes of the language, using symbols from the International Phonetic Alphabet. This list is synthesized from the entries listed in the ref documents.
alternate_names
: A list of alternative or outdated names for the language.
iso_codes
: A sequence of ISO 639-3 codes for the language, or codes of our own devising when the ISO codes are inadequate. When we need to distinguish language varieties not distinguished by ISO 639-3, we add a three-letter extension to the code with an underscore '_' separator. Ordinarily the sequence contains only one code, but more values occur when multiple ISO codes refer to the same language (e.g. [Huaylas-Conchucos Quechua](langs/HuaylasCQ.txt) contains codes qxn, qwh).
countries
: A list of country names where the language is indigenous.
notes
: A list of notes relating to the language.
nasal_harmony
: Boolean indicating presence of nasal harmony (true) or not (false).
laryngeal_harmony
: Boolean indicating presence of laryngeal harmony (true) or not (false).
tone
: Boolean indicating presence of tone (true) or not (false).
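The code format and the boolean fields described in this list can be validated with a short routine. The sketch below assumes codes are three lowercase letters with an optional underscore plus three-letter extension; codes of the project's own devising may not follow that shape, so mismatches are best treated as prompts for review. The names used here are hypothetical.

```python
import re

# Assumed shape: "abc" or "abc_xyz" (three lowercase letters, optional
# underscore-separated three-letter extension).
ISO_CODE_RE = re.compile(r"[a-z]{3}(_[a-z]{3})?")
BOOLEAN_FIELDS = ("nasal_harmony", "laryngeal_harmony", "tone")


def check_codes_and_flags(doc):
    """Return a list of problems with iso_codes and the boolean fields."""
    problems = []
    for code in doc.get("iso_codes", []):
        if not isinstance(code, str) or not ISO_CODE_RE.fullmatch(code):
            problems.append(f"iso_codes: unexpected code {code!r}")
    for field in BOOLEAN_FIELDS:
        # YAML true/false load as Python True/False.
        if not isinstance(doc.get(field), bool):
            problems.append(f"{field}: expected true or false")
    return problems
```

For instance, a document with iso_codes of qxn and qwh and all three flags set to true or false yields no problems.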
Two fields contain sequences (lists) of mappings (dicts):
coordinates
: A list of the geographical coordinates for the language. Each entry in the list is a mapping of the fields:
    latitude
    : The coordinate latitude, given to 3 decimal places.
    longitude
    : The coordinate longitude, given to 3 decimal places.
    elevation_meters
    : The elevation in meters, rounded to the nearest integer meter. May be omitted if unknown.
allophones
: A list of mappings of allophonic variants to phonemes in the language. Each entry in the list is a mapping of the fields:
    allophone
    : The allophonic variant, as written in IPA.
    phoneme
    : The phoneme corresponding to the allophonic variant. The phoneme must exactly match an entry in the phonemes list.

ref documents

A ref document contains information summarized from a bibliographic reference. There should be one ref document for each reference.
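Before the ref fields are listed, the exact-match requirement on allophones entries in the synthesis document can be sketched as a mechanical check. The function name is hypothetical and the example data is made up. "Exactly" is taken here to mean code-point equality, so two strings that look identical but differ in Unicode normalization would not match.

```python
def check_allophones(synthesis):
    """Return the allophones entries whose phoneme is not in phonemes."""
    phonemes = set(synthesis.get("phonemes", []))
    return [
        entry
        for entry in synthesis.get("allophones", [])
        if entry.get("phoneme") not in phonemes
    ]


# Made-up data: the second entry points at a phoneme that is not listed.
example = {
    "phonemes": ["p", "b", "m"],
    "allophones": [
        {"allophone": "mb", "phoneme": "b"},
        {"allophone": "β", "phoneme": "v"},
    ],
}
unmatched = check_allophones(example)
```

In this example unmatched contains only the entry for β, since v does not appear in the phonemes list.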
The top-level scalar fields for ref documents are:
doctype
: The document type. Must be ref.
citation
: A bibliographic citation for the reference.
ref_notes
: A list of notes relating to the reference.
graphemes2phonemes
: A list of mappings from graphemes that appear in the reference document to the phonemes they correspond to in the phonemes list of the synthesis document. Each entry in the list is a mapping of the fields:
    grapheme
    : The grapheme in the reference document.
    phoneme
    : The phoneme corresponding to the grapheme, written in IPA. The phoneme must exactly match an entry in the phonemes list or be null.
ref_allophones
: A list of mappings of allophonic variants to phonemes in the language, as described and written in the reference document. Each entry in the list is a mapping of the fields:
    grapheme_allophone
    : The allophonic variant.
    grapheme_phoneme
    : The phoneme corresponding to the allophonic variant.

YAML is a flexible format that allows for multiple styles of representing identical values. The preceding description of the data file format covers the semantics of the values, and this section describes in more detail the syntactic choices that should be followed when creating or editing language files. In most cases a different syntactic choice could have been made without altering the meaning of the data file, and the guidance in this section is to encourage consistency across language files.
Quoted and unquoted string values are allowed in YAML syntax. Most field values in SAPhon do not require quotes, and the general practice is to omit them where they are not necessary. The exception is the citation field, which often requires surrounding quotes (because of embedded ':' characters).
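As an illustration of quoting only where necessary, the sketch below leaves a value unquoted unless it contains one of a few sequences that an unquoted (plain) YAML scalar cannot hold. It is deliberately simplified: it ignores other cases that need quotes, such as values starting with an indicator character. The function name is hypothetical, and the choice of double quotes is an assumption, since the quote style is not specified here.

```python
def yaml_scalar(value):
    """Return `value` unquoted when safe, else as a double-quoted YAML string.

    Simplified: only ': ' and ' #' inside the value, and a trailing ':',
    are treated as requiring quotes.
    """
    needs_quotes = ": " in value or " #" in value or value.endswith(":")
    if not needs_quotes:
        return value
    # Inside YAML double quotes, backslash and double quote are escaped.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Placeholder strings, not real citations.
plain = yaml_scalar("Brazil")
quoted = yaml_scalar("Author, A. Year. Title: Subtitle.")
```

Here plain is returned unchanged, while quoted is wrapped in double quotes because of the embedded colon followed by a space.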
For the synthesis doctype the fields should be listed in the order:

doctype
name
short_name
alternate_names
iso_codes
family
countries
coordinates
    latitude
    longitude
    elevation_meters
phonemes
allophones
    allophone
    phoneme
nasal_harmony
tone
laryngeal_harmony
notes

For the ref doctype the fields should be listed in the order:

doctype
citation
graphemes2phonemes
    grapheme
    phoneme
ref_allophones
    grapheme_allophone
    grapheme_phoneme
ref_notes
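The two orderings above can be checked against a parsed document. This sketch covers top-level fields only and assumes the YAML loader keeps keys in file order (Python dicts preserve insertion order from version 3.7). The constant and function names are hypothetical.

```python
SYNTHESIS_ORDER = [
    "doctype", "name", "short_name", "alternate_names", "iso_codes",
    "family", "countries", "coordinates", "phonemes", "allophones",
    "nasal_harmony", "tone", "laryngeal_harmony", "notes",
]
REF_ORDER = [
    "doctype", "citation", "graphemes2phonemes", "ref_allophones", "ref_notes",
]


def fields_in_order(doc):
    """True if the top-level keys of `doc` follow the order for its doctype.

    Missing fields are allowed; unknown fields make the check fail.
    """
    is_synthesis = doc.get("doctype") == "synthesis"
    expected = SYNTHESIS_ORDER if is_synthesis else REF_ORDER
    present = [key for key in expected if key in doc]
    return list(doc) == present
```

The nested fields (for example latitude before longitude) could be checked the same way by applying a second expected order to each entry of the list.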
The phonemes list is created as a single line of comma-separated values enclosed by square brackets. This is the YAML 'flow sequence' style. For example:

    phonemes: [p, b, t, d, ɖ, tʃ, k, ɡ, ʔ, m, n, ɲ, s, ʐ, ʃ, w, j, ɽ, i, a, u, ɨ]

The examples below show the block style for a simple sequence and for a sequence of mappings:

    countries:
    - Brazil
    - Guyana

    ref_allophones:
    - grapheme_allophone: b
      grapheme_phoneme: b
    - grapheme_allophone: mb
      grapheme_phoneme: b
    - grapheme_allophone: m
      grapheme_phoneme: m
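When writing files from Python, the two list styles shown above can be produced with simple string formatting. This is a sketch with hypothetical function names. It does no quoting, so it is only suitable for values that are safe as unquoted YAML scalars.

```python
def flow_sequence(key, values):
    """Format a simple list on one line, e.g. 'phonemes: [p, b, t]'."""
    return f"{key}: [{', '.join(values)}]"


def block_sequence(key, values):
    """Format a simple list with one '- item' line per value."""
    return "\n".join([f"{key}:"] + [f"- {value}" for value in values])


phonemes_line = flow_sequence("phonemes", ["p", "b", "t"])
countries_lines = block_sequence("countries", ["Brazil", "Guyana"])
```

Keeping the phonemes list on one line makes an inventory easy to scan, while the block style gives one item per line for the other lists.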
elevation_meters should have the value '.NAN' if no value is known.
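Combining this rule with the precision rules given earlier for coordinates, an entry could be formatted as in the sketch below. The function name and the coordinate values in the usage note are made up, and the block layout follows the ref_allophones example above. Note that Python's round() rounds exact halves to the nearest even integer, which may differ from the rounding used elsewhere.

```python
import math


def format_coordinate(latitude, longitude, elevation_meters=None):
    """Format one coordinates entry as block-style YAML lines."""
    # An unknown elevation (None or NaN) is written as '.NAN'.
    if elevation_meters is None or math.isnan(elevation_meters):
        elevation = ".NAN"
    else:
        elevation = str(round(elevation_meters))
    return "\n".join([
        f"- latitude: {latitude:.3f}",    # 3 decimal places
        f"  longitude: {longitude:.3f}",  # 3 decimal places
        f"  elevation_meters: {elevation}",
    ])
```

For example, format_coordinate(-3.2, -52.5, 120.4) gives a latitude of -3.200, a longitude of -52.500, and an elevation_meters of 120, while omitting the elevation gives .NAN.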