srophe / syriaca-data

Repository for Syriaca.org TEI data, used by srophe-eXist-app.

Implementing Glottolog URIs for language codes #949

Open wlpotter opened 2 years ago

wlpotter commented 2 years ago

The Glottolog project offers more granular language coverage than the ISO 639 codes and fills several gaps they leave. It has also built data links to other projects and sources, such as the Open Language Archives Community, Wikidata, and the Online Database of Interlinear Text. It might be worthwhile to implement Glottolog URIs, in particular for the gaps left by ISO codes but perhaps even for all language codes.

As an example of a gap filled by Glottolog, take Christian Palestinian Aramaic. While both Jewish Palestinian Aramaic and Samaritan Aramaic have ISO 639-3 codes (jpa and sam, respectively), there is no code for CPA. Glottolog provides such a code: chri1239.

As an example of increased granularity, Glottolog provides a URI for Eastern Syriac as distinct from Western Syriac. Currently, we can only distinguish the two at the level of script: syr-Syrn vs. syr-Syrj.

This dialect-level granularity may not be useful or meaningful in every case, but the current ISO-based system is limited to the generic "syr" or "syc" codes. (As a more meaningful example, Glottolog allows more precise designation of Bohairic, Sahidic, etc. in place of ISO's generic "cop" for Coptic. This is a case we have run into in the Manuscript catalogue.)

Glottolog's URIs also map to ISO codes where available, so we would retain these links if we used Glottocodes. For example, https://glottolog.org/resource/languoid/id/clas1252 (Classical Syriac) links to the ISO code "syc". For that matter, it may be possible to traverse their LOD graph to find the next-broadest languoid that has an ISO equivalent, so "East Syriac" would resolve up the tree to "Classical Syriac", which points to ISO "syc".
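The "resolve up the tree" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the two lookup tables are hand-written toy data based on the example in this comment, not fetched from Glottolog, and the function name `nearest_iso` is hypothetical. The value "syc" is the ISO 639-3 code for Classical Syriac.

```python
from typing import Optional

# Toy data, written by hand for illustration only (not pulled from Glottolog).
# PARENT maps a Glottocode to the Glottocode of its parent languoid.
PARENT = {"east2681": "clas1252"}
# ISO_639_3 maps a Glottocode to its ISO 639-3 code, where one exists.
ISO_639_3 = {"clas1252": "syc"}


def nearest_iso(glottocode: str) -> Optional[str]:
    """Walk up the parent chain until a languoid with an ISO code is found.

    Returns None if no ancestor in the toy tables has an ISO equivalent.
    """
    seen = set()  # guards against accidental cycles in hand-edited data
    current: Optional[str] = glottocode
    while current is not None and current not in seen:
        seen.add(current)
        iso = ISO_639_3.get(current)
        if iso is not None:
            return iso
        current = PARENT.get(current)
    return None
```

With this toy data, `nearest_iso("east2681")` climbs one level to clas1252 and returns its ISO code, while a code with no entry in either table returns `None`. A real implementation would need to build the two tables from Glottolog's published data, which is not shown here.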

We still need to determine exactly how to implement these in @xml:lang attributes. For now, we can use the ISO code when it is available and use an un-prefixed Glottocode, e.g. east2681, for other languages.
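The interim rule just described (ISO code when available, otherwise an un-prefixed Glottocode) could be expressed as a small helper. This is a minimal sketch, not an agreed implementation; the function name `xml_lang_value` and its signature are hypothetical.

```python
from typing import Optional


def xml_lang_value(iso_code: Optional[str], glottocode: Optional[str]) -> str:
    """Pick the value for @xml:lang under the interim rule.

    Prefers the ISO code; falls back to the bare Glottocode.
    """
    if iso_code:
        return iso_code
    if glottocode:
        return glottocode
    raise ValueError("no ISO code or Glottocode available for this language")
```

For instance, `xml_lang_value("jpa", None)` returns "jpa", and `xml_lang_value(None, "east2681")` returns "east2681".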

We should discuss further whether or not we should serialize everything to Glottolog for the sake of accuracy and precision.

wlpotter commented 2 years ago

@davidamichelson I've assigned this to you to edit the above comment as you see fit. We should discuss in an upcoming editors' meeting.