Closed LinguList closed 4 years ago
There are about 386 languoids, and I could get Glottocodes for 340 of them. Some of the languoids belong to the Hmong-Mien family, and I could not identify them. LSI lists these languages but does not propose any affiliation. I also added comments about the Glottocode choices.
@PhyloStar let me know if you think some languages/dialects are missing in Glottolog, so that we can add these.
The ISO name for rtc is Taungθa. It is not present in Glottolog.
TINAULI, TINAULI_OF_SALT_RANGE, and DHANNI are Hindko dialects with the ISO code hno.
lmh (ISO code) is retired. Does that mean lamb1277 is still present?
Gyaomi language: https://en.wikipedia.org/wiki/Gyami The data is present in the LSI vocabulary list.
Via @d97hah:
> The ISO name for rtc is Taungθa. It is not present in Glottolog.
[rtc] is one of those codes for which I am waiting to see a survey. Taungtha is indeed the name, but I doubt that Taungtha occurs in the LSI; could it be Taungthu (a Karen people = Karen-Pa'o blk)? Which vocabulary in LSI is tagged with [rtc]?
> TINAULI, TINAULI_OF_SALT_RANGE, and DHANNI are Hindko dialects with the ISO code hno.
Yes, Tinauli is N Hindko indeed; I'll add a separate dialect glottocode for Tinauli in the next PR. I am not sure what Tinauli of Salt Range refers to; which vocab is it? Dhanni IMHO is not N Hindko, but Western Panjabi or Pothohari, depending on which vocab it is.
> lmh (ISO code) is retired. Does that mean lamb1277 is still present?
No, use Yakkha.
> Gyaomi language: https://en.wikipedia.org/wiki/Gyami The data is present in the LSI vocabulary list.
Use Chinese Mandarin -- these were soldiers stationed in the Qiang region.
The following language names in etc/languages.tsv do not have a counterpart in the document:
Me-gyå
LušēI
Gaṛ’wālī
Šö/K’yang
Gōṇḍi
Banāp’arī
Baḍaga
Bag’ēlī
Western Pahāṛī, Jaunsārī
Wasĩ-veri/Veron
Chib’āli
Kiū̃ṭ’alī
B’ojpurī, Northern
Braj B’āk’ā
Bihārī, Mait’ilī
K’ārawā
Eastern Pahāṛī/K’as-kurā
Hindkī
Orāō̃
B’ojpurī, Southern
Chilāsi
Dak’inī
Ch’attīsgaṛ’ī
Kanaujī
Lōhōrōng
104.
Iškāšmī, Zēbakī
Pānk’ū
Wai-alā
Kōm
Paṅgwāḷī
Pašai, Western
Bašgali
Marāṭ’ī, Dēšī
Siripuriā
Banāp’arī
Chaměaḷī
Chākmā
Paṣ̌tō, of Peshawar
Kurunꭓ/Orāō̃
Pǒgulī
Vernacular Hindōstānī
Kōlāmī
Lāṛī
Kaikāḍī
Ḍōgrī
Yüdɣa
Bag’āṭī
K’ōwār/Chitrālī
K’āndēšī
Šiɣnī
Mēwātī
Gujurī of Hazara
Mālvī
Kōhistānī, Gārwī
Kāṭ’iyāwāḍī
Lab’ānī of Berar
Anāl
D’annī
K’ār^awā
Nīmāḍī
Lahndā, of Shahpur
B’adrawāhī
Kōṅkaṇī
Hrāngk’ol
Bundēlī
Kōta
Banjōgi
of Ḍāh-Hanū
B’īlī
Chingpå/Kachin
Kāṅgrā
T’ādo
Ōrmuṛī
Šǒdōchī
Nīmāḍī
Waꭓī
Central Pahāṛī, Kumaunī
Pašai, Eastern
K’ētrānī
Wazīrī
Pāḍarī
Hirōi-Lamgāng
Munjānī/Mungī
G’isāḍī
Kuḷuī
Oṛiyā
Pañjābī, written
Pahlavī
Maṇḍěāḷī
K’ārᵃwā
Lolo, ˨˦Ñ^i
Eastern Hindī, Awad’ī
Hallām
301.
Pañjābī, spoken
Šiṇā, Gilgitī
Ḍōḍā Sirājī
Chinbōn
of Drās
Lolo, ˨˦Ñi
Mūltānī
Maiyā̃
Mrū
Koḍagu
Cantonese
Malayāḷam
Balōchī, Makrānī
Yådwin
Kui/Kand’ī/Khond
Brāhūī
Nagpuriā
Mālvī
Kachch’ī
Sarīkolī
Tināulī
Nāgpurī
Tuḷu
Punch’ī
Western Hindī, Hindōstānī
Pōwād’ī
Wasĩ-veru/Veron
Tōrwālī
Kašmīrī
Lolo, ˨˦Ñⁱ
Sind’ī, Vichōlī
Pōṭ’wārī
Sirmaurī
Gādī
Laši/Lechi
Rāmbanī
T’aḷī
Bāngarū
Kalāšā
Charōtarī
Chinbōk
Pūrūm
Lušēi
Magahī
Gujarātī, Standard
Kašṭawāṛī
Rājast’ānī, Mārwāṛi
I assume this is due to some encoding issues; I will attach the names I find in the digitization for comparison. They should be identical, as otherwise we cannot do the matching.
So languages.txt is what the current code parses from the digitization; there are some matches, but not all.
Update: by "slugging" the language names, we can match them, which is probably just what we need. That leaves two cases now:
104
301
@PhyloStar, can you look into the source of the digitization and see where these occur in a context where they can be misinterpreted as a language name? This points to errors in either the parsing algorithm or the digitization (similar to the concepts).
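For reference, the "slugging" mentioned above could look roughly like the sketch below. This is only an illustration under assumptions: the functions slug and unmatched are hypothetical stand-ins, not the helpers the repository actually uses, and the two short name lists are picked from the list above rather than read from the real files.

```python
import re
import unicodedata


def slug(name):
    """Reduce a language name to lowercase ASCII letters and digits."""
    # NFKD separates base letters from their diacritics and folds
    # compatibility characters (e.g. superscript letters) to plain ones.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop everything that is not ASCII (combining marks, curly quotes, ...).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())


def unmatched(tsv_names, parsed_names):
    """Return the names in tsv_names whose slug has no counterpart."""
    parsed_slugs = {slug(n) for n in parsed_names}
    return [n for n in tsv_names if slug(n) not in parsed_slugs]


# Illustrative data only: a few names taken from the list above.
tsv_names = ["K’ārᵃwā", "Lušēi", "Gōṇḍi", "104."]
parsed_names = ["K’ārawā", "LušēI", "Gōṇḍi"]
leftover = unmatched(tsv_names, parsed_names)
print(leftover)
```

One caveat of this approach: letters without a Unicode decomposition (for example ɣ or θ) are simply dropped, so two distinct names could in principle collapse to the same slug. Purely numeric entries such as "104." survive slugging and stay unmatched, which is how they show up as leftovers.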
There was a tab-versus-space issue in two cases; I fixed it and updated the raw data. I ran the lexibank_lsi.py code and it didn't complain.
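A quick way to catch this kind of tab/space problem before running the main script might be a small check like the one below. It is a sketch, not part of lexibank_lsi.py: the function name find_delimiter_problems, the three-column layout, and the sample rows (including the ID KOM) are all made up for illustration.

```python
def find_delimiter_problems(text, delimiter="\t"):
    """Return (line_number, reason) pairs for suspicious rows."""
    lines = text.splitlines()
    # The header row defines how many fields every other row should have.
    expected = len(lines[0].split(delimiter))
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split(delimiter)
        if len(fields) != expected:
            reason = "expected %d fields, found %d" % (expected, len(fields))
            problems.append((number, reason))
        elif any(field != field.strip() for field in fields):
            problems.append((number, "field with leading or trailing space"))
    return problems


# Hypothetical sample; the Glottocode column is deliberately left empty.
sample = (
    "Name\tID\tGlottocode\n"
    "Tināulī\tTINAULI\t\n"
    "D’annī  DHANNI\t\n"  # spaces typed where a tab should be
    "Kōm\t KOM\t\n"       # stray space right after the tab
)
problems = find_delimiter_problems(sample)
for number, reason in problems:
    print(number, reason)
```

Rows with the wrong number of fields usually mean spaces were typed instead of a tab; rows with padded fields usually mean a space slipped in next to a tab.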
Uploaded Languages.txt file here.
If this is an uploaded version, you may just replace the file in etc/languages.tsv with this one.
Done!
Use the language names as in the source, and also provide a name that we can use in computational approaches (without Unicode etc., but still in capital letters, with no commas and no spaces), a Glottolog code where available, and, if available, geolocations.
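As a starting point for the computational names, something along these lines could be used. It is a minimal sketch under assumptions: computational_name is a hypothetical helper, and its output would still need manual review, since some IDs discussed in this thread (e.g. DHANNI) do not follow mechanically from the source spelling.

```python
import re
import unicodedata


def computational_name(source_name):
    """Build an ASCII, upper-case name without commas or spaces."""
    # Split off diacritics, then drop every non-ASCII character.
    decomposed = unicodedata.normalize("NFKD", source_name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Each run of non-alphanumeric characters (comma, space, slash, ...)
    # becomes a single underscore; leading/trailing ones are removed.
    joined = re.sub(r"[^A-Za-z0-9]+", "_", ascii_only).strip("_")
    return joined.upper()


# Names taken from the list earlier in this thread.
for name in ["Tināulī", "Gujurī of Hazara", "Pašai, Western"]:
    print(name, "->", computational_name(name))
```

The source name itself stays untouched in its own column; the generated name is only an additional identifier, next to the Glottolog code and geolocation where those are available.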