lexibank / lsi

CLDF dataset derived from Grierson's "Linguistic Survey of India" from 1928
https://lsi.clld.org
Creative Commons Attribution 4.0 International

add languages.tsv #2

Closed LinguList closed 4 years ago

LinguList commented 4 years ago

Use the language names as given in the source, and also provide a name that we can use in computational approaches (without Unicode, commas, or spaces, but still with capital letters), a Glottolog code where available, and, if available, geolocations.
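A minimal sketch of how such a computation-friendly name could be derived from a source name. The function name `ascii_name` is hypothetical, and the upper-case-with-underscores shape is an assumption modelled on identifiers such as TINAULI_OF_SALT_RANGE that appear later in this thread; it is not necessarily how the names in the dataset were produced.

```python
import re
import unicodedata


def ascii_name(name: str) -> str:
    """Return an upper-case ASCII identifier for a source language name.

    Hypothetical helper, shown for illustration only.
    """
    # Decompose accented characters ("ā" becomes "a" plus a combining macron).
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks, then anything that is still outside ASCII.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    ascii_only = stripped.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^A-Za-z0-9]+", "_", ascii_only).strip("_").upper()


print(ascii_name("Tināulī"))           # TINAULI
print(ascii_name("Gujurī of Hazara"))  # GUJURI_OF_HAZARA
```

Letters that have no ASCII base form (for example ɣ or θ) are silently dropped by this approach, so names containing them would probably need a hand-written mapping instead.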

PhyloStar commented 4 years ago

Languages.txt

There are about 386 languoids, and I could get Glottocodes for 340 of them. Some of the languoids belong to the Hmong-Mien family and I could not identify them. LSI lists these languages but does not propose any affiliation. I also added comments about the Glottocode choices.

xrotwang commented 4 years ago

@PhyloStar let me know if you think some languages/dialects are missing in Glottolog, so that we can add these.

PhyloStar commented 4 years ago

rtc (ISO) name is Taungθa. It is not present in glottolog.

TINAULI, TINAULI_OF_SALT_RANGE, DHANNI are the Hindko dialects with hno (ISO code).

lmh (ISO code) is retired. Does it mean lamb1277 is still present?

Gyaomi language: https://en.wikipedia.org/wiki/Gyami The data is present in LSI vocabulary list.

xrotwang commented 4 years ago

Via @d97hah :

> rtc (ISO) name is Taungθa. It is not present in glottolog.

[rtc] is one of those codes I am waiting to see a survey. Taungtha is indeed the name, but I doubt that Taungtha occurs in the LSI, could it be Taungthu (a Karen people = Karen-Pa'o blk)? Which vocabulary in LSI is tagged with [rtc]?

> TINAULI, TINAULI_OF_SALT_RANGE, DHANNI are the Hindko dialects with hno (ISO code).

Yes, Tinauli is N Hindko indeed, I'll add a separate dialect glottocode for Tinauli for next PR. I am not sure what Tinauli of Salt Range refers to, which vocab is it? Dhanni IMHO is not N Hindko, but Western Panjabi or Pothohari depending on which vocab it is.

> lmh (ISO code) is retired. Does it mean lamb1277 is still present?

No, use Yakkha

> Gyaomi language: https://en.wikipedia.org/wiki/Gyami The data is present in LSI vocabulary list.

Use Chinese Mandarin -- these were soldiers stationed in the Qiang region.

LinguList commented 4 years ago

The following language names in etc/languages.tsv do not have a counterpart in the document:

Me-gyå
LušēI
Gaṛ’wālī
Šö/K’yang
Gōṇḍi
Banāp’arī
Baḍaga
Bag’ēlī
Western Pahāṛī, Jaunsārī
Wasĩ-veri/Veron
Chib’āli
Kiū̃ṭ’alī
B’ojpurī, Northern
Braj B’āk’ā
Bihārī, Mait’ilī
K’ārawā
Eastern Pahāṛī/K’as-kurā
Hindkī
Orāō̃
B’ojpurī, Southern
Chilāsi
Dak’inī
Ch’attīsgaṛ’ī
Kanaujī
Lōhōrōng
104.
Iškāšmī, Zēbakī
Pānk’ū
Wai-alā
Kōm
Paṅgwāḷī
Pašai, Western
Bašgali
Marāṭ’ī, Dēšī
Siripuriā
Banāp’arī 
Chaměaḷī
Chākmā
Paṣ̌tō, of Peshawar
Kurunꭓ/Orāō̃
Pǒgulī
Vernacular Hindōstānī
Kōlāmī
Lāṛī
Kaikāḍī
Ḍōgrī
Yüdɣa
Bag’āṭī
K’ōwār/Chitrālī
K’āndēšī
Šiɣnī
Mēwātī
Gujurī of Hazara
Mālvī
Kōhistānī, Gārwī
Kāṭ’iyāwāḍī
Lab’ānī of Berar
Anāl
D’annī
K’ār^awā
Nīmāḍī
Lahndā, of Shahpur
B’adrawāhī
Kōṅkaṇī
Hrāngk’ol
Bundēlī
Kōta
Banjōgi
of Ḍāh-Hanū
B’īlī
Chingpå/Kachin
Kāṅgrā
T’ādo
Ōrmuṛī
Šǒdōchī
Nīmāḍī 
Waꭓī
Central Pahāṛī, Kumaunī
Pašai, Eastern
K’ētrānī
Wazīrī
Pāḍarī
Hirōi-Lamgāng
Munjānī/Mungī
G’isāḍī
Kuḷuī
Oṛiyā
Pañjābī, written
Pahlavī
Maṇḍěāḷī
K’ārᵃwā
Lolo, ˨˦Ñ^i
Eastern Hindī, Awad’ī
Hallām
301.
Pañjābī, spoken
Šiṇā, Gilgitī
Ḍōḍā Sirājī
Chinbōn
of Drās
Lolo, ˨˦Ñi
Mūltānī
Maiyā̃
Mrū
Koḍagu
Cantonese
Malayāḷam
Balōchī, Makrānī
Yådwin
Kui/Kand’ī/Khond
Brāhūī
Nagpuriā
Mālvī 
Kachch’ī
Sarīkolī
Tināulī
Nāgpurī
Tuḷu
Punch’ī
Western Hindī, Hindōstānī
Pōwād’ī
Wasĩ-veru/Veron
Tōrwālī
Kašmīrī
Lolo, ˨˦Ñⁱ
Sind’ī, Vichōlī
Pōṭ’wārī
Sirmaurī
Gādī
Laši/Lechi
Rāmbanī
T’aḷī
Bāngarū
Kalāšā
Charōtarī
Chinbōk
Pūrūm
Lušēi
Magahī
Gujarātī, Standard
Kašṭawāṛī
Rājast’ānī, Mārwāṛi

I assume this is due to some encoding issues. I will attach the names I find in the digitization for comparison. They should be identical, as otherwise we cannot do the matching.

LinguList commented 4 years ago

languages.txt

LinguList commented 4 years ago

So languages.txt is what the current code parses from the digitization. There are some matches, but not all names match.

LinguList commented 4 years ago

Update: by "slugging" the language names, we can match them, which is probably just what we need. There are two cases left now:

104
301

@PhyloStar, can you look into the source of the digitization and see where these occur in a context where they can be misinterpreted as a language name? This points to errors in either the parsing algorithm or the digitization (similar to the concepts).
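For illustration, the slug-based matching described above could look roughly like the sketch below. The `slug` and `unmatched` functions are hypothetical stand-ins written for this example and are not necessarily what the dataset code uses; the sample names are taken from the list earlier in this thread.

```python
import re
import unicodedata


def slug(name: str) -> str:
    """Reduce a name to lower-case ASCII letters and digits (illustrative only)."""
    # NFKD splits off diacritics; the ASCII round trip then discards them.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Remove whitespace, punctuation, and anything else that is not a-z or 0-9.
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())


def unmatched(listed, digitized):
    """Return names from `listed` whose slug has no counterpart in `digitized`."""
    known = {slug(name) for name in digitized}
    return [name for name in listed if slug(name) not in known]


listed = ["Banāp’arī ", "K’ār^awā", "104."]
digitized = ["Banāp’arī", "K’ārawā"]
print(unmatched(listed, digitized))  # ['104.']
```

Comparing slugs hides differences in trailing whitespace, apostrophes, and superscript markers, which is why only the purely numeric entry remains. The trade-off is that two genuinely different names can collapse to the same slug, so matches found this way are worth a quick manual check.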

PhyloStar commented 4 years ago

There was a tab-and-space issue in those two cases; I fixed it and updated the raw data. I then ran the lexibank_lsi.py code and it didn't complain.
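A check along the following lines might help catch similar slips before they reach the language list. This is only a sketch: the function `suspicious_names` and its three heuristics are assumptions made for illustration, since the thread does not describe the raw data format or the exact nature of the fix.

```python
def suspicious_names(names):
    """Yield (name, reason) pairs for entries that look like parsing slips.

    Hypothetical helper; the heuristics are illustrative assumptions.
    """
    for name in names:
        if name != name.strip():
            yield name, "leading or trailing whitespace"
        elif "\t" in name:
            yield name, "embedded tab"
        elif name.strip(". ").isdigit():
            yield name, "numeric only"


names = ["Mālvī", "Mālvī ", "Nīmāḍī ", "104.", "301."]
for name, reason in suspicious_names(names):
    print(repr(name), reason)
```

With the sample above, the two names with a trailing space and the two numeric entries are reported, while the clean "Mālvī" passes.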

PhyloStar commented 4 years ago

Languages.txt

Uploaded Languages.txt file here.

LinguList commented 4 years ago

If this is an uploaded version, you may just replace the file in etc/languages.tsv with this one.

PhyloStar commented 4 years ago

Done!