lexibank / lsi

CLDF dataset derived from Grierson's "Linguistic Survey of India" from 1928
https://lsi.clld.org
Creative Commons Attribution 4.0 International

Update required for Eastern Balochi and Eastern Bengali #28

Closed: PhyloStar closed this issue 3 years ago

PhyloStar commented 3 years ago

Eastern Balochi is here: https://github.com/lexibank/lsi/blob/master/etc/languages.tsv#L274

Eastern Bengali is here: https://github.com/lexibank/lsi/blob/master/etc/languages.tsv#L334

"Eastern" is the language name in the LSI, but we can tell whether it is Balochi or Bengali from the indentation in the LSI.

Another way to distinguish Balochi from Bengali is the "numberInSource": Bengali has "546." and Balochi has "366.".

We will need the following changes:

Update the line here: https://github.com/lexibank/lsi/blob/master/lexibank_lsi.py#L83

if number[:4] == "546.": language = "Bengali, Eastern"

Update the entry for Eastern Bengali in https://github.com/lexibank/lsi/blob/master/etc/languages.tsv to the following. The first column is changed to avoid a conflict with the "Eastern" entry for Balochi.

Bengali, Eastern EASTERN_BENGALI 333 Indo-European Family, Aryan sub-family Indo-Aryan 546. bgp east2744
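The number-based rule above can be sketched as a small lookup. This is only an illustration, not the code in lexibank_lsi.py: the helper name `resolve_language` and the table `NUMBER_PREFIX_TO_NAME` are hypothetical, and the only mapping taken from this issue is "546." to "Bengali, Eastern".

```python
# Hypothetical helper, not part of lexibank_lsi.py.
# Maps a "numberInSource" prefix to an unambiguous language name.
NUMBER_PREFIX_TO_NAME = {
    "546.": "Bengali, Eastern",  # rename proposed in this issue
}


def resolve_language(name, number_in_source):
    """Return a unique name for a heading such as "Eastern".

    If the source number starts with a known prefix, the mapped name is
    returned; otherwise the name from the LSI is kept unchanged.
    """
    for prefix, unique_name in NUMBER_PREFIX_TO_NAME.items():
        if number_in_source.startswith(prefix):
            return unique_name
    return name
```

With this table, an "Eastern" heading numbered "546." resolves to "Bengali, Eastern", while the one numbered "366." keeps the name "Eastern" and so still matches the Balochi row in languages.tsv.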

The orthography profile for EASTERN_BALOCHI is missing from the etc/orthography directory. @lingulist: How do we go about this? We need to create an orthography profile for it and update the one for EASTERN_BENGALI, since the profile for EASTERN_BENGALI was built from EASTERN_BALOCHI data as well.

I think this is the only remaining part. Once we get this done, I will rerun our ICSTLL analyses and add them to the paper.

PhyloStar commented 3 years ago

@xrotwang Thank you. A question on orthography profiles: what happens if a doculect does not have an orthography profile? EASTERN_BALOCHI does not have one, but the segmentation and IPA transcription still seem to work.

PhyloStar commented 3 years ago

Handling Eastern Bengali: when a single meaning has several synonyms, the if statement does not handle them. I propose placing the if statement before the if not language.strip(): block so that the language variable is still set to Eastern Bengali. Otherwise, the synonym for Sun is ignored.
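A minimal sketch of the proposed ordering, under stated assumptions: the row layout `(language, number, form)`, the function name `assign_languages`, and the idea that rows with a blank language cell are skipped are all hypothetical, and the real loop in lexibank_lsi.py may look different. The sketch also assumes that a synonym row still carries the "546." number.

```python
def assign_languages(rows):
    """Return (language, form) pairs for the rows that are kept.

    rows: iterable of (language, number, form) tuples (assumed layout).
    """
    kept = []
    for language, number, form in rows:
        # Proposed order: apply the number-based rename first, so that a
        # synonym row with a blank language cell still gets a language.
        if number[:4] == "546.":
            language = "Bengali, Eastern"
        # Rows that still have no language are skipped (assumed behaviour).
        if not language.strip():
            continue
        kept.append((language, form))
    return kept
```

If the two checks were swapped, the second row in an input such as `[("Eastern", "546.", "form-1"), ("", "546.", "form-2")]` would be dropped before the rename could apply, which matches the missing-synonym symptom described above.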

xrotwang commented 3 years ago

Regarding orthography profiles: there's a default/fallback profile at https://github.com/lexibank/lsi/blob/master/etc/orthography.tsv. It may just be a remnant of a time with no language-specific profiles, so the fallback may not be intended.

PhyloStar commented 3 years ago

Thanks. I will copy the EASTERN_BENGALI profile for EASTERN_BALOCHI and edit it to get a new profile. Then we can proceed with the experiments.

Can you please look at the creation of reference trees (#18)? Is it possible to extract a subtree from Glottolog for a list of glottocodes using pyglottolog? Otherwise, I will get a student to work on it.