lexibank / lsi

CLDF dataset derived from Grierson's "Linguistic Survey of India" from 1928
https://lsi.clld.org
Creative Commons Attribution 4.0 International

Phoneme inventory list #14

Open PhyloStar opened 4 years ago

PhyloStar commented 4 years ago

@lingulist The correlation looks reasonable. I ran a regression and plotted it in an Excel sheet. We can make a better plot for the paper. Here is the workflow.

PhyloStar commented 4 years ago

@xrotwang We are looking at the technical validation aspect of the LSI dataset. One part of this is to correlate the LSI phoneme inventory sizes with the PHOIBLE inventory sizes. As mentioned in the above post, I intersected the PHOIBLE languages with the LSI languages using Glottocodes and obtained 117 languages. I did the intersection by extracting the data from values.csv. Is there a CLDF way of extracting the PHOIBLE inventory sizes? Then the extraction procedure would be replicable, right?
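
The intersection and correlation described here could be sketched in Python roughly as follows. This is only an illustration, not the workflow that was actually used: the function names `pearson` and `compare_sizes` are made up, and the sketch assumes that each Glottocode has already been reduced to a single inventory size on both sides (PHOIBLE may list several inventories for one Glottocode, so some aggregation choice would be needed first).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def compare_sizes(lsi_sizes, phoible_sizes):
    """Correlate inventory sizes for the Glottocodes present in both mappings.

    Both arguments map Glottocode -> inventory size (assumption: one size
    per Glottocode, see the note above).
    """
    shared = sorted(set(lsi_sizes) & set(phoible_sizes))
    xs = [lsi_sizes[g] for g in shared]
    ys = [phoible_sizes[g] for g in shared]
    return shared, pearson(xs, ys)
```

`compare_sizes` returns the shared Glottocodes together with the correlation coefficient, so the size of the intersection can be reported alongside the result.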

xrotwang commented 4 years ago

So the idea is to validate the segmentation in lsi by showing that it leads to reasonable phoneme inventories, right? I think I'd do this as follows:

  1. Getting inventories for lsi:

    • group cldf/forms.csv by Language_ID
    • aggregate the set of segments in each group
    • possibly categorize this set into vowels, consonants, etc. by looking up CLTS
    • look up the Glottocode for each Language_ID in cldf/languages.csv
  2. Getting inventories for phoible:

    • use the CLDF data in https://doi.org/10.5281/zenodo.2677911
    • load the data into sqlite:
      cldf createdb cldf/StructureDataset-metadata.json phoible.sqlite
    • get the inventories by running a query like
      select
        l.cldf_id,
        l.cldf_glottocode,
        count(v.cldf_id) as all_segments,
        count(case when p.segmentclass = 'vowel' then v.cldf_id end) as vowels,
        count(case when p.segmentclass = 'consonant' then v.cldf_id end) as consonants,
        count(case when p.segmentclass = 'tone' then v.cldf_id end) as tones
      from
        languagetable as l, valuetable as v, parametertable as p
      where
        v.cldf_languagereference = l.cldf_id and
        p.cldf_id = v.cldf_parameterreference
      group by
        l.cldf_id, l.cldf_glottocode;
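
The lsi side (step 1) could be sketched in Python as follows. This is a minimal sketch under assumptions, not a tested procedure for this dataset: it assumes the default CLDF Wordlist column names (`Language_ID` and a space-separated `Segments` column in cldf/forms.csv, `ID` and `Glottocode` in cldf/languages.csv), and the helper names `segment_inventories`, `glottocodes`, and `read_rows` are made up for illustration. The optional CLTS lookup is left out.

```python
import csv
from collections import defaultdict

# Markers that can occur in segment strings but are not sounds
# (assumption: morpheme boundary "+" and word boundary "_").
NON_SOUNDS = {"+", "_"}


def segment_inventories(form_rows):
    """Group form rows by Language_ID and collect the set of segments per language."""
    inventories = defaultdict(set)
    for row in form_rows:
        # Assumption: "Segments" holds space-separated segments.
        for segment in row["Segments"].split():
            if segment not in NON_SOUNDS:
                inventories[row["Language_ID"]].add(segment)
    return dict(inventories)


def glottocodes(language_rows):
    """Map each language ID to its Glottocode, skipping languages without one."""
    return {
        row["ID"]: row["Glottocode"]
        for row in language_rows
        if row.get("Glottocode")
    }


def read_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Calling `segment_inventories(read_rows("cldf/forms.csv"))` and `glottocodes(read_rows("cldf/languages.csv"))` would then give the per-language segment sets and the Glottocode lookup. The inventory size is the length of each set.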

xrotwang commented 4 years ago

See also https://github.com/cldf-datasets/phoible/blob/master/README.md

xrotwang commented 4 years ago

As a lower bound, you could also compare with the inventories derived from ASJP data. This data is now available in segmented form, see https://doi.org/10.5281/zenodo.3835822

PhyloStar commented 4 years ago

Thank you. It is really useful to be able to get the tone, vowel, and consonant information from PHOIBLE. I will work on the comparisons from here.