Phoneme inventory list

PhyloStar commented 4 years ago

@lingulist The correlation looks reasonable. I ran a regression and plotted it in an Excel sheet. We can make a better plot for the paper. Here is the workflow:

Downloaded the CLDF files for PHOIBLE and intersected the LSI languages with PHOIBLE by Glottocode, yielding 117 languages.
PHOIBLE inventories come from multiple sources. Where a language has more than one inventory, I took the average of the inventory sizes across sources.
Phoneme inventory sizes for each Glottocode are also printed by https://github.com/lexibank/lsi/blob/master/lsicommands/phonemes.py
Correlated the 117 languages' average PHOIBLE inventory sizes against the phoneme inventory sizes derived from the word lists. The correlation is 0.46. Attaching the Excel sheet: LSI_PHOIBLE_correlations.xlsx
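
A minimal sketch of the averaging and correlation steps described above, using only the Python standard library. The function names, the toy Glottocodes, and the toy sizes are all made up for illustration. This is not the code in phonemes.py, and it assumes the correlation in question is a plain Pearson coefficient.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def average_sizes(rows):
    """Average inventory size per Glottocode.

    rows: iterable of (glottocode, inventory_size), one pair per PHOIBLE source.
    """
    by_glottocode = defaultdict(list)
    for glottocode, size in rows:
        by_glottocode[glottocode].append(size)
    return {g: mean(sizes) for g, sizes in by_glottocode.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlate(phoible_rows, lsi_sizes):
    """Intersect both datasets by Glottocode and correlate the sizes."""
    phoible = average_sizes(phoible_rows)
    shared = sorted(set(phoible) & set(lsi_sizes))
    r = pearson([phoible[g] for g in shared], [lsi_sizes[g] for g in shared])
    return shared, r


# Toy data (hypothetical Glottocodes and sizes, not taken from LSI or PHOIBLE).
phoible_rows = [
    ("aaaa1234", 30), ("aaaa1234", 34),  # two sources for the same language
    ("bbbb1234", 40),
    ("cccc1234", 25),
    ("dddd1234", 50),                    # not in the word-list data
]
lsi_sizes = {"aaaa1234": 28, "bbbb1234": 35, "cccc1234": 27, "eeee1234": 31}

shared, r = correlate(phoible_rows, lsi_sizes)
print(len(shared), "shared languages, r =", round(r, 2))
```

Only languages present in both datasets enter the correlation, which is the role the Glottocode intersection plays in the workflow.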

PhyloStar commented 4 years ago

@xrotwang We are looking at the technical validation of the LSI dataset. One aspect is to correlate the LSI phoneme inventory sizes against the PHOIBLE inventory sizes. As mentioned in the post above, I intersected the PHOIBLE languages with the LSI languages using Glottocodes and obtained 117 languages. I did the intersection by extracting from values.csv. Is there a CLDF way of extracting the PHOIBLE inventory sizes? Then the extraction procedure would be replicable, right?

xrotwang commented 4 years ago

So the idea is to validate the segmentation in lsi by showing that it leads to reasonable phoneme inventories, right? I think I'd do this as follows:

Getting inventories for lsi:
- group cldf/forms.csv by Language_ID
- aggregate the set of segments in each group
- possibly categorize this set into vowels, consonants, etc. by looking up CLTS
- look up Glottocode by Language_ID in cldf/languages.csv
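
The steps above could be sketched roughly as follows with the standard csv module. The sample rows and the `inventories` function are made up for illustration. The sketch assumes the Segments column is space-separated (the usual CLDF convention) and that "+" and "_" are boundary markers rather than sounds. The optional CLTS categorization step is left out.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for cldf/forms.csv and cldf/languages.csv (hypothetical rows).
FORMS_CSV = """ID,Language_ID,Parameter_ID,Form,Segments
1,lang1,hand,kata,k a t a
2,lang1,foot,pati,p a + t i
3,lang2,hand,mano,m a n o
"""
LANGUAGES_CSV = """ID,Name,Glottocode
lang1,Language One,aaaa1234
lang2,Language Two,bbbb1234
"""

# Assumption: these tokens mark morpheme and word boundaries, not segments.
BOUNDARY_MARKERS = {"+", "_"}


def inventories(forms_file, languages_file):
    """Return {Language_ID: {"glottocode": ..., "segments": set_of_segments}}."""
    glottocodes = {
        row["ID"]: row["Glottocode"] for row in csv.DictReader(languages_file)
    }
    segments = defaultdict(set)
    for row in csv.DictReader(forms_file):
        # Group by Language_ID and aggregate the set of segments per group.
        tokens = set(row["Segments"].split()) - BOUNDARY_MARKERS
        segments[row["Language_ID"]].update(tokens)
    return {
        lid: {"glottocode": glottocodes.get(lid), "segments": segs}
        for lid, segs in segments.items()
    }


inv = inventories(io.StringIO(FORMS_CSV), io.StringIO(LANGUAGES_CSV))
for lid, entry in sorted(inv.items()):
    print(lid, entry["glottocode"], len(entry["segments"]))
```

With real data, the two `io.StringIO` objects would be replaced by the opened CSV files from the dataset's cldf directory.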

Getting inventories for phoible:

use the CLDF data in https://doi.org/10.5281/zenodo.2677911

load the data into sqlite:

cldf createdb cldf/StructureDataset-metadata.json phoible.sqlite

get the inventories by running a query like:

select
 l.cldf_id,
 l.cldf_glottocode,
 count(v.cldf_id) as all_segments,
 count(case when p.segmentclass = 'vowel' then v.cldf_id end) as vowels,
 count(case when p.segmentclass = 'consonant' then v.cldf_id end) as consonants,
 count(case when p.segmentclass = 'tone' then v.cldf_id end) as tones
from
 languagetable as l, valuetable as v, parametertable as p
where
 v.cldf_languagereference = l.cldf_id and
 p.cldf_id = v.cldf_parameterreference
group by
 l.cldf_id, l.cldf_glottocode;
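
To see what the conditional counts in such a query do, here is a small sketch that runs the same query shape from Python against an in-memory SQLite database. The three toy tables only mimic the table and column names used in the query. They are an assumption for illustration, not the full schema that `cldf createdb` produces, and the rows are invented.

```python
import sqlite3

QUERY = """
select
  l.cldf_id,
  l.cldf_glottocode,
  count(v.cldf_id) as all_segments,
  count(case when p.segmentclass = 'vowel' then v.cldf_id end) as vowels,
  count(case when p.segmentclass = 'consonant' then v.cldf_id end) as consonants,
  count(case when p.segmentclass = 'tone' then v.cldf_id end) as tones
from
  languagetable as l, valuetable as v, parametertable as p
where
  v.cldf_languagereference = l.cldf_id and
  p.cldf_id = v.cldf_parameterreference
group by
  l.cldf_id, l.cldf_glottocode
"""

# Toy schema and rows (hypothetical), just enough for the query to run.
SETUP = """
create table languagetable (cldf_id text primary key, cldf_glottocode text);
create table parametertable (cldf_id text primary key, segmentclass text);
create table valuetable (
  cldf_id text primary key,
  cldf_languagereference text,
  cldf_parameterreference text
);
insert into languagetable values ('lang1', 'aaaa1234');
insert into parametertable values
  ('p_a', 'vowel'), ('p_i', 'vowel'),
  ('p_k', 'consonant'), ('p_t', 'consonant'), ('p_m', 'consonant'),
  ('p_high', 'tone');
insert into valuetable values
  ('v1', 'lang1', 'p_a'), ('v2', 'lang1', 'p_i'),
  ('v3', 'lang1', 'p_k'), ('v4', 'lang1', 'p_t'), ('v5', 'lang1', 'p_m'),
  ('v6', 'lang1', 'p_high');
"""


def run_demo():
    """Build the toy database, run the query, and return all result rows."""
    con = sqlite3.connect(":memory:")
    try:
        con.executescript(SETUP)
        return con.execute(QUERY).fetchall()
    finally:
        con.close()


rows = run_demo()
print(rows)
```

`count` ignores NULLs, and a `case` without an `else` branch yields NULL, so each conditional count only tallies the values whose segment belongs to the named class. To run the query on the real data, the connection would point at phoible.sqlite instead of ":memory:".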

xrotwang commented 4 years ago

As a lower bound, you could also compare with the inventories derived from ASJP data. This data is now available in segmented form, see https://doi.org/10.5281/zenodo.3835822

PhyloStar commented 4 years ago

Thank you. It is really cool that the information on tones, vowels, and consonants can be obtained from PHOIBLE. I will try to get the comparisons from here.

lexibank / lsi

Phoneme inventory list #14