Open PhyloStar opened 4 years ago
@xrotwang We are looking at the technical validation of the LSI dataset. One aspect is to correlate the LSI phoneme inventory sizes with the PHOIBLE inventory sizes. As mentioned in the post above, I intersected the PHOIBLE languages with the LSI languages by Glottocode and obtained 117 languages. I did the intersection by extracting from values.csv. Is there a CLDF way of extracting the PHOIBLE inventory sizes? Then the extraction procedure would be replicable, right?
So the idea is to validate the segmentation in lsi by showing that it leads to reasonable phoneme inventories, right? I think I'd do this as follows:
Getting inventories for lsi:
collect the segments of the forms in cldf/forms.csv
aggregate them by Language_ID
look up the Glottocode for each Language_ID in cldf/languages.csv
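A minimal Python sketch of these steps, for illustration only. It assumes the default CLDF column names (`ID`, `Language_ID`, `Segments`, `Glottocode`), that `Segments` is a space-separated string, and that `+` and `_` are boundary markers rather than phonemes. The function name `lsi_inventories` and the example rows are made up, and merging varieties that share a Glottocode is a choice of this sketch.

```python
import csv
import io
from collections import defaultdict


def lsi_inventories(forms_file, languages_file):
    """Return {glottocode: set of segments} from CLDF forms and languages tables."""
    # Map each language ID to its Glottocode; languages without one are skipped.
    glottocodes = {
        row["ID"]: row["Glottocode"]
        for row in csv.DictReader(languages_file)
        if row.get("Glottocode")
    }
    inventories = defaultdict(set)
    for row in csv.DictReader(forms_file):
        glottocode = glottocodes.get(row["Language_ID"])
        if glottocode is None:
            continue
        # Assumption: "+" and "_" mark morpheme/word boundaries, not segments.
        for segment in row["Segments"].split():
            if segment not in {"+", "_"}:
                inventories[glottocode].add(segment)
    return dict(inventories)


# Tiny made-up example (not real LSI data) to show the expected shapes.
languages_csv = io.StringIO(
    "ID,Name,Glottocode\nlang1,Variety A,aaaa1234\nlang2,Variety B,\n"
)
forms_csv = io.StringIO(
    "ID,Language_ID,Segments\n1,lang1,k a t\n2,lang1,t a + p u\n3,lang2,m i\n"
)
example = lsi_inventories(forms_csv, languages_csv)
sizes = {glottocode: len(segments) for glottocode, segments in example.items()}
```

With the real data, the two arguments would be open handles on cldf/forms.csv and cldf/languages.csv (opened with `newline=""` and UTF-8 encoding).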
Getting inventories for phoible:
cldf createdb cldf/StructureDataset-metadata.json phoible.sqlite
select
    l.cldf_id,
    l.cldf_glottocode,
    count(v.cldf_id) as all_segments,
    count(case when p.segmentclass = 'vowel' then v.cldf_id end) as vowels,
    count(case when p.segmentclass = 'consonant' then v.cldf_id end) as consonants,
    count(case when p.segmentclass = 'tone' then v.cldf_id end) as tones
from
    languagetable as l, valuetable as v, parametertable as p
where
    v.cldf_languagereference = l.cldf_id and
    p.cldf_id = v.cldf_parameterreference
group by
    l.cldf_id, l.cldf_glottocode;
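To keep the extraction scriptable, the same query can be run from Python with the standard `sqlite3` module. This is a sketch that assumes the database was created as phoible.sqlite by the command above and has the table and column names used in the query; the function name `phoible_inventory_sizes` is made up for illustration.

```python
import sqlite3

# Same query as above: one row per language with counts per segment class.
QUERY = """
select
    l.cldf_id,
    l.cldf_glottocode,
    count(v.cldf_id) as all_segments,
    count(case when p.segmentclass = 'vowel' then v.cldf_id end) as vowels,
    count(case when p.segmentclass = 'consonant' then v.cldf_id end) as consonants,
    count(case when p.segmentclass = 'tone' then v.cldf_id end) as tones
from
    languagetable as l, valuetable as v, parametertable as p
where
    v.cldf_languagereference = l.cldf_id and
    p.cldf_id = v.cldf_parameterreference
group by
    l.cldf_id, l.cldf_glottocode
"""


def phoible_inventory_sizes(connection):
    """Return a list of (id, glottocode, all, vowels, consonants, tones) tuples."""
    return connection.execute(QUERY).fetchall()
```

It would be called as `phoible_inventory_sizes(sqlite3.connect("phoible.sqlite"))`. Because the query groups by language only, a language with several inventories in the database would get one combined count, so it may be worth checking how inventories are keyed before comparing sizes.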
As a lower bound, you could also compare with the inventories derived from ASJP data. This data is now available in segmented form; see https://doi.org/10.5281/zenodo.3835822
Thank you. The breakdown into tones, vowels, and consonants is really cool to get from PHOIBLE. I will try the comparisons from here.
@lingulist The correlation looks reasonable. I did a regression and plotted it in an Excel sheet. We can do a better plot for the paper. Here is the workflow.
Downloaded the CLDF files for PHOIBLE and intersected the LSI languages with PHOIBLE by Glottocode, yielding 117 languages.
PHOIBLE inventories come from multiple sources. Where a language has several inventories, I took the average of their sizes.
Phoneme inventory sizes for each Glottocode in LSI are also printed by https://github.com/lexibank/lsi/blob/master/lsicommands/phonemes.py
Correlated the 117 languages' average PHOIBLE inventory sizes against the phoneme inventory sizes from the LSI word lists. The correlation is 0.46. The Excel sheet is attached: LSI_PHOIBLE_correlations.xlsx
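The averaging and correlation steps of this workflow could be sketched in Python roughly as follows. This is not the script behind the spreadsheet: the function names and input shapes are assumptions, and Pearson's r is used here because the post does not say which correlation coefficient was computed.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def average_sizes(records):
    """Average inventory sizes per Glottocode.

    records: iterable of (glottocode, inventory_size), one per source inventory.
    """
    sizes = defaultdict(list)
    for glottocode, size in records:
        sizes[glottocode].append(size)
    return {glottocode: mean(values) for glottocode, values in sizes.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


def correlate(phoible_sizes, lsi_sizes):
    """Correlate sizes over the Glottocodes present in both mappings."""
    shared = sorted(set(phoible_sizes) & set(lsi_sizes))
    xs = [phoible_sizes[g] for g in shared]
    ys = [lsi_sizes[g] for g in shared]
    return len(shared), pearson(xs, ys)
```

Here `phoible_sizes` would be the output of `average_sizes` and `lsi_sizes` a mapping from Glottocode to the inventory size derived from the word lists; `correlate` returns the number of shared languages together with the coefficient.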