lexibank / abvdoceanic

Creative Commons Attribution 4.0 International

Command to extract phoneme inventories for subgroup #25

Closed (antipodite closed 2 years ago)

antipodite commented 2 years ago

Added a quick script to pull out all languages within a subgroup and write their names and their vowel and consonant inventories to a TSV file.

Usage: cldfbench abvdoceanic.inventories --family <glottocode>

I considered connecting Phoible, but its coverage for Oceanic is even worse than I thought. Some time this week I will link it up to the Rongorongo data.

It would also have been useful to add a column with the filename of each orthography profile in the /etc folder, but the profile names and the names I get from pyglottolog don't necessarily match up, so for now I will just pull them out manually for Mary and Timo.
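For illustration only, here is a minimal sketch of the general idea of collecting per-language vowel and consonant inventories and writing them to a TSV. It is not the code in this PR. The hard-coded `VOWELS` set, the function names, and the toy input are all assumptions; a real script would more likely rely on a proper sound-class resource and on the segmented forms in the dataset.

```python
import csv
import io

# Hypothetical vowel set, for illustration only.
VOWELS = set("aeiouəɛɔɨʉ")


def inventories(forms):
    """Map language name -> (sorted vowels, sorted consonants).

    `forms` is an iterable of (language, segments) pairs, where `segments`
    is a list of already tokenised sound strings.
    """
    result = {}
    for language, segments in forms:
        vowels, consonants = result.setdefault(language, (set(), set()))
        for seg in segments:
            # Classify by the first character only (a deliberate simplification).
            (vowels if seg[0] in VOWELS else consonants).add(seg)
    return {lang: (sorted(v), sorted(c)) for lang, (v, c) in result.items()}


def write_tsv(inv, handle):
    """Write one row per language: name, vowels, consonants."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Language", "Vowels", "Consonants"])
    for language in sorted(inv):
        vowels, consonants = inv[language]
        writer.writerow([language, " ".join(vowels), " ".join(consonants)])


# Toy input, not taken from the ABVD data.
sample = [
    ("Fijian", ["t", "a", "m", "a"]),
    ("Fijian", ["v", "a", "l", "e"]),
    ("Samoan", ["f", "a", "l", "e"]),
]
buffer = io.StringIO()
write_tsv(inventories(sample), buffer)
print(buffer.getvalue())
```

Writing to a file handle rather than a path keeps the sketch easy to test; the actual command may well write straight to disk.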

antipodite commented 2 years ago

hm you could actually just do this by filtering languages.csv... I guess this has more of a convenience factor tho
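A rough sketch of what that filtering could look like, using only the standard library. The `SubGroup` column name and the toy rows are hypothetical; the real languages.csv may store the classification differently, or not at all.

```python
import csv
import io


def filter_languages(handle, column, value):
    """Return the rows of a languages.csv-style table where row[column] == value."""
    return [row for row in csv.DictReader(handle) if row.get(column) == value]


# Toy table; the SubGroup column and its values are hypothetical.
table = io.StringIO(
    "ID,Name,SubGroup\n"
    "1,Language A,Group X\n"
    "2,Language B,Group Y\n"
    "3,Language C,Group X\n"
)
for row in filter_languages(table, "SubGroup", "Group X"):
    print(row["ID"], row["Name"])
```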

LinguList commented 2 years ago

Sorry, somehow I wasn't notified by GitHub about this PR, or I missed it among other PRs.

LinguList commented 2 years ago

But it is fine for me to merge this.

LinguList commented 2 years ago

Thanks, even if this can be done by filtering the data, it is still useful to have this script done.

antipodite commented 2 years ago

@LinguList cool, yes it's more convenient than filtering the data. I will add in the other stuff from the issues now. I also built some code that produces output like this (for the deep Austronesian dataset I'm working on with Simon):

╠══ Austronesian: 53949L, 40261C, 74.63% coverage
║   ╠══ Atayalic: 675L, 533C, 78.96% coverage
║   ║   ╠══ Atayal: 199L, 149C, 74.87% coverage
║   ║   ╚══ Seediq: 277L, 235C, 84.84% coverage
║   ╠══ Bunun: 245L, 208C, 84.9% coverage
║   ╠══ East Formosan: 1977L, 1593C, 80.58% coverage
║   ║   ╠══ Central East Formosan: 869L, 687C, 79.06% coverage
║   ║   ║   ╚══ Amis: 531L, 409C, 77.02% coverage
║   ║   ║       ╚══ Northern Amis: 193L, 131C, 67.88% coverage

This way you can compare coverage in different branches. I'm also going to add display by quartiles and coverage per cognate set in the same tree format. Is this something that would be useful to have elsewhere?

LinguList commented 2 years ago

Looks very nice. Is the coverage measured by proto-languages existing for a given node? Or how should I interpret the numbers? Depending on the input data, this is something one could include in the cldfbench/lexibank workflow.

antipodite commented 2 years ago

Here's an example of the full output, including leaf nodes (linked below). The numbers at the protolanguage nodes are just the mean coverage, i.e. cognates versus lexemes, over all the leaves in that subtree. I'm just adding in the other stuff now. Basically it just goes through the flat ABVD data and places it on the Glottolog tree. It should be easy enough to get it to work on a CLDF Wordlist object?

https://gist.github.com/antipodite/2525003d40848735540fd033bf5754bb#file-gistfile1-txt

depth is controllable of course
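As a hedged sketch of the aggregation described here (not the actual code behind the gist), the idea of rolling leaf counts up a classification tree and printing it to a chosen depth might look like this. The `Node` class, the function names, and the toy counts are all assumptions. Internal nodes here report pooled cognates divided by pooled lexemes, which is consistent with the percentages in the sample output above; an unweighted mean of leaf percentages would be a different choice.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    lexemes: int = 0    # only meaningful on leaves
    cognates: int = 0   # only meaningful on leaves
    children: list = field(default_factory=list)


def totals(node):
    """Return (lexemes, cognates) summed over all leaves below `node`."""
    if not node.children:
        return node.lexemes, node.cognates
    lex = cog = 0
    for child in node.children:
        child_lex, child_cog = totals(child)
        lex += child_lex
        cog += child_cog
    return lex, cog


def render(node, max_depth=None, prefix="", is_last=True, depth=0):
    """Return the tree as a list of lines, cut off below `max_depth` if given."""
    lex, cog = totals(node)
    pct = round(100 * cog / lex, 2) if lex else 0.0
    branch = "╚══ " if is_last else "╠══ "
    lines = [f"{prefix}{branch}{node.name}: {lex}L, {cog}C, {pct}% coverage"]
    if max_depth is not None and depth >= max_depth:
        return lines
    child_prefix = prefix + ("    " if is_last else "║   ")
    for i, child in enumerate(node.children):
        last = i == len(node.children) - 1
        lines += render(child, max_depth, child_prefix, last, depth + 1)
    return lines


# Toy tree with made-up counts.
root = Node("Root", children=[
    Node("Branch A", children=[Node("A1", 100, 80), Node("A2", 50, 25)]),
    Node("B", 50, 45),
])
print("\n".join(render(root)))
print("\n".join(render(root, max_depth=1)))
```

Totals are recomputed at every node for simplicity; on a large tree one would probably cache them in a single bottom-up pass.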

LinguList commented 2 years ago

Ah, L=Lexemes. Yes, it looks like this could be added as a pylexibank command, to be run on any dataset in CLDF then. Could you propose this as an issue on pylexibank, pointing to the code, so we can discuss it from there?