phoible / dev

PHOIBLE data and development.
https://phoible.org/
GNU General Public License v3.0

No obvious way to get summary data (e.g. number of phonemes, number of vowels) per language #219

Closed SimonGreenhill closed 5 years ago

bambooforest commented 5 years ago
library(dplyr)

load("phoible.RData")

phoible %>% group_by(InventoryID, ISO6393) %>% summarize(phoneme.count=n())

phoible %>% group_by(InventoryID, ISO6393) %>% filter(SegmentClass=="vowel") %>% summarize(vowel.count=n())

etc. ?

SimonGreenhill commented 5 years ago

Sure, but it'd be nice to just be able to download a CSV with that in it :)

xrotwang commented 5 years ago

I think this kind of functionality should be built upon the CLDF version at cldf-datasets/phoible

bambooforest commented 5 years ago

write.csv(df, file="dump.csv")

? ;)

i may have the dumps in Rmd somewhere in phoible-scripts.

drammock commented 5 years ago

If we're going to provide a CSV of summary statistics, I tend to agree with Robert that we should write a script that extracts what we want from the CLDF dataset, and publishes that downloadable summary on phoible.org somewhere. If we add such a thing to the dev repository, I worry it won't be very discoverable.

SimonGreenhill commented 5 years ago

Yeah, I clicked the download button here, which didn't include the inventories, so I then tried to download from the github repository. Makes sense to have it in the CLDF as that's the archivable unit

bambooforest commented 5 years ago

Ok, I guess I've been voluntold to do this. @SimonGreenhill - if you're in a hurry, here's a CSV with the counts dumped from dev:

https://github.com/bambooforest/phoible-scripts/tree/master/segment-counts

I will go ahead and generate the counts from CLDF and then @xrotwang we can add it to the website for download.

Note an important issue here is regarding TONE counts (@SimonGreenhill ). Some contributors (like UPSID) simply do not include tone in their analysis, so it's actually incorrect to assign the value as zero (which I've done for the time being above).

@drammock - I suppose I should identify which sources do not have tone and then assign them NA (even at the peril of people trying to do math on the CSV file) for those contributions that do not include tone. Or?

SimonGreenhill commented 5 years ago

Sorry to make more work for you Steve :/ I’m happy to work around it for now, so no rush on my part!

drammock commented 5 years ago

@bambooforest yes, I think it's best to preserve the NAs wherever possible. I suggest including a header comment in the CSV saying "NAs in the toneme column indicate a data source where tone was not fully described; the doculect may or may not have tonemes"
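To make the difference between NA and zero concrete, here is a minimal sketch in Python. The rows, the column names and the helper functions `parse_count` and `mean_known` are made up for illustration and are not part of PHOIBLE or its scripts; the only thing taken from the thread is the convention that "NA" marks a source that did not describe tone.

```python
import csv
import io

# Hypothetical summary rows; IDs, sources and counts are invented for illustration.
rows = [
    {"InventoryID": "1", "Source": "spa", "phonemes": "40", "tones": "0"},
    {"InventoryID": "2", "Source": "upsid", "phonemes": "25", "tones": "NA"},
    {"InventoryID": "3", "Source": "spa", "phonemes": "33", "tones": "4"},
]


def parse_count(value):
    """Return an int, or None when the source did not describe the feature."""
    return None if value == "NA" else int(value)


def mean_known(values):
    """Average only the known values; None entries are skipped."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known) if known else None


# Round-trip through CSV text to show that the "NA" marker survives.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["InventoryID", "Source", "phonemes", "tones"])
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
tones = [parse_count(row["tones"]) for row in csv.DictReader(buffer)]

mean_with_na = mean_known(tones)                          # NA rows are skipped
mean_with_zero = sum(t or 0 for t in tones) / len(tones)  # NA wrongly read as 0
```

With these toy rows the two means differ, which is the distortion that a zero placeholder would silently introduce into any downstream statistics.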

xrotwang commented 5 years ago

@bambooforest what you put together looks like a good integration test for the CLDF creation pipeline. I'll try to reproduce your numbers from the CLDF - which should be a good test case for the new CLDF -> SQLite functionality I cobbled together last week.

bambooforest commented 5 years ago

@xrotwang i saw the issue on denormalizing CLDF tables -- is that in the cookbook somewhere?

xrotwang commented 5 years ago

@bambooforest the best description so far would be here https://github.com/clld/recipes/tree/master/Grambank#accessing-grambank-data-in-sqlite Not really complete, I'm afraid.

bambooforest commented 5 years ago

@xrotwang - so i suppose i just write my own R joining code for the phoible cldf dump and produce the numbers?

xrotwang commented 5 years ago

No, rather not. Ideally these numbers would fall out of SQL query run on PHOIBLE CLDF in SQLite. If you can wait for an hour, I'll whip up an example.

xrotwang commented 5 years ago

Ok, my SQLite conversion is still a bit buggy. But I'll need to fix this anyway :) So thanks for another good test case!

xrotwang commented 5 years ago

So this is what computing one column of the summary looks like in SQL:

select
    c.cldf_id, l.cldf_iso639P3code, l.cldf_glottocode, count(v.cldf_id) as consonants 
from
    `contributions.csv` as c, 
    languagetable as l, 
    valuetable as v, 
    parameterTable as p 
where 
    v.cldf_languageReference = l.cldf_id 
    and v.cldf_parameterReference = p.cldf_id 
    and v.contribution_id = c.cldf_id 
    and p.segmentclass = 'consonant' 
group by c.cldf_id order by cast(c.cldf_id as int);

xrotwang commented 5 years ago

I'll add a section describing this to https://github.com/cldf-datasets/phoible/blob/master/README.md Note that this requires

bambooforest commented 5 years ago

@xrotwang -- ok so i don't need to do anything except update my dump of the dev CSV data with NAs for the sources that we know do not describe tone and to add a column that clarifies this. i'm also fixing and will push the updated table(s):

https://github.com/clld/phoible/issues/19

xrotwang commented 5 years ago

ok, yes, this would then become PHOIBLE 2.0.1 - or 2.1?

xrotwang commented 5 years ago

@SimonGreenhill @bambooforest https://github.com/cldf-datasets/phoible/commit/2a039648913a2aeaca49f60499b399bb5c0136e1

xrotwang commented 5 years ago

I also just learned about using CASE inside sum to get multiple conditional counts in one query. Seems very useful.
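The CASE-inside-sum pattern can be sketched as follows, run from Python against an in-memory SQLite database. The two-table schema and the rows are a toy assumption, loosely modelled on the table and column names in the query quoted earlier in this thread; the real CLDF-to-SQLite conversion may name or structure things differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy schema and data (assumed for illustration, not the real PHOIBLE tables).
conn.executescript("""
CREATE TABLE parametertable (
    cldf_id TEXT PRIMARY KEY,
    segmentclass TEXT
);
CREATE TABLE valuetable (
    cldf_id TEXT PRIMARY KEY,
    contribution_id TEXT,
    cldf_parameterReference TEXT
);
INSERT INTO parametertable VALUES
    ('p', 'consonant'), ('t', 'consonant'),
    ('a', 'vowel'), ('i', 'vowel'), ('H', 'tone');
INSERT INTO valuetable VALUES
    ('1-p', '1', 'p'), ('1-t', '1', 't'), ('1-a', '1', 'a'),
    ('2-p', '2', 'p'), ('2-a', '2', 'a'), ('2-i', '2', 'i'), ('2-H', '2', 'H');
""")

# One pass over the values: each sum() adds 1 only for rows matching its CASE.
query = """
SELECT
    v.contribution_id,
    count(v.cldf_id) AS segments,
    sum(CASE WHEN p.segmentclass = 'consonant' THEN 1 ELSE 0 END) AS consonants,
    sum(CASE WHEN p.segmentclass = 'vowel' THEN 1 ELSE 0 END) AS vowels,
    sum(CASE WHEN p.segmentclass = 'tone' THEN 1 ELSE 0 END) AS tones
FROM valuetable AS v
JOIN parametertable AS p ON v.cldf_parameterReference = p.cldf_id
GROUP BY v.contribution_id
ORDER BY cast(v.contribution_id AS int)
"""
summary = conn.execute(query).fetchall()
conn.close()
```

Each conditional sum replaces what would otherwise be a separate filtered query per segment class, so the whole summary row per inventory comes out of a single GROUP BY.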

bambooforest commented 5 years ago

@xrotwang regarding how to version -- i suppose this is 2.0.1 if we consider the cosmetic updates for symbols here (post 2.0 release):

https://github.com/phoible/dev/commit/444a46c9a94641d6c99f5c8bbe85b8ae1c6ce65f

but what should we do about tone on the website? Should I generate the CLDF dumps with NA (or NULL or None or whatever you prefer), so that in sources like UPSID it no longer displays "0" as the value for tone? This would actually be the fairest way of displaying the data, I think.

bambooforest commented 5 years ago

https://github.com/bambooforest/phoible-scripts/tree/master/segment-counts

contains NAs for tone counts for UPSID and SAPHON (the latter has a binary value for presence of tone in the original dataset, but it's not carried over to phoible -- @xrotwang @drammock i also added a note about this now in the contributors.csv, i.e. "Phonemes. Tone as a binary inventory level value (not incorporated into PHOIBLE).")

xrotwang commented 5 years ago

@SimonGreenhill so what would you use this data for? I'm trying to figure out whether that's an actual use case: Downloading the counts in a machine-readable format, but then doing something with it with no computing environment which would allow computing these numbers easily? For human inspection of the counts https://phoible.org/inventories is good enough, I'd say. Most other things mean you already have the tools to get the counts.

SimonGreenhill commented 5 years ago

Sure -- Olena (a new PhD student) wants to test the relationship between the number of phonemes (vowels, consonants) and the number of colexifications (i.e. the fewer phonemes a language has, the more 'collisions' you should have, right?).

At this stage she's just playing with data but I think there's definitely a need for the summary data to be available easily (you already calculate it for the webapp :) and it means I don't need to implement a quick and dirty and probably buggy solution!

xrotwang commented 5 years ago

Hm. But does this mean you'd also need a summary table listing the number of colexifications? I truly think there's no way around doing some exploratory analysis to figure out whether some data fits your need, and figuring out how to compute summary stats may be just the right introduction to a dataset. E.g. for this use case, you may want to average the counts where there are several inventories for the same language - and you may need to figure out what to do with tones.
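As a rough illustration of the averaging step mentioned above, the following Python sketch collapses several inventories of one language into a mean phoneme count. The glottocodes and counts are placeholders, and `mean_count_per_language` is a hypothetical helper; whether a mean is the right aggregate at all is a decision for the analysis, not something the data prescribes.

```python
from collections import defaultdict
from statistics import mean

# (inventory_id, glottocode, phoneme_count); all values are invented placeholders.
inventories = [
    (1, "aaaa1234", 30),
    (2, "aaaa1234", 34),
    (3, "bbbb1234", 22),
]


def mean_count_per_language(rows):
    """Group inventories by glottocode and average their phoneme counts."""
    grouped = defaultdict(list)
    for _inventory_id, glottocode, count in rows:
        grouped[glottocode].append(count)
    return {code: mean(counts) for code, counts in grouped.items()}


per_language = mean_count_per_language(inventories)
```

Tone counts would need separate handling here, since inventories from sources that do not describe tone should not contribute a zero to the average.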

It's not that I want to hold the data hostage and force you to adopt formats or tools - but I really hope that CLDF and better tooling gets us away from the era of distributing custom CSV downloads.

xrotwang commented 5 years ago

That said - I'm looking into how to add that info to contributions.csv ...

xrotwang commented 5 years ago

@SimonGreenhill Ok, so here's the thing: I would find it somewhat acceptable to add these counts to contributions.csv - as a shortcut and maybe also a checksum. But to get what you want, you'd still have to join languages.csv - and do that via values.csv. So it's only a half-way solution. Adding a fully separate counts table OTOH feels wrong - and how much denormalization would you want there, i.e. where to stop?

xrotwang commented 5 years ago

@SimonGreenhill so here's the half-way solution: https://github.com/cldf-datasets/phoible/commit/c964fd5ef006d42a3158be0e73d90e64876d1ef0#diff-91c1a9334f853b04f5976e63ca6fc5a5 At least it's a bit of advertisement for the descriptive power of csvw.

xrotwang commented 5 years ago

It just occurred to me that maybe a CLDF metadata viewer might be useful - i.e. some human readable rendering of the metadata file, possibly with an ER diagram.

bambooforest commented 5 years ago

@xrotwang might be useful to get more adoption of CLDF if that's what you're looking for ;)

xrotwang commented 5 years ago

Nothing fancy yet, but a proof-of-concept: https://cldf.clld.org/mdviewer.html

bambooforest commented 5 years ago

Am I supposed to choose a file (json)? I do so, but then nothing happens (Chrome, Safari on Mac OSX)

xrotwang commented 5 years ago

hm. will check with chrome.

xrotwang commented 5 years ago

hm. works with chrome for me, and yes, you'd have to choose the *-metadata.json file.

SimonGreenhill commented 5 years ago

Firefox here, looks good!

bambooforest commented 5 years ago

[Screenshot: Screen Shot 2019-05-08 at 1 41 56 PM]

xrotwang commented 5 years ago

@bambooforest you have tough luck with CLDF - ontology viewer doesn't work, now this. And I thought I was going low-tech with only requiring underscore ...

bambooforest commented 5 years ago

I mean maybe I'm being a complete idiot here, but I'm using the phoible cldf version from Zenodo (cldf-datasets-phoible-350563f) and the file called (StructureDataset-metadata.json).

On the other hand, I suppose there's a future for me as a tester somewhere -- I am pretty good at breaking software.

[Screenshot: Screen Shot 2019-05-08 at 1 55 46 PM]

xrotwang commented 5 years ago

Could you try to do this with a javascript debugger attached?

xrotwang commented 5 years ago

@bambooforest Ah. I think I can reproduce the problem. This particular metadata file doesn't work indeed.

bambooforest commented 5 years ago

@xrotwang you mean the phoible json file isn't valid javascript? oops. :)

xrotwang commented 5 years ago

it is, but not for this particular viewer, yet.

xrotwang commented 5 years ago

@bambooforest ok, fixed now. thanx for breaking it :)

bambooforest commented 5 years ago

@xrotwang i suppose we can close this now? works now on mac chrome and safari (note the formatting is a bit ugly):

[Screenshot: Screen Shot 2019-05-09 at 9 41 59 AM]

xrotwang commented 5 years ago

yes, feel free to close.

bambooforest commented 5 years ago

summary statistics via CLDF here:

https://github.com/cldf-datasets/phoible

and via phoible dev data here:

https://github.com/bambooforest/phoible-scripts/tree/master/segment-counts

SimonGreenhill commented 5 years ago

Awesome, thanks @bambooforest and @xrotwang