Closed SimonGreenhill closed 5 years ago
Sure, but it'd be nice to just be able to download a CSV with that in it :)
I think this kind of functionality should be built upon the CLDF version at cldf-datasets/phoible
write.csv(df, file="dump.csv")
? ;)
i may have the dumps in Rmd somewhere in phoible-scripts.
If we're going to provide a CSV of summary statistics, I tend to agree with Robert that we should write a script that extracts what we want from the CLDF dataset, and publishes that downloadable summary on phoible.org somewhere. If we add such a thing to the dev repository, I worry it won't be very discoverable.
Yeah, I clicked the download button here, which didn't include the inventories, so I then tried to download from the github repository. Makes sense to have it in the CLDF as that's the archivable unit
Ok, I guess I've been voluntold to do this. @SimonGreenhill - if you're in a hurry, here's a CSV with the counts dumped from dev:
https://github.com/bambooforest/phoible-scripts/tree/master/segment-counts
I will go ahead and generate the counts from CLDF and then @xrotwang we can add it to the website for download.
Note an important issue here is regarding TONE counts (@SimonGreenhill ). Some contributors (like UPSID) simply do not include tone in their analysis, so it's actually incorrect to assign the value as zero (which I've done for the time being above).
@drammock - I suppose I should identify which sources do not have tone and then assign them NA (even at the peril of people trying to do math on the CSV file) for those contributions that do not include tone. Or?
Sorry to make more work for you Steve :/ I’m happy to work around it for now, so no rush on my part!
@bambooforest yes, I think it's best to preserve the NAs wherever possible. I suggest including a header comment in the CSV saying "NAs in the toneme column indicate a data source where tone was not fully described; the doculect may or may not have tonemes"
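To illustrate why preserving NAs matters for anyone doing math on the file, here is a minimal sketch of reading such a CSV. The column names, the sample rows, and the `#` comment convention are assumptions made up for this example, not the layout of the actual dump.

```python
import csv
import io

# Hypothetical excerpt; column names and values are invented for illustration.
CSV_TEXT = """\
# NA in the tones column: tone was not described by the source
InventoryID,Source,consonants,vowels,tones
1,SRC_A,20,5,NA
2,SRC_B,18,7,3
"""


def read_counts(text):
    """Parse the CSV, skipping '#' comment lines and mapping 'NA' to None."""
    lines = [line for line in io.StringIO(text) if not line.startswith("#")]
    rows = []
    for row in csv.DictReader(lines):
        for col in ("consonants", "vowels", "tones"):
            row[col] = None if row[col] == "NA" else int(row[col])
        rows.append(row)
    return rows


rows = read_counts(CSV_TEXT)
# Only inventories whose source describes tone take part in tone statistics.
known_tones = [r["tones"] for r in rows if r["tones"] is not None]
```

Keeping `None` separate from `0` means an undescribed tone system cannot silently pull an average down.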
@bambooforest what you put together looks like a good integration test for the CLDF creation pipeline. I'll try to reproduce your numbers from the CLDF, which should be a good test case for the new CLDF -> SQLite functionality I cobbled together last week.
@xrotwang i saw the issue on denormalizing CLDF tables -- is that in the cookbook somewhere?
@bambooforest the best description so far would be here https://github.com/clld/recipes/tree/master/Grambank#accessing-grambank-data-in-sqlite Not really complete, I'm afraid.
@xrotwang - so i suppose i just write my own R joining code for the phoible cldf dump and produce the numbers?
No, rather not. Ideally these numbers would fall out of SQL query run on PHOIBLE CLDF in SQLite. If you can wait for an hour, I'll whip up an example.
Ok, my SQLite conversion is still a bit buggy. But I'll need to fix this anyway :) So thanks for another good test case!
So this is what computing one column of the summary looks like in SQL:
select
    c.cldf_id, l.cldf_iso639P3code, l.cldf_glottocode, count(v.cldf_id) as consonants
from
    `contributions.csv` as c,
    languagetable as l,
    valuetable as v,
    parameterTable as p
where
    v.cldf_languageReference = l.cldf_id
    and v.cldf_parameterReference = p.cldf_id
    and v.contribution_id = c.cldf_id
    and p.segmentclass = 'consonant'
group by c.cldf_id
order by cast(c.cldf_id as int);
I'll add a section describing this to https://github.com/cldf-datasets/phoible/blob/master/README.md
Note that this requires pycldf (which I'll release shortly)
@xrotwang -- ok so i don't need to do anything except update my dump of the dev CSV data with NAs for the sources that we know do not describe tone and to add a column that clarifies this. i'm also fixing and will push the updated table(s):
ok, yes, this would then become PHOIBLE 2.0.1 - or 2.1?
@SimonGreenhill @bambooforest https://github.com/cldf-datasets/phoible/commit/2a039648913a2aeaca49f60499b399bb5c0136e1
I also just learned using CASE inside sum for multiple conditioned counts in one query. Seems very useful.
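A minimal sketch of that CASE-inside-sum pattern, run against a toy in-memory SQLite database. The table names (`parameters`, `segment_values`), columns, and rows are simplified assumptions for illustration, not the schema that the CLDF -> SQLite conversion actually produces.

```python
import sqlite3

# Toy stand-in for the CLDF-derived tables; names and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parameters (id TEXT PRIMARY KEY, segmentclass TEXT);
CREATE TABLE segment_values (
    id INTEGER PRIMARY KEY,
    contribution_id INTEGER,
    parameter_id TEXT
);
INSERT INTO parameters VALUES
    ('p', 'consonant'), ('t', 'consonant'), ('a', 'vowel'), ('H', 'tone');
INSERT INTO segment_values (contribution_id, parameter_id) VALUES
    (1, 'p'), (1, 't'), (1, 'a'),
    (2, 'p'), (2, 'a'), (2, 'H');
""")

# One pass over the joined rows; each SUM counts only rows matching its CASE.
query = """
SELECT
    v.contribution_id,
    SUM(CASE WHEN p.segmentclass = 'consonant' THEN 1 ELSE 0 END) AS consonants,
    SUM(CASE WHEN p.segmentclass = 'vowel' THEN 1 ELSE 0 END) AS vowels,
    SUM(CASE WHEN p.segmentclass = 'tone' THEN 1 ELSE 0 END) AS tones
FROM segment_values AS v
JOIN parameters AS p ON v.parameter_id = p.id
GROUP BY v.contribution_id
ORDER BY v.contribution_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that this pattern yields 0 for an inventory with no tone rows, so it cannot by itself distinguish "no tones" from "tone not described"; that distinction would have to come from per-source information.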
@xrotwang regarding how to version -- i suppose this is 2.0.1 if we consider the cosmetic updates for symbols here (post 2.0 release):
https://github.com/phoible/dev/commit/444a46c9a94641d6c99f5c8bbe85b8ae1c6ce65f
but what should we do about tone on the website? Should I generate the CLDF dumps with NA (or NULL or None or whatever you prefer), so that for sources like UPSID it no longer displays "0" as the value for tone? This would actually be the fairest way of displaying the data, I think.
https://github.com/bambooforest/phoible-scripts/tree/master/segment-counts
contains NAs for tone counts for UPSID and SAPHON (the latter has a binary value for presence of tone in the original dataset, but it's not carried over to phoible -- @xrotwang @drammock i also added a note about this now in the contributors.csv, i.e. "Phonemes. Tone as a binary inventory level value (not incorporated into PHOIBLE).")
@SimonGreenhill so what would you use this data for? I'm trying to figure out whether that's an actual use case: Downloading the counts in a machine-readable format, but then doing something with it with no computing environment which would allow computing these numbers easily? For human inspection of the counts https://phoible.org/inventories is good enough, I'd say. Most other things mean you already have the tools to get the counts.
Sure -- Olena (a new PhD student) wants to test the relationship between the number of phonemes (vowels, consonants) and the number of colexifications (i.e. the fewer phonemes in a language, the more 'collisions' you should have, right?).
At this stage she's just playing with data but I think there's definitely a need for the summary data to be available easily (you already calculate it for the webapp :) and it means I don't need to implement a quick and dirty and probably buggy solution!
Hm. But does this mean, you'd also need a summary table listing number of colexifications? I truly think, there's no way around doing some exploratory analysis to figure out whether some data fits your need, and figuring out how to compute summary stats may be just the right introduction to a dataset. E.g. for this use case, you may want to average the counts where there are more inventories for the same language - and you may need to figure out what to do with tones.
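As a rough sketch of the averaging step mentioned above, assuming the counts are already available per inventory: the glottocodes and numbers below are made up, and `None` stands for "tone not described by the source".

```python
from collections import defaultdict
from statistics import mean

# Made-up rows: (glottocode, consonants, tones); None = tone not described.
inventories = [
    ("lang1234", 20, None),
    ("lang1234", 24, 3),
    ("lang5678", 18, 0),
]


def average_by_language(rows):
    """Average counts per language, ignoring undescribed tone values."""
    grouped = defaultdict(list)
    for glottocode, consonants, tones in rows:
        grouped[glottocode].append((consonants, tones))
    result = {}
    for glottocode, items in grouped.items():
        known = [t for _, t in items if t is not None]
        result[glottocode] = {
            "consonants": mean(c for c, _ in items),
            # Stay None when no inventory for this language describes tone.
            "tones": mean(known) if known else None,
        }
    return result


averages = average_by_language(inventories)
```

Whether averaging is the right way to reconcile several inventories of one language is itself an analysis decision; this only shows one option.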
It's not that I want to hold the data hostage and force you to adopt formats or tools - but I really hope that CLDF and better tooling gets us away from the era of distributing custom CSV downloads.
That said - I'm looking into how to add that info to contributions.csv
...
@SimonGreenhill Ok, so here's the thing: I would find it somewhat acceptable to add these counts to contributions.csv - as a shortcut and maybe also a checksum. But to get what you want, you'd still have to join languages.csv - and do that via values.csv. So it's only a half-way solution. Adding a fully separate counts table OTOH feels wrong - and how much denormalization would you want there, i.e. where to stop?
@SimonGreenhill so here's the half-way solution: https://github.com/cldf-datasets/phoible/commit/c964fd5ef006d42a3158be0e73d90e64876d1ef0#diff-91c1a9334f853b04f5976e63ca6fc5a5 At least it's a bit of advertisement for the descriptive power of csvw.
It just occurred to me that maybe a CLDF metadata viewer might be useful - i.e. some human readable rendering of the metadata file, possibly with an ER diagram.
@xrotwang might be useful to get more adoption of CLDF if that's what you're looking for ;)
Nothing fancy yet, but a proof-of-concept: https://cldf.clld.org/mdviewer.html
Am I supposed to choose a file (json)? I do so, but then nothing happens (Chrome, Safari on Mac OSX)
hm. will check with chrome.
hm. works with chrome for me, and yes, you'd have to choose the *-metadata.json file.
Firefox here, looks good!
@bambooforest you have tough luck with CLDF - ontology viewer doesn't work, now this. And I thought I was going low-tech with only requiring underscore ...
I mean maybe I'm being a complete idiot here, but I'm using the phoible cldf version from Zenodo (cldf-datasets-phoible-350563f) and the file called (StructureDataset-metadata.json).
On the other hand, I suppose there's a future for me as a tester somewhere -- I am pretty good at breaking software.
Could you try to do this with a javascript debugger attached?
@bambooforest Ah. I think I can reproduce the problem. This particular metadata file doesn't work indeed.
@xrotwang you mean the phoible json file isn't valid javascript? oops. :)
it is, but not for this particular viewer, yet.
@bambooforest ok, fixed now. thanx for breaking it :)
@xrotwang i suppose we can close this now? works now on mac chrome and safari (note the formatting is a bit ugly):
yes, feel free to close.
summary statistics via CLDF here:
https://github.com/cldf-datasets/phoible
and via phoible dev data here:
https://github.com/bambooforest/phoible-scripts/tree/master/segment-counts
Awesome, thanks @bambooforest and @xrotwang