Implement psData Overhaul

christophergandrud commented 10 years ago

Following the consensus in #5 a number of major changes will be made to psData for the version 1 release. These include:

[ ] Separate the present getter and variable builder functions into their own packages.
[ ] Implement the framework laid out in the psData guidelines, including:
  [ ] get_data: a function for downloading panel data sets by calling getter and variable builder functions from associated packages.
  [ ] panel_set: a function for cleaning the downloaded data into a political science panel-series (psData) object.
  [ ] panel_merge: a function for merging psData objects.

briatte commented 10 years ago

Here's a demo of an S4 class that could support the guidelines. It's built on top of another framework that lets you require some columns to exist and/or to conform to a condition (e.g. being a strictly positive integer). The resulting psData object works like a data.frame.

christophergandrud commented 10 years ago

@briatte This looks great. I have to work on getting another paper out today, but we should start implementing it with two simple getter functions. The simplest ones in psData now are PolityGet and DpiGet.

I can spin these off into their own packages. Does this make sense as a way to proceed?

On a related side note: any interest in putting together something on this for the Open Knowledge Festival? It's down the street from my flat and looks interesting. The deadline is approaching soon.

briatte commented 10 years ago

Your most basic getters can be coded as calls to a single 'get' function: here's Polity IV and DPI, using a generic method. I've added one more function to have a ?get help page where the parameters of the getters can be detailed.

I have also imported your country code routines (without looking at the actual code yet), and have added a little utility from ggplot2, try_require, to ask the user to install any missing package.

Last, I'm having trouble with the S4 class here.

leeper commented 10 years ago

You might want to think about passing a function (e.g., read.csv) rather than constructing a function name using paste (like paste0('read.', read)). Then you could also use ... to pass additional arguments to the read function. This would make it more flexible in case there are issues with a particular file.
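A minimal sketch of that suggestion, assuming a hypothetical wrapper called read_data (the name and its arguments are illustrative, not part of psData):

```r
# Hypothetical helper: `read_data` and its arguments are illustrative only.
read_data <- function(file, reader = read.csv, ...) {
  # `reader` is any function that takes a file path as its first argument;
  # everything in `...` is forwarded to it unchanged.
  reader(file, ...)
}

# Write a small semicolon-separated file to a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("country;year;score", "AFG;1975;2", "AFG;1976;3"), tmp)

# Same call site, different parsers or parser arguments.
dat1 <- read_data(tmp, reader = read.csv, sep = ";")
dat2 <- read_data(tmp, reader = read.table, sep = ";", header = TRUE)
```

Because the reader is an ordinary function, a caller who hits a badly formatted file can swap in another parser or tweak its arguments without any change to the wrapper.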

antagomir commented 10 years ago

@christophergandrud The rOpenGov core team members participated in OKFest in Helsinki 2012 and Geneva 2013. In both events we joined a larger open data hackathon to work on the rOpenGov topics (back then still focusing mostly on Finland). So Berlin 2014 would be a good place to meet up and continue this tradition with those who have the chance to join. I am thinking of coming but have to confirm.

The call for proposals is open. In principle we could try to organize an rOpenGov hackathon, but attracting participants would be a lot of work, and I also believe this project grows best naturally, by attraction rather than by advertising. But there may well be other, general hackathons that we could join once the events are announced; I would go for that if there is a chance. I would be interested in presenting on rOpenGov and advertising the work of all our package authors, but I doubt there will be a chance, as presentations are discouraged in the instructions. If someone has other suggestions, just throw them in. If this seems to escalate, we need to move the discussion to another thread.

Anyone else planning to join OKFest in Berlin July 2014?

christophergandrud commented 10 years ago

@antagomir The hackathon model would definitely be the way to go. There is a lot of good work that could be done expanding the range of getters, not to mention the other rOpenGov stuff that could be built.

I haven't been before, so others probably know what kind of attendee numbers we would need, etc. If maybe one or two other people are interested, let's move the discussion to another thread?

briatte commented 10 years ago

@leeper: I've adjusted the bottleneck to be soft and let unrecognized URL extensions or abbreviations go through intact. I'm keeping the bottleneck to parse particular formats and try to guess arguments (like separators) from URL extensions. This can be useful to add parsers for formats like JSON, SDMX, SQL or XML (e.g. have the get_data function accept JSON files straight away by converting the object with rjson or a related package).
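A rough sketch of such a soft bottleneck, under the assumption of a hypothetical guess_reader function and a small parsers table (neither name comes from the draft, and the example URLs are placeholders):

```r
# Hypothetical lookup table: known extensions mapped to base R readers.
parsers <- list(
  csv = function(file, ...) read.csv(file, ...),
  tsv = function(file, ...) read.delim(file, ...),
  txt = function(file, ...) read.table(file, header = TRUE, ...)
)

# Hypothetical helper: pick a reader from the caller's request or the URL.
guess_reader <- function(url, read = NULL) {
  # An explicit function from the caller always wins.
  if (is.function(read)) return(read)
  # Otherwise use the abbreviation supplied, or the URL's file extension.
  ext <- if (is.character(read)) read else tools::file_ext(url)
  ext <- tolower(ext)
  if (ext %in% names(parsers)) return(parsers[[ext]])
  # Soft bottleneck: an unrecognized extension is passed through intact.
  ext
}

known   <- guess_reader("http://example.org/data/polity.csv")
unknown <- guess_reader("http://example.org/data/panel.sav")
custom  <- guess_reader("http://example.org/data/panel.sav", read = readLines)
```

Adding a new format (JSON, XML and so on) would then mean adding one entry to the table rather than touching the lookup logic.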

@antagomir: I should be able to attend, and might get my research unit to pay for it if there is an obvious link between the session and quality of government :)

@christophergandrud: I have updated my draft of a psData S4 class to the point where it should work with QOG, Powell and Thyne, Polity IV and DPI data. The draft passes CRAN checks.

Now we need someone who knows S4 better than I do to check my specification and fix the little issues I'm having here and there!

vincentarelbundock commented 10 years ago

I would like you guys to take a look at this (revised) proof of concept for a get_data() function. The sheer simplicity of it, I think, is very appealing.

https://github.com/vincentarelbundock/dpi/blob/master/psData.R

Recipes are held in a GitHub repository (see the Julia language packaging system for an analogue).
Recipes hold all the information we need about a particular dataset. E.g. https://github.com/vincentarelbundock/dpi/blob/master/polity.yaml
get_data() downloads the recipe and the download/cleaning script, uses eval to load the download/cleaning script into memory, and processes the data.

Clone the repo and cd to it:

git clone https://github.com/vincentarelbundock/dpi
cd dpi
R

You just use it like this:

> source('psData.R')
> dat_dpi = get_data('database_political_institutions')
Download dataset info...
Download dataset cleaning script...
Download raw data...
Clean dataset...
> dat_dpi[1:5, 1:5]
   countryname ifs year system yrsoffc
1  South Sudan     2011      0       1   
2  South Sudan     2012      0       2   
41 Afghanistan AFG 1975      0       2   
42 Afghanistan AFG 1976      0       3   
43 Afghanistan AFG 1977      0       4

christophergandrud commented 10 years ago

In the middle of a meeting, but this looks great.

vincentarelbundock commented 10 years ago

Thinking about it, though, it seems like executing downloaded code is a pretty big security risk. But yeah, the basic idea still works, I think.

christophergandrud commented 10 years ago

The recipes could just ship in CRAN packages, with the dev version hosted on GitHub. This would basically be the same thing. I think it would get us the same result without the security/general the-internet-is-sometimes-wonky-and-less-stable-than-accessing-the-user's-file-system problems.

vincentarelbundock commented 10 years ago

Agreed. Just need to find a clean way to redefine the download() function based on user request.
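One possible shape for this, sketched under loud assumptions: the recipes list, its toy_panel entry and this version of get_data are hypothetical, and the download step returns made-up rows so the sketch runs offline. The idea is that each recipe ships with the package as a pair of functions, and get_data picks the pair by name instead of evaluating downloaded code.

```r
# Hypothetical registry: in a real package each entry would ship with the package.
recipes <- list(
  toy_panel = list(
    # Stand-in for a real download step; returns made-up rows, no network needed.
    download = function() {
      data.frame(country = c("A", "A", "B"),
                 year    = c(2000, 2001, 2000),
                 value   = c(1, NA, 3))
    },
    # Cleaning step: drop rows with a missing value.
    clean = function(raw) raw[!is.na(raw$value), ]
  )
)

# Hypothetical dispatcher: look up the recipe by name, then download and clean.
get_data <- function(name, registry = recipes) {
  recipe <- registry[[name]]
  if (is.null(recipe)) stop("No recipe named '", name, "'")
  message("Download raw data...")
  raw <- recipe$download()
  message("Clean dataset...")
  recipe$clean(raw)
}

dat <- get_data("toy_panel")
```

The download function is selected per request, so nothing needs to be redefined at run time, and all code that runs is code the user already installed.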

briatte commented 10 years ago

@vincentarelbundock the recipe idea is a great one, and the analogy with homebrew makes sense. The only downside is what you mention about security, because sandboxing just that is impossible.

The minimal option is to download the script(s) and to message the user about running it/them. Something funnier would be a subclass for raw data, and/or a clean_dataset function :)

antagomir commented 9 years ago

The spacetime R package http://www.jstatsoft.org/v51/i07/ has classes for spatiotemporal data. Not sure but this might already be sufficient for panel series data, or at least provide a starting point if we are ever going to expand the psData package class structures as discussed earlier.

christophergandrud commented 9 years ago

Thanks @antagomir. I think, though, that that plan is being scrapped. There is a lot of similar functionality in other packages, which also differ by use. I'm not sure we would add much by building another one here.

antagomir commented 9 years ago

Yes you may be right.
