Open christophergandrud opened 10 years ago
Here's a demo of an S4 class that could support the guidelines. It's built on top of another framework that lets you require some columns to exist and/or to conform to a condition (e.g. being a strictly positive integer). The resulting psData object works like a data.frame.
@briatte This looks great. I have to work on getting another paper out today, but we should start implementing it with two simple getter functions. The simplest ones in psData now are PolityGet and DpiGet.
I can spin these off into their own packages. Does this make sense as a way to proceed?
On a related side note: any interest in putting together something on this for the Open Knowledge Festival? It's down the street from my flat and looks interesting. The deadline is approaching soon.
Your most basic getters can be coded as calls to a single 'get' function: here's Polity IV and DPI, using a generic method. I've added one more function to have a ?get help page where the parameters of the getters can be detailed.
I have also imported your country code routines (without looking at the actual code yet), and have added a little utility from ggplot2, try_require, to ask the user to install any missing package.
Last, I'm having trouble with the S4 class here.
You might want to think about passing a function (e.g., read.csv) rather than constructing a function name using paste (like paste0('read.', read)), and then you could also use ... to pass additional arguments to the read function. This would make it more flexible in case there are issues with a particular file.
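A minimal sketch of that suggestion, assuming a hypothetical helper named read_remote (neither the name nor its arguments come from psData). The reader is passed as a function and any extra arguments are forwarded through `...`:

```r
# Hypothetical helper: `read_remote` and its arguments are illustrative only.
read_remote <- function(file, reader = read.csv, ...) {
  reader <- match.fun(reader)  # accepts a function or the name of one
  reader(file, ...)            # extra arguments go straight to the reader
}

# Usage with a local temporary file so the example runs offline
tmp <- tempfile(fileext = ".csv")
writeLines(c("country;year;score", "AFG;1975;-7", "AFG;1976;-7"), tmp)

# A semicolon-separated file is handled by forwarding `sep`
dat <- read_remote(tmp, reader = read.csv, sep = ";", stringsAsFactors = FALSE)
```

Compared with building the name via paste0, this lets the caller swap in any reader (for example read.delim) without the helper needing to know about it.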
@christophergandrud The rOpenGov core team members participated in OKFest in Helsinki in 2012 and Geneva in 2013. At both events we joined a larger open data hackathon to work on rOpenGov topics (back then still focusing mostly on Finland). So Berlin 2014 would be a good place to meet up and continue this tradition with those who have the chance to join. I am thinking of coming but have to confirm.
The call for proposals is open. In principle we could try to organize an rOpenGov hackathon, but attracting participants would be a lot of work, and I also believe this project grows best naturally, by attraction rather than by advertising. There may well be other, general hackathons that we could join once the events are announced; I would go for that if there is a chance. I would be interested in presenting on rOpenGov and advertising the work of all our package authors, but I doubt there will be a chance, as presentations are discouraged in the instructions. If someone has other suggestions, just throw them in. If this seems to escalate, we need to move the discussion to another thread.
Anyone else planning to join OKFest in Berlin in July 2014?
@antagomir The hackathon model would definitely be the way to go. There is a lot of good work that could be done expanding the range of getters, not to mention the other rOpenGov stuff that could be built.
I haven't been before, so others probably know what kind of attendee numbers we would need, etc. If one or two other people are interested, let's move the discussion to another thread?
@leeper: I've adjusted the bottleneck to be soft, so that unrecognized URL extensions or abbreviations go through intact. I'm keeping the bottleneck to parse particular formats and to try to guess arguments (like separators) from URL extensions. This can be useful for adding parsers for formats like JSON, SDMX, SQL or XML (e.g. have the get_data function accept JSON files straight away by converting the object with rjson or a related package).
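One way such a soft bottleneck could look, as a sketch only: the lookup table and the function names guess_parser and read_by_extension are hypothetical and are not taken from the actual draft. Known extensions get a reader with guessed defaults; anything else passes through untouched.

```r
# Hypothetical lookup table: file extension -> reader and guessed defaults
parsers <- list(
  csv = list(fun = utils::read.csv,   args = list(sep = ",")),
  tsv = list(fun = utils::read.delim, args = list(sep = "\t")),
  txt = list(fun = utils::read.table, args = list(header = TRUE))
)

# Return the parser entry for a path/URL, or NULL if the extension is unknown
guess_parser <- function(url) {
  ext <- tolower(tools::file_ext(url))
  if (ext %in% names(parsers)) parsers[[ext]] else NULL
}

read_by_extension <- function(path, ...) {
  p <- guess_parser(path)
  if (is.null(p)) {
    # Soft bottleneck: unknown formats are handed back unchanged
    message("Unrecognized extension; returning the path unchanged.")
    return(path)
  }
  # Arguments supplied by the caller override the guessed defaults
  args <- utils::modifyList(p$args, list(...))
  do.call(p$fun, c(list(path), args))
}

# Usage with a local tab-separated file
tmp <- tempfile(fileext = ".tsv")
writeLines(c("a\tb", "1\t2"), tmp)
d <- read_by_extension(tmp)
passthrough <- read_by_extension("data.unknownfmt")
```

Supporting another format (JSON, for instance) would then only mean adding one entry to the table.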
@antagomir: I should be able to attend, and might get my research unit to pay for it if there is an obvious link between the session and quality of government :)
@christophergandrud: I have updated my draft of a psData S4 class to the point where it should work with QOG, Powell and Thyne, Polity IV and DPI data. The draft passes CRAN checks.
Now we need someone who knows S4 better than I do to check my specification and fix the little issues I'm having here and there!
I would like you guys to take a look at this (revised) proof of concept for a get_data() function. The sheer simplicity of it, I think, is very appealing.
https://github.com/vincentarelbundock/dpi/blob/master/psData.R
get_data() downloads the recipe and the download/cleaning script, uses eval to load the download/cleaning script into memory, and processes the data.
Clone the repo and cd to it:
git clone https://github.com/vincentarelbundock/dpi
cd dpi
R
You just use it like this:
> source('psData.R')
> dat_dpi = get_data('database_political_institutions')
Download dataset info...
Download dataset cleaning script...
Download raw data...
Clean dataset...
> dat_dpi[1:5, 1:5]
   countryname ifs year system yrsoffc
1  South Sudan     2011      0       1
2  South Sudan     2012      0       2
41 Afghanistan AFG 1975      0       2
42 Afghanistan AFG 1976      0       3
43 Afghanistan AFG 1977      0       4
In the middle of a meeting, but this looks great.
Thinking about it, though, it seems like executing downloaded code is a pretty big security risk. But yeah, the basic idea still works, I think.
The recipes could just ship in CRAN packages, with the dev version hosted on GitHub. This would basically be the same thing. I think it would get us the same result without the security/general the-internet-is-sometimes-wonky-and-less-stable-than-accessing-the-user's-file-system problems.
Agreed. Just need to find a clean way to redefine the download() function based on user request.
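A rough sketch of how recipes shipped with a package might let the user override the download step. Everything here is an assumption for illustration: the recipes list, the toy_dataset entry and get_data_sketch are made-up names, and the "download" just builds a small data frame so the example runs offline.

```r
# Hypothetical registry: in a real package each recipe might live in its own file.
recipes <- list(
  toy_dataset = list(
    download = function() {
      # Stand-in for a real download step; returns the raw data
      data.frame(Country = c("A", "B"), YEAR = c(2000, 2001),
                 stringsAsFactors = FALSE)
    },
    clean = function(raw) {
      # Stand-in cleaning step: normalize column names
      names(raw) <- tolower(names(raw))
      raw
    }
  )
)

get_data_sketch <- function(name, download = NULL) {
  recipe <- recipes[[name]]
  if (is.null(recipe)) stop("No recipe named '", name, "'")
  # The caller may replace the recipe's default download step
  if (is.null(download)) download <- recipe$download
  recipe$clean(download())
}

# Default download, then a user-supplied one
d1 <- get_data_sketch("toy_dataset")
d2 <- get_data_sketch("toy_dataset", download = function() {
  data.frame(Country = "C", YEAR = 1999, stringsAsFactors = FALSE)
})
```

Because the recipes are ordinary functions installed with the package, nothing is fetched and evaluated at run time except the data itself.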
@vincentarelbundock the recipe idea is a great one, and the analogy with homebrew makes sense. The only downside is what you mention about security, because sandboxing just that step is impossible.
The minimal option is to download the script(s) and to message the user about running it/them. Something funnier would be a subclass for raw data, and/or a clean_dataset function :)
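The minimal option could be sketched roughly as follows. The function name run_recipe_script and its confirm argument are hypothetical. The script is assumed to be already downloaded to a local path, and the separate environment only keeps its objects out of the workspace; it is not a sandbox.

```r
# Hypothetical helper: tell the user about the script and only run it on consent.
run_recipe_script <- function(script_path, confirm = NULL) {
  message("About to run downloaded script: ", script_path)
  if (is.null(confirm)) {
    # Ask only in interactive sessions; otherwise default to not running
    confirm <- interactive() &&
      tolower(readline("Run it? [y/N] ")) == "y"
  }
  if (!isTRUE(confirm)) {
    message("Script not run.")
    return(invisible(NULL))
  }
  env <- new.env()
  source(script_path, local = env)  # objects land in `env`, not the workspace
  invisible(env)
}

# Usage with a local stand-in for a downloaded script
script <- tempfile(fileext = ".R")
writeLines("x <- 1 + 1", script)
ran     <- run_recipe_script(script, confirm = TRUE)
skipped <- run_recipe_script(script, confirm = FALSE)
```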
The spacetime R package http://www.jstatsoft.org/v51/i07/ has classes for spatiotemporal data. Not sure but this might already be sufficient for panel series data, or at least provide a starting point if we are ever going to expand the psData package class structures as discussed earlier.
Thanks @antagomir. I think though that that plan is being scrapped. There is a lot of similar functionality in other packages, that also differ by use. I'm not sure we add much by doing another here.
Yes you may be right.
Following the consensus in #5 a number of major changes will be made to psData for the version 1 release. These include:
get_data: a function for downloading panel data sets by calling getter and variable builder functions from associated packages.
panel_set: a function for cleaning the downloaded data into a political science panel-series (psData) object.
panel_merge: a function for merging psData objects.
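As an illustration of the merging step only, here is a sketch that treats panel objects as plain data frames keyed by unit and time. The name panel_merge_sketch, the key columns iso2c and year, and the keep_all argument are assumptions, not the planned psData interface.

```r
# Hypothetical sketch: merge two panel data frames on unit and time keys.
panel_merge_sketch <- function(x, y, by = c("iso2c", "year"), keep_all = TRUE) {
  # Both inputs must carry the identifying columns
  stopifnot(all(by %in% names(x)), all(by %in% names(y)))
  merge(x, y, by = by, all = keep_all)
}

a <- data.frame(iso2c = c("AF", "AF"), year = c(1975, 1976),
                polity = c(-7, -7), stringsAsFactors = FALSE)
b <- data.frame(iso2c = "AF", year = 1975, system = 0,
                stringsAsFactors = FALSE)

# Country-years missing from one input are kept and filled with NA
m <- panel_merge_sketch(a, b)
```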