vincentarelbundock opened this issue 10 years ago
Copied over from https://github.com/rOpenGov/psData/issues/8
You would have two files:

- database_political_institutions.yaml (download URL, BibTeX citation, etc.)
- database_political_institutions.R (cleaning script with all transformations)

And a standardized function:

get_data(): parse the .yaml file, download the data if it is not already cached, and run the R script. If flagged for caching, copy the YAML file, the raw data, the processed data, and the R script to the specified path.
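To make that concrete, here is a rough sketch of what get_data() could look like. This is only an illustration: the recipe field names (download_url, cleaning_script) and the convention that the cleaning script defines a clean() function are assumptions, not a settled design.

```r
# Sketch only -- the recipe field names and the clean() convention are assumptions.
library(yaml)

get_data <- function(recipe_path, cache_dir = tempdir(), cache = FALSE) {
  recipe <- yaml::yaml.load_file(recipe_path)

  # Download the raw data only if it is not already in the cache
  raw_file <- file.path(cache_dir, basename(recipe$download_url))
  if (!file.exists(raw_file)) {
    download.file(recipe$download_url, destfile = raw_file, mode = "wb")
  }

  # Run the cleaning script; it is assumed to define a clean() function
  # that takes the path to the raw file and returns a cleaned data frame
  script_env <- new.env()
  source(recipe$cleaning_script, local = script_env)
  cleaned <- script_env$clean(raw_file)

  # If flagged for caching, keep the recipe, raw data, script, and output together
  if (cache) {
    saveRDS(cleaned, file.path(cache_dir, "processed.rds"))
    file.copy(c(recipe_path, recipe$cleaning_script), cache_dir, overwrite = TRUE)
  }

  cleaned
}
```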
If we get something like that, I would almost certainly contribute recipes.
Just had a pie-in-the-sky thought: it would be interesting if we could create a really simple website where someone hosting a data set could fill out a form with the relevant metadata and instructions for downloading the data set.
On submission of the web form, the recipe would be generated and a pull request initiated.
This would make it really easy to contribute new recipes.
If someone has the chance to work on the implementation, we could probably arrange server space with rOpenGov.
Sounds good. This can be something to work on in #12.
In terms of repository structure, I think it would be beneficial to split each data source into separate files. The idea would be to create a standardized "recipe" format that includes all the info about the dataset (e.g., download location, BibTeX citation, name of the cleaning script, date last updated), plus a cleaning script that does all the magic we need.
I use something like that locally: a YAML file that specifies all the info and an accompanying Python script that does the cleaning.
This makes user contributions very easy: they just copy and paste another "recipe" and include an R script that does the cleaning. The only thing psData has to do is provide a proper API to parse the recipe, download the data, and run the cleaning script.
Think of something like the Homebrew installer for the Mac and its library of "formulas":
https://github.com/Homebrew/homebrew/tree/master/Library/Formula
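For concreteness, a recipe could be a small YAML file along these lines (the field names are only illustrative, and the URL, citation, and date are placeholders):

```yaml
# Hypothetical recipe layout; field names, URL, and date are placeholders.
name: database_political_institutions
download_url: "http://example.org/dpi.csv"
citation: "@misc{dpi, title = {Database of Political Institutions}, year = {2012}}"
cleaning_script: database_political_institutions.R
last_updated: 2014-01-01
```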