Interesting concept.
I think the biggest issue I see with this concept is the `<file>.R`:
- I think every dataset is unique in its own dirty way. We might be able to build some canonical scripts for public data sources, but at that point we may as well save people the hassle and make it into a dataset package, as the data manipulation wouldn't matter to most consumers of the data.
If you are going to have a metadata file you might as well use DESCRIPTION and make them into real packages, IMHO; then they can be easily installed with already-available tools. They don't need to pass R CMD check or necessarily have function-level documentation.
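For illustration, a minimal DESCRIPTION for such a package might look like this (the package name and all field values are placeholders, not from any real repo):

```
Package: somedataset
Title: Cleaned Copy of Some Public Dataset
Version: 0.1.0
Authors@R: person("First", "Last", email = "first.last@example.com", role = c("aut", "cre"))
Description: A downloaded and tidied copy of a public dataset, ready to load.
License: CC0
Depends: R (>= 3.0.0)
LazyData: true
```

With just that, `devtools::install_github()` and friends work out of the box, which is the "already-available tools" point.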
This idea brings to mind a couple of existing services that could be used for inspiration or extended:
Morph.io, formerly ScraperWiki, is a directory of web scrapers written in various languages. It has facilities to automate running scrapers to build public datasets. R is notably absent from the supported languages! The maintainers told me they are very open to getting R support up, though they need a hand.
Kaggle datasets can now have associated user-created R "kernels" - scripts that do munging and analysis. Kernels can be upvoted etc. to aid searching and filtering. Example: NBA player stats.
Both of these services organise scripts by data source, which I think would be a good way to start with this idea.
> If you are going to have a metadata file you might as well use DESCRIPTION
@jimhester I guess I wanted to keep it language agnostic (so someone could submit a Python or Julia script).
> every dataset is unique in its own dirty way.
True, but it seems like there'd be common things one would want to do to a given dataset.
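For instance, a lot of one-off scripts repeat the same few steps. A minimal sketch (the URL, sentinel value, and column names below are all made up):

```r
# hypothetical cleaning script: the source URL and columns are invented,
# but the steps (download, rename, recode NAs, coerce types) recur everywhere
url <- "https://example.com/some-dataset.csv"
raw <- read.csv(url, stringsAsFactors = FALSE)

# normalize column names to lower_snake_case
names(raw) <- gsub("[^a-z0-9]+", "_", tolower(names(raw)))

# recode a sentinel value to NA, then fix column types
raw[raw == -999] <- NA
raw$date <- as.Date(raw$date, format = "%Y-%m-%d")

# drop rows that are entirely missing and write out the tidied copy
clean <- raw[rowSums(!is.na(raw)) > 0, ]
write.csv(clean, "clean.csv", row.names = FALSE)
```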
> We might be able to build some canonical scripts for public data sources, but at that point we may as well save people the hassle and make it into a dataset package, as the data manipulation wouldn't matter to most consumers of the data.
@stephlocke Yeah, but there are endless datasets, and I don't think we'll make a package for each one - definitely for some, though, as is happening.
@MilesMcBain
> Morph.io, formerly ScraperWiki, is a directory of web scrapers written in various languages. It has facilities to automate running scrapers to build public datasets. R is notably absent from the supported languages! The maintainers told me they are very open to getting R support up, though they need a hand.
Right, have heard of and used it a bit before. I'd lean a little toward an approach that isn't dependent on a company for long-term persistence, though.
> Kaggle datasets can now have associated user-created R "kernels" - scripts that do munging and analysis. Kernels can be upvoted etc. to aid searching and filtering. Example: NBA player stats.
Cool, didn't know about that.
> Both of these services organise scripts by data source, which I think would be a good way to start with this idea.
Good idea to organize by data source.
This is an idea that's been floating around for a while, but I'm hoping to get some feedback on it - either to kill it for good, or maybe there'll be some interest. Happy either way.
The idea: there are a lot of scripts out there (probably mostly not on the web, but on people's computers) for downloading, cleaning, tidying, etc. datasets. A lot of these scripts are probably more or less duplicates, doing the same thing over and over. Maybe we can curate re-usable scripts for cleaning datasets. Each script could be a separate GitHub repo, and we could collect metadata about them Julia-style, with a metadata repo.
Each repo would need a set of files, e.g.,
- `README.md` - any info important to understanding the files
- `<file>.R` - the code for cleaning
- `<file>.json` - metadata
- `<file>.csv` - a stub file for testing on CI to make sure the script gives what's expected
- `.travis.yml` - to run the script on CI if possible

Started doing stuff (just stubs): https://github.com/openscriptsorg/
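As a sketch of how the `<file>.json` metadata and the CI stub check could work (the field names and file names below are invented, not a settled spec):

```r
# write metadata for one script repo; these fields are hypothetical
library(jsonlite)
meta <- list(
  title      = "Some public dataset",
  source_url = "https://example.com/some-dataset",
  language   = "R",
  script     = "clean.R",
  output     = "clean.csv",
  license    = "CC0"
)
write_json(meta, "metadata.json", auto_unbox = TRUE, pretty = TRUE)

# on CI, the stub check could just compare the script's output
# against the committed stub file
stopifnot(identical(read.csv("clean.csv"), read.csv("stub.csv")))
```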