Interesting concept.
I think the biggest issue I see with this concept is the `<file>.R`:
- I think every dataset is unique in its own dirty way. We might be able to build some canonical scripts for public data sources, but at that point we may as well save people the hassle and make it into a dataset package, as the data manipulation wouldn't matter to most consumers of the data.
If you are going to have a metadata file you might as well use DESCRIPTION and make them into real packages, IMHO; then they can be easily installed with already-available tools. They don't need to pass R CMD check or necessarily have function-level documentation.
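For illustration, a minimal DESCRIPTION for such a package might look like this (the package name and all field values are placeholders, not from any real repo):

```
Package: somedataset
Title: Cleaned Copy of Some Public Dataset
Version: 0.1.0
Authors@R: person("First", "Last", email = "first.last@example.com", role = c("aut", "cre"))
Description: A downloaded and tidied copy of a public dataset, ready to load.
License: CC0
Depends: R (>= 3.0.0)
LazyData: true
```

With just that, `devtools::install_github()` and friends work out of the box, which is the "already-available tools" point.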
This idea brings to mind a couple of existing services that could be used for inspiration or extended:
Morph.io, formerly ScraperWiki, is a directory of web scrapers written in various languages. It has facilities to automate running scrapers to build public datasets. R is notably absent from the supported languages! The maintainers told me they are very open to getting R support up, though they need a hand.
Kaggle datasets can now have associated user-created R "kernels" - scripts that do munging and analysis. Kernels can be upvoted etc. to aid searching and filtering. Example: NBA player stats.
Both of these services organise scripts by data source, which I think would be a good way to start with this idea.
> If you are going to have a metadata file you might as well use DESCRIPTION
@jimhester I guess I wanted to keep it language agnostic (so someone could submit a Python or Julia script).
> every dataset is unique in its own dirty way.
True, but it seems like there'd be common things one would want to do to a given dataset.
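For instance, a lot of one-off scripts repeat the same few steps. A minimal sketch (the URL, sentinel value, and column names below are all made up):

```r
# hypothetical cleaning script: the source URL and columns are invented,
# but the steps (download, rename, recode NAs, coerce types) recur everywhere
url <- "https://example.com/some-dataset.csv"
raw <- read.csv(url, stringsAsFactors = FALSE)

# normalize column names to lower_snake_case
names(raw) <- gsub("[^a-z0-9]+", "_", tolower(names(raw)))

# recode a sentinel value to NA, then fix column types
raw[raw == -999] <- NA
raw$date <- as.Date(raw$date, format = "%Y-%m-%d")

# drop rows that are entirely missing and write out the tidied copy
clean <- raw[rowSums(!is.na(raw)) > 0, ]
write.csv(clean, "clean.csv", row.names = FALSE)
```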
> We might be able to build some canonical scripts for public data sources, but at that point we may as well save people the hassle and make it into a dataset package, as the data manipulation wouldn't matter to most consumers of the data.
@stephlocke Yeah, but there are endless datasets, and I don't think we'll make a package for each one - definitely for some, though, as is happening.
@MilesMcBain
> Morph.io, formerly ScraperWiki, is a directory of web scrapers written in various languages. It has facilities to automate running scrapers to build public datasets. R is notably absent from the supported languages! The maintainers told me they are very open to getting R support up, though they need a hand.
Right, have heard of and used it a bit before. I'd lean a little toward an approach that isn't dependent on a company for long-term persistence, though.
> Kaggle datasets can now have associated user-created R "kernels" - scripts that do munging and analysis. Kernels can be upvoted etc. to aid searching and filtering. Example: NBA player stats.
Cool, didn't know about that.
> Both of these services organise scripts by data source, which I think would be a good way to start with this idea.
Good idea to organize by data source.
This is an idea that's been floating around for a while, but I'm hoping to get some feedback on it - either to kill it for good, or maybe there'll be some interest. Happy either way.
The idea: there are a lot of scripts out there (probably mostly not on the web, but on people's computers) for downloading, cleaning, tidying, etc. datasets. A lot of these scripts are probably more or less duplicates, doing the same thing over and over. Maybe we can curate re-usable scripts for cleaning datasets. Each script could be a separate GitHub repo, and we could collect metadata about them Julia-style, with a metadata repo.
Each repo would need a set of files, e.g.,
- `README.md` - any info important to understanding the files
- `<file>.R` - the code for cleaning
- `<file>.json` - metadata
- `<file>.csv` - a stub file for testing on CI to make sure the script gives what's expected
- `.travis.yml` - to run the script on CI if possible

Started doing stuff (just stubs): https://github.com/openscriptsorg/
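As a sketch of how the `<file>.json` metadata and the CI stub check could work (the field names and file names below are invented, not a settled spec):

```r
# write metadata for one script repo; these fields are hypothetical
library(jsonlite)
meta <- list(
  title      = "Some public dataset",
  source_url = "https://example.com/some-dataset",
  language   = "R",
  script     = "clean.R",
  output     = "clean.csv",
  license    = "CC0"
)
write_json(meta, "metadata.json", auto_unbox = TRUE, pretty = TRUE)

# on CI, the stub check could just compare the script's output
# against the committed stub file
stopifnot(identical(read.csv("clean.csv"), read.csv("stub.csv")))
```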