ropensci / rrrpkg

Use of an R package to facilitate reproducible research

Advice on easyMake package #12

Open gshotwell opened 8 years ago

gshotwell commented 8 years ago

I'm not sure if this is the right place for this, but I put together a very simple R package that detects dependencies between R files and then generates a Makefile. I think it could be a useful piece in getting less command-line-savvy R users to start using Makefiles, but it's still a long way from that goal. If you have any thoughts or advice, please let me know.

https://github.com/GShotwell/easyMake/tree/master
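
To make the "generates a Makefile" step concrete, here is a minimal sketch of turning a table of already-detected dependencies into Makefile text. The function `deps_to_makefile()`, the column names `target`, `prereq`, and `script`, and the file names are all hypothetical illustrations, not easyMake's actual interface.

```r
# Hypothetical sketch, not easyMake's API: render Makefile rules from a
# data frame with one row per (target, prerequisite, script) triple.
deps_to_makefile <- function(deps) {
  targets <- unique(deps$target)
  rules <- vapply(targets, function(tg) {
    rows    <- deps[deps$target == tg, , drop = FALSE]
    script  <- rows$script[1]
    # The script itself is a prerequisite: editing it should trigger a rebuild.
    prereqs <- unique(c(script, rows$prereq))
    # A rule is "target: prerequisites", then a tab-indented recipe line.
    sprintf("%s: %s\n\tRscript %s", tg, paste(prereqs, collapse = " "), script)
  }, character(1))
  # "all" lists every target so that a plain `make` builds everything.
  c(sprintf("all: %s", paste(targets, collapse = " ")), unname(rules))
}

# Made-up two-step pipeline: raw.csv -> clean.csv -> plot.png
deps <- data.frame(
  target = c("clean.csv", "plot.png"),
  prereq = c("raw.csv", "clean.csv"),
  script = c("clean.R", "plot.R"),
  stringsAsFactors = FALSE
)
makefile <- deps_to_makefile(deps)
cat(makefile, sep = "\n\n")
```

Writing the result out with `writeLines(makefile, "Makefile")` would give a file that `make` can read, assuming `Rscript` is on the path.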

sckott commented 8 years ago

@jennybc @benmarwick thoughts? Or you could try our discussion forum: https://discuss.ropensci.org/

benmarwick commented 8 years ago

Yes, this looks very interesting. I'm not much of a make/makefile user, myself. I've seen @cboettig making good use of them though, he might have a more informed opinion here. @GShotwell do you have any examples of this package used 'in the wild', in a research compendium, etc.?

sckott commented 8 years ago

we may want to bring in @richfitz given https://github.com/richfitz/remake

jennybc commented 8 years ago

When we get to automation in STAT 545 in a couple of weeks' time, I will invite the students to try this out. I could also test it on some of the demo projects we show them, i.e. see how close it comes to the existing Makefiles.

gshotwell commented 8 years ago

Thanks @jennybc , that would be very helpful. I've tried the dependency detection on some of my own work, but since that's the work I had in mind when I wrote the package it's unsurprising that it does okay on those projects.

@benmarwick I don't have any examples of it being used in the wild (the package is only about 4 days old), but if anyone can recommend some good testing projects, I'd be grateful.

I posted here because I'm trying to build the package around a model workflow for reproducible analysis, which I mostly cribbed from this repo. Right now easyMake assumes three big things about this workflow:

1) People will use explicit file names in their import and export statements. So they will write `read.csv("data.csv")` and not `name <- "data.csv"; read.csv(name)`.

2) A given script will not have the same names for both its imports and exports. If a script loads "data.csv" and edits it, it should save it as "data2.csv", not "data.csv". If you don't do this then you might end up with loops in an auto-created Makefile.

3) Scripts are pure in the sense that they only communicate with the project through their imports and exports, so you don't run a script in order to store something in memory for a subsequent script to operate on. Basically you should be able to put `rm(list = ls())` at the end of each script and not change your overall results.
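
Assumptions 1 and 2 can be illustrated with a rough sketch of what static detection can and cannot see. The helpers `find_io()` and `has_loop()` are hypothetical and deliberately naive (regex-based, `read.csv()`/`write.csv()` only, first match per line); they are not how easyMake is implemented.

```r
# Hypothetical helper, not part of easyMake: find literal file names in
# read.csv()/write.csv() calls by pattern-matching the script text.
find_io <- function(lines) {
  read_pat  <- 'read\\.csv\\(\\s*"([^"]+)"'
  write_pat <- 'write\\.csv\\([^,]+,\\s*(?:file\\s*=\\s*)?"([^"]+)"'
  grab <- function(pat) {
    m <- regmatches(lines, regexec(pat, lines, perl = TRUE))
    # Element 1 is the whole match, element 2 the captured file name.
    as.character(unlist(lapply(m, function(x) if (length(x) > 1) x[2])))
  }
  list(imports = grab(read_pat), exports = grab(write_pat))
}

# Assumption 2: a script that imports and exports the same file would
# make the target depend on itself in a generated Makefile.
has_loop <- function(io) length(intersect(io$imports, io$exports)) > 0

ok        <- c('dat <- read.csv("data.csv")', 'write.csv(dat, "data2.csv")')
indirect  <- c('name <- "data.csv"', 'dat <- read.csv(name)')
overwrite <- c('dat <- read.csv("data.csv")', 'write.csv(dat, "data.csv")')

io_ok        <- find_io(ok)         # both file names are found
io_indirect  <- find_io(indirect)   # import is missed: the name is a variable
io_overwrite <- find_io(overwrite)  # same file is both import and export

has_loop(io_ok)         # FALSE
has_loop(io_overwrite)  # TRUE
```

The `indirect` case shows why assumption 1 matters: without evaluating the script, a text-based detector has no way to know what `name` holds.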

All of the above means that running a script multiple times won't alter the results of the analysis. Do those seem like okay constraints?

jennybc commented 8 years ago

They sound reasonable to me.

Here are some small example pipelines. You could see if easyMake recreates these Makefiles:

https://github.com/STAT545-UBC/make-activity

https://github.com/STAT545-UBC/STAT545-UBC.github.io/tree/master/automation10_holding-area/02_automation-example_r-and-make

https://github.com/STAT545-UBC/STAT545-UBC.github.io/tree/master/automation10_holding-area/03_automation-example_render-without-rstudio