ropensci / unconf17

Website for 2017 rOpenSci Unconf
http://unconf17.ropensci.org

make using roxygen-like documentation for analysis directories #77

Open AliciaSchep opened 7 years ago

AliciaSchep commented 7 years ago

I've been thinking about whether it would be possible & useful to have roxygen-like tags for documenting the inputs and outputs of analysis scripts, which could be used to easily create a makefile when needed. This idea is closely related to the first part of thread #5, particularly the second comment (from @njtierney) about the struggle to go from exploratory analysis to something reproducible, and the subsequent discussion of make, but since that thread has moved on a bit into testing/CI/pkg issues I figured I'd start a new thread.

The idea would be that in a given R script (or R Markdown document) you might at some point read in inputs and at other points write outputs. You could tag inputs and outputs:

#' myfile.csv
#' A really cool data file!
#' @source coolwebsite.com
#' @input myfile.csv
mytable <- read_csv("myfile.csv")

myoutput <- do_stuff(mytable)

#' myoutput.rds
#' My awesome calculated result
#' @output myoutput.rds
saveRDS(myoutput, "myoutput.rds")

Then another script might have:

#' @input myoutput.rds
myinput <- readRDS("myoutput.rds")

Within the directory containing all these scripts, you could run a command that reads through all the scripts and their input and output files and creates a makefile. If there are any circular dependencies those would get flagged. The command would also create man pages for each input and output object, as well as an overall workflow documentation with a dependency graph linking to individual input/output documentation.
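For instance, assuming the two snippets above live in files named analysis1.R and analysis2.R (names invented purely for illustration), the generated makefile might look something like:

```make
# Hypothetical rules a generator might emit from the @input/@output tags;
# the script names analysis1.R and analysis2.R are invented for illustration.
myoutput.rds: analysis1.R myfile.csv
	Rscript analysis1.R

# analysis2.R tags myoutput.rds as an @input, so it depends on the rule above
analysis2.done: analysis2.R myoutput.rds
	Rscript analysis2.R && touch analysis2.done
```

The second script doesn't tag any output file, so a sentinel file (here `analysis2.done`) or a phony target would be needed to give make something to track.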

There is already an R package that automatically generates makefiles from R scripts -- easyMake. It tries to detect automatically when a file reads in an input or exports a file. I think roxygen-like tags would be a bit more flexible and transparent, as you could specify each input and output file explicitly rather than relying on every input and output function being recognized. A roxygen-like system would also enable better documentation of the workflow and its inputs/outputs than just the makefile or a dependency graph of filenames.

Perhaps rather than creating a new roxygen-like system, roxygen itself could also be adapted for this purpose?

stephlocke commented 7 years ago

I like the idea of enhanced metadata & documentation for my work.

bzkrouse commented 7 years ago

Nice idea! I'm also interested in giving more attention to the struggle of organizing and keeping track of exploratory analyses. The concept of collecting metadata on analyses was also discussed in #23, although with more emphasis on collecting information about results.

MilesMcBain commented 7 years ago

I only just noticed this issue in the midst of cleaning up mine. I think what you're describing here is a REALLY great idea. How about a name: makedown? 😉

hadley commented 7 years ago

I like this idea, but I think generating a makefile will be error-prone. It will be more robust (if more work) to manage the dependency graph in R itself.

AliciaSchep commented 7 years ago

Thanks @bzkrouse for linking this to thread #23; I hadn't read through that one yet, and some of the goals are certainly shared, although I think this idea is more limited in scope. Compared to some of the fairly comprehensive systems discussed in that thread, the idea here is for something fairly minimal and very easy to incorporate into existing script-based analyses.

@MilesMcBain makedown sounds like a great name! Even if ultimately make itself isn't actually used...

As for using make versus managing things in R itself, I think the main benefit of using make is less work :grin: Although perhaps generating the makefile in a reliable way may prove harder than I am anticipating...

hadley commented 7 years ago

Generating the makefile will allow you to get a quick proof of concept up and running, and that's a great goal for the unconf. However, code generation in general is hard, and having the dependency graph in another environment means you can't do cool visualisations in R, etc.

hadley commented 7 years ago

Another thing worth considering is whether you could automatically detect inputs/outputs in many common situations -- e.g. in your example above, you could parse the file, detect the read_csv() and saveRDS() calls, and automatically generate the input/output annotations. You'd still need manual annotations for non-standard functions, but you might be able to give people a fairly comprehensive solution for free.
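As a rough sketch of that detection step, base R's parser makes finding the calls fairly approachable (the reader/writer lists below are just examples, not a complete set):

```r
# Sketch only: find calls to a few known reader/writer functions in a
# script by scanning its parse data. Mapping each call back to the file
# path it reads/writes (the string literal argument) would be the next,
# harder step.
readers <- c("read.csv", "read_csv", "readRDS")
writers <- c("write.csv", "saveRDS", "ggsave")

detect_io <- function(path) {
  pd <- utils::getParseData(parse(path, keep.source = TRUE))
  calls <- pd$text[pd$token == "SYMBOL_FUNCTION_CALL"]
  list(inputs  = intersect(calls, readers),
       outputs = intersect(calls, writers))
}
```

Running `detect_io()` on the first example script above would flag `read_csv` as a reader and `saveRDS` as a writer; anything not in the lists would still need a manual annotation.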

hadley commented 7 years ago

It would also be handy to be able to have this work inline, although you'd need some way to represent that the output includes new R objects:

if_needed(
  input = c(object("types"), "my_csv.csv"),
  output = c(object("df"), "my_plot.pdf"),
  {
    df <- read_csv("my_csv.csv", col_types = types)
    ggplot(df, aes(x, y)) + geom_point()
    ggsave("my_plot.pdf")
  }
)

And in that case you could determine the inputs and outputs from the code, so you could just write:

if_needed({
  df <- read_csv("my_csv.csv", col_types = types)
  ggplot(df, aes(x, y)) + geom_point()
  ggsave("my_plot.pdf")
})

hadley commented 7 years ago

I hope you don't mind but I've taken your basic idea and run with it: https://docs.google.com/document/d/1avYAqjTS7zSZn7JAAOZhFPkhkPvYwaPVrSpo31Cu0Yc/edit#. I'd love your thoughts!

AliciaSchep commented 7 years ago

Definitely don't mind, it looks great! In terms of my original idea, there were two related goals: one was linking dependencies across R files (without having to create your own makefile), and the other was enabling documentation of inputs and outputs so as to be able to create a documented dependency graph. The lazyr proposal seems like a great solution for the first goal but doesn't necessarily help with the second, although perhaps those goals shouldn't have been linked anyway.

coatless commented 7 years ago

This feels like the merging of CodeDepends and YesWorkflow / Live Demo, which would be very useful.