richfitz / remake

Make-like declarative workflows in R

Tabular dependencies #56

Open jtbates opened 9 years ago

jtbates commented 9 years ago

I would like to define a map-like dependency so that the map function is only called on new input.

As an example, say we have code to get some tabular data:

  load_table <- function(i) data.frame(idx=1:i, lower=letters[1:i])

and we have dependencies that perform a calculation on each row:

input: load_table(4)
output: df %>% transmute(upper=toupper(lower))

I'd like to be able to specify this dependency so that if the input changes

input: load_table(5)

our transformation is only performed on the new rows (only one call to toupper with "e" as input in this example). Is there a way to accomplish this with remake?

richfitz commented 9 years ago

Thanks for the suggestion. This is something I want too, especially as this will become really powerful when remake supports parallel execution (it's in the pipeline).

The basic idea I have to support this, which I had not remembered to push, is on the lists branch.

The current interface (which is up for grabs) looks like this.

The idea is that an object can be stored as a list by declaring a target like:

  mylist:
    command: make_list()
    list: true

The list: true flag changes how the object is stored internally in remake's database (see the storr help for what is going on here, but we store the object plus all its constituent parts in a reasonably efficient way).

Then to use the list, do this:

  mylist2:
    command: length(each(mylist))

(which will save the output as a list object too). If mylist is updated then mylist2 only recomputes the elements that have changed. Note that the each function is not a real R function. Instead it basically compiles to mylist2[i] <- lapply(mylist[i], length), where i is the set of changed elements. If the each function is not used then lists behave like normal objects:

  notlist:
    command: length(mylist)
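To make the each semantics above concrete, here is a minimal base R sketch of "recompute only the changed elements". It is an illustration only: update_each and counted_length are hypothetical names, not part of remake, and the comparison uses identical() rather than whatever hashing remake and storr do internally. It also assumes the list keeps the same length between builds.

```r
# Hypothetical helper, not remake's API: recompute f() only for the
# elements of `new` that differ from the same position in `old`.
update_each <- function(new, old, old_result, f) {
  stopifnot(length(new) == length(old), length(old) == length(old_result))
  same <- vapply(seq_along(new),
                 function(i) identical(new[[i]], old[[i]]),
                 logical(1))
  changed <- !same
  result <- old_result
  # Mirrors mylist2[i] <- lapply(mylist[i], length) for the changed set i
  result[changed] <- lapply(new[changed], f)
  result
}

# Wrap length() so we can count how often it is actually called
calls <- 0L
counted_length <- function(x) {
  calls <<- calls + 1L
  length(x)
}

mylist <- list(a = 1:3, b = letters, c = 1:10)
mylist2 <- lapply(mylist, counted_length)  # initial build, one call per element

updated <- mylist
updated$b <- letters[1:5]                  # only element "b" changes

calls <- 0L
mylist2 <- update_each(updated, mylist, mylist2, counted_length)
```

After the update, calls should be 1 and only mylist2$b should have been recomputed.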

This is pretty basic. It only works on lists and only returns lists. So no row-by-row data.frame processing, no returning simple vectors, no looping over matrices. No equivalent of mapply, though that would be fairly easy to implement. No support for dealing with files (e.g. loop over files in a directory), but that's more related to #2. I think, but have not checked, that changing the length of the list might invalidate the cache and require rebuilding everything, but if that's important I could set it up to try and be more clever.

On the other hand, most of that could be implemented fairly easily (e.g., an eachrow function could signal row-by-row processing but we'd also need a flag to indicate if the final object should be a list or another data.frame). For now it'd be easy enough to add an intermediate step that does a df-to-list transformation including list: true and then proceed from there.
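The intermediate df-to-list step could look roughly like the following in plain R, using the load_table example from the top of the thread. The helper names df_to_rows and rows_to_df are made up for this sketch, and the lapply call only stands in for what an each-style target would do. In a remakefile, the output of df_to_rows would be the target declared with list: true.

```r
load_table <- function(i) data.frame(idx = 1:i, lower = letters[1:i])

# Hypothetical helper: one single-row data.frame per list element
df_to_rows <- function(df) unname(split(df, seq_len(nrow(df))))

# Hypothetical helper: bind the per-row results back into one data.frame
rows_to_df <- function(rows) {
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

rows <- df_to_rows(load_table(4))

# Stand-in for the element-wise step applied to each row
upper <- lapply(rows, function(r) data.frame(upper = toupper(r$lower)))

result <- rows_to_df(upper)
```

Going through a list of one-row data.frames is not fast for large tables, but it keeps each row as a separate element that could be cached on its own.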

jtbates commented 9 years ago

> I think, but have not checked, that changing the length of the list might invalidate the cache and require rebuilding everything, but if that's important I could set it up to try and be more clever.

When I tested it out, that is what I found - changing the length of the list does invalidate the cache. It would be super useful for me if changing the length of the list was supported. My use case is that I have new data coming in all the time and I'd like to avoid recomputing calculations on the old data as long as the function hasn't changed. Let me know if I can help!
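One way the growing-list case might be handled, sketched in base R outside of remake: key each element by name instead of position, reuse the stored result for any key whose input is unchanged, and call the function only for new or changed keys. Everything here (update_by_key, to_upper, using the row index as the key) is an assumption for illustration, not how the lists branch works.

```r
# Hypothetical helper, not remake's API: results are reused per key,
# so appending elements does not force a full rebuild.
update_by_key <- function(new, old, old_result, f) {
  stopifnot(!is.null(names(new)))
  reusable <- vapply(names(new), function(k) {
    k %in% names(old) && identical(new[[k]], old[[k]])
  }, logical(1))
  result <- vector("list", length(new))
  names(result) <- names(new)
  result[reusable] <- old_result[names(new)[reusable]]
  result[!reusable] <- lapply(new[!reusable], f)
  result
}

# Wrap toupper() so we can count how often it is actually called
calls <- 0L
to_upper <- function(x) {
  calls <<- calls + 1L
  toupper(x)
}

old <- as.list(setNames(letters[1:4], 1:4))  # keys "1".."4"
old_result <- lapply(old, to_upper)          # initial build, 4 calls

new <- as.list(setNames(letters[1:5], 1:5))  # one new element, key "5"

calls <- 0L
new_result <- update_by_key(new, old, old_result, to_upper)
```

With this input, to_upper should be called once, for the new element "e". The sketch does not detect a change to the function itself, which a real implementation would also need to track.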

richfitz commented 8 years ago

Update on a very old thread:

I gave up on the approach I started here as it was getting really complicated. Once we get the package tidied up, I'd like to give this another go, but it's a pretty major bit of surgery.

It's possible that whatever the solution to #70 turns out to be might help here.