r-lib / rlang

Low-level API for programming with R
https://rlang.r-lib.org

How to access (and modify) the innermost/first argument in an expression/pipeline? #1606

Closed · orgadish closed 1 year ago

orgadish commented 1 year ago

I have a couple of scenarios where I want a function that takes an R expression (pipeline), modifies the innermost (first) element, and then runs the rest of the expression/pipeline. For example, I wrote a package, cachedread, which can take a read_csv(x) |> ... pipeline and automatically cache the result so that subsequent reads come from the cache file.

The most straightforward way I've found to do this is to use functional programming and take the input and the rest of the pipeline as separate arguments: my_function <- function(.x, .f) .f(modify_input(.x)).

However, it would be nice if the function could be added to the end of an existing pipeline, i.e., take the whole expression and parse the first element out from the rest.
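For concreteness (readr and janitor are just illustrative packages here, and modify_input is a placeholder):

# Current, functional form:
my_function("data.csv", \(.x) readr::read_csv(.x) |> janitor::clean_names())

# Desired form, appended to the end of the pipeline:
readr::read_csv("data.csv") |> janitor::clean_names() |> my_function()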

I've hacked together a solution (below) that recurses down the expression using rlang::call_fn and rlang::call_args, and then rebuilds it on the way back up with rlang::call2, but it feels like there are probably existing rlang (or other) functions that do this better?

# Split a single call into its function, its first (data) argument, and
# the remaining arguments. Note that call_standardise() and call_fn()
# are deprecated in recent rlang releases.
separate_call <- function(call) {
  std_call <- rlang::call_standardise(call)

  .fn <- rlang::call_fn(std_call)
  .args <- rlang::call_args(std_call)
  .first_arg <- .args[[1]]
  # Drop the first (data) argument; `[-1]` also handles single-argument
  # calls, unlike `2:length(.args)`.
  .non_data_args <- .args[-1]

  list(
    fn = .fn,
    first = .first_arg,
    args = .non_data_args
  )
}

# Walk down a nested call, peeling off one step at a time, until the
# innermost (non-call) input is reached.
separate_pipeline <- function(call) {
  stopifnot(rlang::is_call(call))

  # Placeholder for innermost input argument.
  input <- NULL

  prepend <- function(x, values) {
    append(x, values, after=0)
  }

  pipeline <- list()
  remaining_call <- call
  while (TRUE) {
    # A non-call (symbol or literal) is the innermost input; stop there.
    if (!rlang::is_call_simple(remaining_call)) {
      input <- remaining_call
      break
    }

    res <- separate_call(remaining_call)

    # Pass on the first argument
    remaining_call <- res$first

    # Remove the first argument from the list and prepend to the pipeline.
    res$first <- NULL
    pipeline <- prepend(pipeline, list(res))
  }
  list(input=input, pipeline=pipeline)
}

# Rebuild the nested call from the innermost input outwards.
reunite_pipeline <- function(input, pipeline) {
  call <- input
  for(step in pipeline) {
    call <- rlang::call2(step$fn, call, !!!step$args)
  }

  return(call)
}

my_function <- function(expr) {
  quo <- rlang::enquo(expr)
  # Work on the bare expression, but keep the quosure's environment so
  # the innermost input is evaluated where the user wrote it.
  out <- separate_pipeline(rlang::quo_get_expr(quo))
  new_input <- modify_input(
    rlang::eval_tidy(out$input, env = rlang::quo_get_env(quo))
  )
  # Inline the modified input value directly so the rebuilt call does
  # not depend on this function's local frame.
  reunite_pipeline(new_input, out$pipeline)
}
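For reference, a quick round trip with these helpers (illustrative; the base |> pipe is expanded at parse time, so the quoted expression below is really head(read.csv("data.csv"), 10)):

parts <- separate_pipeline(quote(read.csv("data.csv") |> head(10)))
parts$input
#> [1] "data.csv"

# Rebuilds head(read.csv("data.csv"), 10), with the functions inlined
# by call2(); eval(rebuilt) runs the original pipeline.
rebuilt <- reunite_pipeline(parts$input, parts$pipeline)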
lionel- commented 1 year ago

I think the sort of thing you're trying to do is likely to be brittle and to cause surprising behaviour in some cases (e.g. the magrittr pipe vs the base pipe). I would not recommend metaprogramming here.

I don't really understand what you're trying to achieve. I'm imagining something like memoise::memoise(function(x) x |> foo()) but I'm not sure.

I'm closing this issue because the tracker is meant for bugs and feature requests, but feel free to continue the discussion.

orgadish commented 1 year ago

I didn't know about memoise::memoise -- thanks! One of the main things I was trying to get with my cachedread utility is that it decides whether to re-read the input files based on whether they have been modified since they were cached. Right now I do this with cached_read(files, read_fn), where read_fn can be, for example, readr::read_csv, or an entire pipeline written as a function, \(.x) readr::read_csv(.x) |> janitor::clean_names(), so that the output of the whole pipeline is cached. To offer this functionality at the end of a pipeline instead (e.g. readr::read_csv(.x) |> janitor::clean_names() |> cachedread::use_caching()), while still checking the modification times of the input files, I have to be able to access that first element (files).
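For illustration, the core of that check is roughly this (a minimal sketch, not the actual cachedread implementation; cache format and path handling are omitted):

cached_read <- function(files, read_fn, cache = "cache.rds") {
  # Re-read only when there is no cache yet, or when any input file has
  # been modified more recently than the cache was written.
  if (file.exists(cache) && all(file.mtime(files) <= file.mtime(cache))) {
    return(readRDS(cache))
  }
  out <- read_fn(files)
  saveRDS(out, cache)
  out
}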

Another use case I'm working on is a function that converts a dplyr pipeline into a dtplyr one just by adding a line at the end (rather than having to add lazy_dt at the beginning and collect at the end). I often want to quickly test, or benchmark, whether the code I'm running would be faster with dtplyr. In these cases I need access to the original df so I can replace it with lazy_dt(df) in the pipeline.
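Written out by hand, the conversion looks like this (df and the verbs are illustrative; requires dplyr and dtplyr):

library(dplyr)

# What I write today, wrapping the pipeline at both ends:
df |>
  dtplyr::lazy_dt() |>
  filter(x > 0) |>
  summarise(m = mean(y)) |>
  as_tibble()

# What I'd like: df |> filter(x > 0) |> summarise(m = mean(y)) |> use_dtplyr(),
# where use_dtplyr() is the hypothetical helper described above.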

It would be great if there were an rlang::call_structure(), for example, which broke a function call down into a series of calls and then allowed metaprogramming on that structure.

lionel- commented 1 year ago

Rerunning computations after a change in input resources seems like a good job for https://github.com/ropensci/targets
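For example, a minimal _targets.R along those lines might look like this (a sketch; the file name is illustrative and the targets package is required):

library(targets)

list(
  # format = "file" makes targets track the file itself, so downstream
  # targets rerun only when its contents change.
  tar_target(raw_file, "data.csv", format = "file"),
  tar_target(data, readr::read_csv(raw_file))
)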

Unfortunately call_structure() feels out of scope for rlang, sorry.