tidyverse / reprex

Render bits of R code for sharing, e.g., on GitHub or StackOverflow.
https://reprex.tidyverse.org

Allow user to specify variables to be included as dput #7

Closed dgrtwo closed 7 years ago

dgrtwo commented 9 years ago

One common thing users forget to do in a reproducible example is to share the variables they've already defined (for example, datasets they're working with). You can turn a variable into a reproducible chunk of code with dput.

If we already have a few variables defined, it would be good to be able to add them to the start of the reprex code automatically. Here's a hello-world version using lazy_eval:

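define_vars <- function(...) {
  # Capture the arguments unevaluated, together with their environments
  vs <- lazyeval::lazy_dots(...)

  # The text of each argument's expression serves as the variable name
  varnames <- unname(sapply(vs, function(v) as.character(v$expr)))

  # Swap each expression for capture.output(dput(<expr>)) and evaluate it,
  # giving the dput() text (one string per printed line)
  dputs <- sapply(vs, function(v) {
    v$expr <- substitute(capture.output(dput(x)), list(x = v$expr))
    lazyeval::lazy_eval(v)
  })

  paste(varnames, dputs, sep = " <- ")
}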

This takes one or more variables as its arguments and turns them into code that defines those variables. For example:

a <- 1:5
d <- data.frame(a = 1:5, b = 2)

varcode <- define_vars(a, d)

# turns it into code
cat(varcode, sep = "\n")
#> a <- 1:5
#> d <- c("structure(list(a = 1:5, b = c(2, 2, 2, 2, 2)), .Names = c(\"a\", ", "\"b\"), row.names = c(NA, -5L), class = \"data.frame\")")
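The second output line above shows what happens when dput() prints more than one line: the pieces end up wrapped in a c("...") string instead of forming runnable code. A minimal base-R sketch of one way around that, joining the printed lines before pasting. The name define_vars2 is hypothetical and not part of reprex or lazyeval:

```r
# Hypothetical base-R variant of define_vars(); a sketch, not reprex's implementation.
define_vars2 <- function(...) {
  env <- parent.frame()                          # where the variables live
  # Capture the unevaluated arguments, e.g. the symbols `a` and `d`
  exprs <- as.list(substitute(list(...)))[-1]
  varnames <- vapply(exprs, function(e) paste(deparse(e), collapse = ""),
                     character(1))

  dputs <- vapply(exprs, function(e) {
    # dput() may print several lines; join them into one string per variable
    paste(capture.output(dput(eval(e, env))), collapse = "\n")
  }, character(1))

  paste(varnames, dputs, sep = " <- ")
}

a <- 1:5
d <- data.frame(a = 1:5, b = 2)
varcode <- define_vars2(a, d)
cat(varcode, sep = "\n")
```

Each element of varcode is then a complete assignment, so the result can be parsed and evaluated as ordinary R code.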

In the final version of reprex, I imagine having a vars function, which creates an argument suitable for passing in:

y <- 1:100
reprex({y + 2}, vars = vars(y))
# even though it wasn't included in the reprex, resulting code would start with
# y <- 1:100

Or just wrap with a list (though that takes a little expression-parsing internally to get the names out):

reprex({y + 2}, vars = list(y))

If you like this feature idea I'll add it to the next PR.

(Fun fact; after I finished this example I used reprex to share it. Works like a charm!)

romainfrancois commented 9 years ago

Can't it somehow automatically deduce that y is a variable whose definition has to be included in the reprex?

dgrtwo commented 9 years ago

@romainfrancois That's a terrific idea. Unfortunately I can't think of an approach that would do that, at least for longer and more detailed reproducible examples. (Outside of something silly like "run the reprex again and again, and each time it gets an exception for a variable not found, add that variable".)

Do you have any ideas? Maybe something like creating a lazy version of the user's environment, evaluating the expression, and seeing which variables were retrieved?
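The "lazy version of the user's environment" idea could be sketched with active bindings, which run a function every time a variable is read. This is only an illustration under assumptions: find_used_vars is a hypothetical name, it handles reads only (assigning to a tracked name inside the expression would fail), and it has to run the code to learn anything.

```r
# Hypothetical helper: evaluate `expr` and record which variables of `env` it reads.
find_used_vars <- function(expr, env = parent.frame()) {
  force(env)
  used <- character()
  # Stand-in environment: same parent as `env`, so functions resolve as usual
  tracker <- new.env(parent = parent.env(env))
  for (nm in ls(env, all.names = TRUE)) {
    local({
      name <- nm
      # Reading `name` in tracker logs the access, then returns the real value
      makeActiveBinding(name, function() {
        used <<- union(used, name)
        get(name, envir = env)
      }, tracker)
    })
  }
  eval(expr, tracker)
  used
}

e <- new.env()
e$y <- 1:100
e$unused <- "never read"
find_used_vars(quote(y + 2), e)
#> [1] "y"
```

Because the expression is actually evaluated, a column name used through NSE is resolved inside the data frame and never touches the tracked bindings.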

romainfrancois commented 9 years ago

In a way, it shares some properties with what we do in dplyr's hybrid evaluation.

It's a lot of massaging the expression internally, but it's not the most bulletproof part of it, and indeed it gets harder with examples beyond hello world.

dgrtwo commented 9 years ago

Especially consider reprexes that contain dplyr or ggplot2 code with NSE. Some names there look like undefined variables, but they refer to columns and don't need to be defined.

I think a robust solution would have to involve running the code and not just inspecting it.

jennybc commented 9 years ago

(Fun fact; after I finished this example I used reprex to share it. Works like a charm!)

Yay this!

The discovery and reverse engineering of required objects seems really cool ... and hard. I think the main function should remain fairly dumb on this point, to keep things simple. At the risk of being paternalistic, I also think it's healthy to make the user keep editing the code until they've created a proper reprex.

I'd say any efforts to save users from themselves should go in a distinct function.

jennybc commented 9 years ago

@mrdwab, from the overflow package, reached out in #9. This thread and the sodput() function got me thinking. I could imagine having a mechanism for the local code to include something like reprex_dput(<R code to create an object, not to be shared with the world>) and only the dput() command would appear in the result. Does that have value?

dgrtwo commented 9 years ago

Hmm, I think writing the code to create an object is the actual pain point. The problem is not so much that you don't want to share that code as that you don't want to write it. Many users already have a variable x or y or whatever, have run into a problem, and want to share those variables along with it.

I do think someone could include in the reprex code

reprex_vars(x, y, z)

And then it would do something similar to my define_vars function above. (You'd have to be smart about letting reprex_vars work with the reprexer's environment while everything else runs in a clean environment. See #11.)
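A rough sketch of that split, assuming a hypothetical helper run_in_clean_env (reprex itself may handle this differently): the listed variables are serialised while the caller's environment is still visible, and the definitions plus the code are then evaluated somewhere that cannot see the rest of the session.

```r
# Hypothetical helper, for illustration only.
run_in_clean_env <- function(code, var_names, from = parent.frame()) {
  force(from)
  # Step 1: turn each requested variable into "name <- <deparsed value>"
  defs <- vapply(var_names, function(nm) {
    paste(nm, "<-", paste(deparse(get(nm, envir = from)), collapse = "\n"))
  }, character(1))

  # Step 2: evaluate in an environment that sees attached packages
  # but none of the variables in the global workspace
  clean <- new.env(parent = parent.env(globalenv()))
  eval(parse(text = c(defs, code)), envir = clean)
}

x <- c(1, 2, 3)
hidden <- 10
run_in_clean_env("sum(x)", "x")        # works: x was listed
# run_in_clean_env("hidden + 1", "x")  # would fail: hidden was not listed
```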

klmr commented 9 years ago

@dgrtwo Actually finding the dependencies automatically isn't that hard. At least for objects, it's a simple matter of iterating through the parse tree of the expression passed to reprex and collecting all variables. And R already has a function to do this: all.vars (it's a tad more complicated than that, of course).

This doesn’t work for all objects though because not all can be correctly reproduced via dput. In addition, this would require collecting used, attached packages; used R options; and hidden global variables (such as the random number seed). In fact, I think the cache option in knitr has to do something similar and the respective code could probably be stolen.

However, I understand @jennybc’s concerns. Still, this would be interesting as a separate function/package.
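For reference, a short illustration of what base R's all.vars() returns for an unevaluated expression. It works purely on the parse tree, so it cannot distinguish a variable from a column name used through NSE:

```r
# Plain expression: the only symbol is a real variable
all.vars(quote(y + 2))
#> [1] "y"

# NSE: `num` is a column of `x`, but the parse tree alone does not say so
all.vars(quote(subset(x, num > 5)))
#> [1] "x"   "num"

# Function names are left out by default and included only on request
all.vars(quote(f(y) + 2), functions = TRUE)
```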

dgrtwo commented 9 years ago

@klmr But that approach wouldn't work for non-standard evaluation, as in ggplot2's aes or dplyr (or data.table, etc). We could fix some of it by saying "use only vars that are present in the caller's environment," though that could still lead to false positives if a column name matches an environment variable.

(User-defined functions would also be missed, though that's not a big deal)
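A small sketch of the "only vars present in the caller's environment" filter, including the false positive described above. The name candidate_vars is hypothetical:

```r
# Hypothetical helper: symbols in `expr` that are also bound directly in `env`.
candidate_vars <- function(expr, env = parent.frame()) {
  force(env)
  nms <- all.vars(expr)
  nms[vapply(nms, exists, logical(1), envir = env, inherits = FALSE)]
}

e <- new.env()
e$x <- data.frame(num = 1:10)
candidate_vars(quote(subset(x, num > 5)), e)
#> [1] "x"

# An unrelated variable that happens to share the column's name
e$num <- 3
candidate_vars(quote(subset(x, num > 5)), e)
#> [1] "x"   "num"
```

In the second call `num` would be dput into the reprex even though the code never uses the environment's copy.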

klmr commented 9 years ago

I hadn’t considered NSE, true. Functions aren’t a problem: all.vars can handle them.

dgrtwo commented 9 years ago

Ah, I had been thinking about how if we included functions it would also try to include functions from packages. But now that I think of it, for each function name in the caller's environment, we could get it and see whether it's attached from a package.
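One possible way to make that check, sketched with base R only: look at the environment a function was defined in and see whether it is a package namespace. The helper name fun_origin and its return labels are made up for this example.

```r
# Hypothetical helper: report where the function called `name` comes from.
fun_origin <- function(name, env = parent.frame()) {
  force(env)
  f <- get(name, envir = env, mode = "function")
  fenv <- environment(f)
  if (is.null(fenv)) return("base")   # primitives such as sum() have no environment
  if (isNamespace(fenv)) return(unname(getNamespaceName(fenv)))
  "user-defined"                      # needs to be shipped with the reprex
}

square <- function(x) x^2
fun_origin("square")   # "user-defined"
fun_origin("sd")       # "stats"
fun_origin("sum")      # "base"
```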

I'm coming around to the idea of auto-finding the variables, though I think we'd need to:

  1. Show a message about variables that are used but aren't found in the environment, though default to assuming they're NSE
  2. Build an explicit exception for variables within some functions, especially ~ and aes, that commonly contain NSE. dplyr is harder (we could include a list of dplyr's NSE functions, but it's a bit odd to be so specific), and data.table would be a real drag.
  3. Give the option of turning it off, in case people run into situations we haven't considered, at which point people could specify what variables to define explicitly (something like reprex({y + 2}, vars = list(y)))
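
The three points above could fit together roughly as follows. This is a sketch under assumptions, not the code that was later posted: nse_funs, vars_outside_nse and guess_vars are hypothetical names, and the list of NSE functions is deliberately tiny.

```r
# Functions whose arguments are assumed to use NSE (illustrative, incomplete)
nse_funs <- c("~", "aes")

# Collect symbols used as variables, skipping anything inside an NSE call.
# Limitation of this sketch: empty arguments, as in x[1, ], are not handled.
vars_outside_nse <- function(expr) {
  if (is.name(expr)) return(as.character(expr))
  if (!is.call(expr)) return(character())          # constants such as 2 or "a"
  fn <- expr[[1]]
  if (is.name(fn) && as.character(fn) %in% nse_funs) return(character())
  args <- as.list(expr)[-1]                        # drop the function position
  unique(as.character(unlist(lapply(args, vars_outside_nse))))
}

guess_vars <- function(expr, env = parent.frame(), auto = TRUE,
                       vars = character()) {
  force(env)
  if (!auto) return(vars)                          # point 3: explicit list wins
  found <- vars_outside_nse(expr)                  # point 2: skip ~ and aes
  defined <- vapply(found, exists, logical(1), envir = env, inherits = FALSE)
  if (any(!defined)) {                             # point 1: report, assume NSE
    message("Not found, assuming NSE: ",
            paste(found[!defined], collapse = ", "))
  }
  found[defined]
}

e <- new.env()
e$d <- data.frame(a = 1:5, b = 2)
guess_vars(quote(lm(b ~ a, data = d)), e)   # formula contents are skipped
guess_vars(quote(subset(d, a > 2)), e)      # messages that `a` was not found
```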

dgrtwo commented 9 years ago

Made some progress on parsing code, and figuring out from it what variables need to be dputted and what packages need to be loaded. I think we can cover about 95% of reproducible examples this way.

Posting it here for now, more work later.

dgrtwo commented 9 years ago

(Also set up the start of a package here to handle the logic of parsing expressions into a data frame.)

jimhester commented 9 years ago

For detecting global variables in an expression you may want to look at codetools::findGlobals() and the pryr::f implementation, see https://github.com/hadley/pryr/blob/87bec0fede7da6f42a0c65b424f3ba33743fb8fd/R/f.r#L21-L23.
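As a quick illustration of codetools::findGlobals() (codetools ships with R as a recommended package), here it is applied to a small closure. With merge = FALSE it returns global variables and called functions separately:

```r
f <- function() {
  z <- y + 2      # `y` is global to f, `z` is local
  mean(z)
}

globals <- codetools::findGlobals(f, merge = FALSE)
globals$variables   # global variables read by f
globals$functions   # functions called by f
```

Like all.vars, this is static analysis, so names used through NSE can still show up as apparent globals.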

An alternative to detecting which variables are missing is to use an .() annotation like I did in my lambda package (https://github.com/jimhester/lambda/blob/e493d10459eb356fabe66bb4bc935854cc3f173b/R/package.R)

This would let you write something like

x <- data.frame(num = 1:10, alpha = letters[1:10])

reprex({ subset(.(x), num > 5) })

Although that is mostly a syntactic convenience, I think it would actually be best to just list them explicitly with vals = as suggested.

reprex({ subset(x, num > 5) }, vals = x)

A simple implementation:

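reprex2 <- function(x, vals, ...) {
  # Remember the caller's frame, where the listed variables live
  env <- parent.frame()
  # Deparse the unevaluated reprex code into character lines
  x <- deparse(substitute(x))

  if (!missing(vals)) {
    # Names of the variables listed in vals, e.g. c(x, y, z)
    nms <- all.vars(substitute(vals))
    # Build "name <- <dput output>" for each variable
    exprs <- lapply(nms,  function(x_)
        paste(x_, "<-", paste(collapse = "\n",
            capture.output(dput(get(x_, envir = env))))))
    # Put the definitions ahead of the reprex code
    x <- c(unlist(exprs), x)
  }
  con <- textConnection(x)
  on.exit(close(con))

  reprex(infile = con, ...)
}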

Example usage, with the generated code shown after the call:

x <- 1:100
y <- data.frame(num = 1:10, alpha = letters[1:10])
z <- mtcars
reprex2({a <- 1}, vals = c(x, y, z))
x <- 1:100
y <- structure(list(num = 1:10, alpha = structure(1:10, .Label = c("a", 
"b", "c", "d", "e", "f", "g", "h", "i", "j"), class = "factor")), .Names = c("num", 
"alpha"), row.names = c(NA, -10L), class = "data.frame")
z <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 
24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 
30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 
19.7, 15, 21.4), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 
8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4), 
    disp = c(160, 160, 108, 258, 360, 225, 360, 146.7, 140.8, 
    167.6, 167.6, 275.8, 275.8, 275.8, 472, 460, 440, 78.7, 75.7, 
    71.1, 120.1, 318, 304, 350, 400, 79, 120.3, 95.1, 351, 145, 
    301, 121), hp = c(110, 110, 93, 110, 175, 105, 245, 62, 95, 
    123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65, 97, 150, 
    150, 245, 175, 66, 91, 113, 264, 175, 335, 109), drat = c(3.9, 
    3.9, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 
    3.07, 3.07, 3.07, 2.93, 3, 3.23, 4.08, 4.93, 4.22, 3.7, 2.76, 
    3.15, 3.73, 3.08, 4.08, 4.43, 3.77, 4.22, 3.62, 3.54, 4.11
    ), wt = c(2.62, 2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 
    3.15, 3.44, 3.44, 4.07, 3.73, 3.78, 5.25, 5.424, 5.345, 2.2, 
    1.615, 1.835, 2.465, 3.52, 3.435, 3.84, 3.845, 1.935, 2.14, 
    1.513, 3.17, 2.77, 3.57, 2.78), qsec = c(16.46, 17.02, 18.61, 
    19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3, 18.9, 17.4, 17.6, 
    18, 17.98, 17.82, 17.42, 19.47, 18.52, 19.9, 20.01, 16.87, 
    17.3, 15.41, 17.05, 18.9, 16.7, 16.9, 14.5, 15.5, 14.6, 18.6
    ), vs = c(0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 
    0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1), am = c(1, 
    1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 
    0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1), gear = c(4, 4, 4, 3, 
    3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 
    3, 3, 4, 5, 5, 5, 5, 5, 4), carb = c(4, 4, 1, 1, 2, 1, 4, 
    2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 
    2, 2, 4, 6, 8, 2)), .Names = c("mpg", "cyl", "disp", "hp", 
"drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4", 
"Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", 
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280", 
"Merc 280C", "Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood", 
"Lincoln Continental", "Chrysler Imperial", "Fiat 128", "Honda Civic", 
"Toyota Corolla", "Toyota Corona", "Dodge Challenger", "AMC Javelin", 
"Camaro Z28", "Pontiac Firebird", "Fiat X1-9", "Porsche 914-2", 
"Lotus Europa", "Ford Pantera L", "Ferrari Dino", "Maserati Bora", 
"Volvo 142E"), class = "data.frame")
a <- 1

jennybc commented 9 years ago

Thanks @jimhester for these constructive ideas. I will come back to this, once my teaching slows down ...

jennybc commented 7 years ago

Having used this for >1 year, I think the current modus operandi, though simple, works pretty well. We could reopen this later but I'm closing for now. Not a near-term goal.