Closed dgrtwo closed 7 years ago
Can't it somehow automatically deduce that y is a variable whose definition has to be included in the reprex?
@romainfrancois That's a terrific idea. Unfortunately I can't think of an approach that would do that, at least for longer and more detailed reproducible examples (outside of something silly like "run the reprex again and again, and each time it gets an exception for a variable not found, add that variable").
Do you have any ideas? Maybe something like creating a lazy version of the user's environment, evaluating the expression, and seeing which variables were retrieved?
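One way the "lazy environment" idea could look is sketched below using base R active bindings. This is only an illustration under assumptions: track_accessed is a hypothetical helper, not part of reprex, and it only reports variables that are actually read from the given environment while the expression runs.

```r
# Hypothetical helper: evaluate `expr` against a shadow copy of `env`
# and report which of its variables were read.
track_accessed <- function(expr, env = globalenv()) {
  accessed <- character()
  # Shadow environment with the same parent as `env`, so functions still resolve.
  shadow <- new.env(parent = parent.env(env))
  for (nm in ls(env)) {
    local({
      name <- nm
      # Active binding: runs on every read of `name` and records the access.
      makeActiveBinding(name, function() {
        accessed <<- union(accessed, name)
        get(name, envir = env, inherits = FALSE)
      }, shadow)
    })
  }
  eval(expr, shadow)
  accessed
}

user_env <- new.env()
assign("x", 1:3, envir = user_env)
assign("y", 10, envir = user_env)
assign("unused", "never read", envir = user_env)

# Reports the variables that were read (x and y), but not `unused`.
track_accessed(quote(sum(x) + y), user_env)
```

The cost is that the code has to run once just to find its inputs, and anything on a branch that is not executed goes unnoticed.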
In a way, it shares some properties with what we do in dplyr's hybrid evaluation.
It's a lot of massaging the expression internally, but it's not the most bulletproof part of it, and indeed it gets harder with examples beyond hello world.
Especially consider reprexes that contain dplyr or ggplot2 code with NSE. Names that look like undefined variables may actually be column names, which don't need to be defined.
I think a robust solution would have to involve running the code and not just inspecting it.
Yay this!
The discovery and reverse engineering of required objects seems really cool ... and hard. I think the main function should remain fairly dumb on this point, to keep things simple. At the risk of being paternalistic, I also think it's healthy to make the user keep editing the code until they've created a proper reprex.
I'd say any efforts to save user from him/herself should go in a distinct function.
@mrdwab, from the overflow package, reached out in #9. This thread and the sodput() function got me thinking. I could imagine having a mechanism for the local code to include something like reprex_dput(<R code to create an object, not to be shared with the world>) and only the dput() command would appear in the result. Does that have value?
Hmm, I think writing the code to create an object is the actual pain point. The problem is not so much that you don't want to share that code as that you don't want to write it. Many users already have a variable x or y or whatever, have run into a problem, and want to share those variables along with it.
I do think someone could include in the reprex code
reprex_vars(x, y, z)
and then it would do something similar to my define_vars function above. (However, you'd have to be smart about allowing reprex_vars to work with the reprexer's environment, but for everything else to work in a clean environment. See #11.)
@dgrtwo Actually finding the dependencies automatically isn’t that hard: at least for objects it’s a simple matter of iterating through the parse tree of the expression passed to reprex and collecting all variables. And R already has a function to do this, all.vars (it’s a tad more complicated than that, of course).
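For readers who haven't used it, here is a small illustration of what all.vars returns for an unevaluated expression. The expression is made up for the example.

```r
# all.vars() walks an unevaluated expression and returns the variable names in it.
expr <- quote({
  z <- y + 2
  mean(z)
})

all.vars(expr)                    # variable names only
all.vars(expr, functions = TRUE)  # also includes function names such as `mean`
```

Note that z is reported even though the expression defines it itself, so a real implementation would presumably have to drop names that are assigned before they are used.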
This doesn’t work for all objects though because not all can be correctly reproduced via dput. In addition, this would require collecting used, attached packages; used R options; and hidden global variables (such as the random number seed). In fact, I think the cache option in knitr has to do something similar and the respective code could probably be stolen.
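To make the dput limitation concrete, here is a minimal sketch that round-trips an object through its deparsed text (the same text dput prints) and checks whether the result is identical. round_trips is a hypothetical helper written for this illustration.

```r
# Hypothetical helper: deparse an object, re-evaluate the text,
# and check whether the rebuilt object is identical to the original.
round_trips <- function(obj) {
  code <- deparse(obj)
  rebuilt <- tryCatch(eval(parse(text = code)), error = function(e) NULL)
  identical(rebuilt, obj)
}

df <- data.frame(num = 1:3, alpha = c("a", "b", "c"))
round_trips(df)         # a plain data frame survives the round trip
round_trips(new.env())  # an environment deparses to "<environment>", which does not parse
```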
However, I understand @jennybc’s concerns. Still, this would be interesting as a separate function/package.
@klmr But that approach wouldn't work for non-standard evaluation, as in ggplot2's aes or dplyr (or data.table, etc). We could fix some of it by saying "use only vars that are present in the caller's environment," though that could still lead to false positives if a column name matches an environment variable.
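A minimal sketch of the "use only vars that are present in the caller's environment" filter, including the false positive it can produce. vars_in_env is a hypothetical name used only for this illustration.

```r
# Hypothetical helper: keep only the names in `expr` that exist in `env`.
vars_in_env <- function(expr, env = parent.frame()) {
  candidates <- all.vars(expr)
  candidates[vapply(candidates, exists, logical(1), envir = env, inherits = FALSE)]
}

user_env <- new.env()
assign("x", data.frame(num = 1:10), envir = user_env)
assign("num", 42, envir = user_env)  # unrelated variable that shares a column name

# Both "x" and "num" are kept, although subset() would use the column `num`.
vars_in_env(quote(subset(x, num > 5)), user_env)
```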
(User-defined functions would also be missed, though that's not a big deal)
I hadn’t considered NSE, true. Functions aren’t a problem: all.vars can handle them.
Ah, I had been thinking about how if we included functions it would also try to include functions from packages. But now that I think of it, for each function name in the caller's environment, we could get it and see whether it's attached from a package.
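One possible way to do that check in base R is sketched below. is_package_function is a hypothetical helper, and treating "lives in a namespace or is a primitive" as "comes from a package" is an assumption of this sketch.

```r
# Hypothetical helper: TRUE if the function bound to `name` comes from a package
# (or is a primitive), FALSE if it was defined by the user.
is_package_function <- function(name, env = parent.frame()) {
  f <- get(name, envir = env, mode = "function")
  is.primitive(f) || isNamespace(environment(f))
}

my_helper <- function(v) v + 1

is_package_function("mean")       # function from the base package
is_package_function("sum")        # primitive
is_package_function("my_helper")  # user-defined, would need to be included
```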
I'm coming around to the idea of auto-finding the variables, though I think we'd need to [...] ~ and aes, that commonly contain NSE. dplyr is harder (we could include a list of dplyr's NSE functions, but it's a bit odd to be so specific), and data.table would be a real drag.
reprex({y + 2}, vars = list(y))
Made some progress on parsing code, and figuring out from it what variables need to be dputted and what packages need to be loaded. I think we can cover about 95% of reproducible examples this way.
Posting it here for now, more work later.
(Also set up the start of a package here to handle the logic of parsing expressions into a data frame.)
For detecting global variables in an expression you may want to look at codetools::findGlobals() and the pryr::f implementation, see https://github.com/hadley/pryr/blob/87bec0fede7da6f42a0c65b424f3ba33743fb8fd/R/f.r#L21-L23.
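As a rough illustration of the findGlobals() route: it analyses a closure, so the sketch below wraps an expression in a throwaway function first. globals_of is a hypothetical helper, and the codetools package is assumed to be installed.

```r
# Requires the codetools package (a recommended package shipped with R).
# Hypothetical helper: wrap an expression in a function so findGlobals() can analyse it.
globals_of <- function(expr) {
  f <- function() NULL
  body(f) <- expr
  codetools::findGlobals(f, merge = FALSE)
}

g <- globals_of(quote({
  a <- 1   # local assignment, so `a` is not reported as a global
  a + y
}))
g$variables  # global variables the code depends on
g$functions  # global functions the code calls
```

Unlike a plain all.vars() call, this separates functions from variables and skips names assigned inside the expression.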
An alternative idea to detecting what variables are missing is to use an .() annotation like I did in my lambda package (https://github.com/jimhester/lambda/blob/e493d10459eb356fabe66bb4bc935854cc3f173b/R/package.R).
This would let you write something like
x <- data.frame(num = 1:10, alpha = letters[1:10])
reprex({ subset(.(x), num > 5) })
Although that is mostly a syntactic convenience, I think it would actually be best to just list them explicitly with vals = as suggested.
reprex({ subset(x, num > 5) }, vals = x)
Simple implementation is
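reprex2 <- function(x, vals, ...) {
  # Capture the caller's environment up front; calling parent.frame()
  # inside lapply() would point at lapply's frame instead.
  env <- parent.frame()
  x <- deparse(substitute(x))
  if (!missing(vals)) {
    # Names of the variables listed in vals, e.g. c(x, y, z)
    nms <- all.vars(substitute(vals))
    # One "name <- <dput output>" definition per variable
    exprs <- lapply(nms, function(x_)
      paste(x_, "<-", paste(collapse = "\n",
        capture.output(dput(get(x_, envir = env))))))
    x <- c(unlist(exprs), x)
  }
  con <- textConnection(x)
  on.exit(close(con))
  reprex(infile = con, ...)
}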
And example usage
x <- 1:100
y <- data.frame(num = 1:10, alpha = letters[1:10])
z <- mtcars
reprex2({a <- 1}, vals = c(x, y, z))
x <- 1:100
y <- structure(list(num = 1:10, alpha = structure(1:10, .Label = c("a",
"b", "c", "d", "e", "f", "g", "h", "i", "j"), class = "factor")), .Names = c("num",
"alpha"), row.names = c(NA, -10L), class = "data.frame")
z <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4,
30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8,
19.7, 15, 21.4), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8,
8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4),
disp = c(160, 160, 108, 258, 360, 225, 360, 146.7, 140.8,
167.6, 167.6, 275.8, 275.8, 275.8, 472, 460, 440, 78.7, 75.7,
71.1, 120.1, 318, 304, 350, 400, 79, 120.3, 95.1, 351, 145,
301, 121), hp = c(110, 110, 93, 110, 175, 105, 245, 62, 95,
123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65, 97, 150,
150, 245, 175, 66, 91, 113, 264, 175, 335, 109), drat = c(3.9,
3.9, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,
3.07, 3.07, 3.07, 2.93, 3, 3.23, 4.08, 4.93, 4.22, 3.7, 2.76,
3.15, 3.73, 3.08, 4.08, 4.43, 3.77, 4.22, 3.62, 3.54, 4.11
), wt = c(2.62, 2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19,
3.15, 3.44, 3.44, 4.07, 3.73, 3.78, 5.25, 5.424, 5.345, 2.2,
1.615, 1.835, 2.465, 3.52, 3.435, 3.84, 3.845, 1.935, 2.14,
1.513, 3.17, 2.77, 3.57, 2.78), qsec = c(16.46, 17.02, 18.61,
19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3, 18.9, 17.4, 17.6,
18, 17.98, 17.82, 17.42, 19.47, 18.52, 19.9, 20.01, 16.87,
17.3, 15.41, 17.05, 18.9, 16.7, 16.9, 14.5, 15.5, 14.6, 18.6
), vs = c(0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1), am = c(1,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1), gear = c(4, 4, 4, 3,
3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3,
3, 3, 4, 5, 5, 5, 5, 5, 4), carb = c(4, 4, 1, 1, 2, 1, 4,
2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1,
2, 2, 4, 6, 8, 2)), .Names = c("mpg", "cyl", "disp", "hp",
"drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
"Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280",
"Merc 280C", "Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood",
"Lincoln Continental", "Chrysler Imperial", "Fiat 128", "Honda Civic",
"Toyota Corolla", "Toyota Corona", "Dodge Challenger", "AMC Javelin",
"Camaro Z28", "Pontiac Firebird", "Fiat X1-9", "Porsche 914-2",
"Lotus Europa", "Ford Pantera L", "Ferrari Dino", "Maserati Bora",
"Volvo 142E"), class = "data.frame")
a <- 1
Thanks @jimhester for these constructive ideas. I will come back to this, once my teaching slows down ...
Having used this for >1 year, I think the current modus operandi, though simple, works pretty well. We could reopen this later but I'm closing for now. Not a near-term goal.
One common thing users forget to do in a reproducible example is to share the variables they've already defined (for example, datasets they're working with). You can turn a variable into a reproducible chunk of code with dput.
If we already have a few variables defined, it would be good to be able to add them to the start of the reprex code automatically. Here's a hello-world version using lazy_eval:
This takes one or more variables as its arguments and turns them into code that defines those variables. For example:
In the end version of reprex, I imagine having a vars function, which creates an argument suitable for passing in:
Or just wrap with a list (though that takes a little expression-parsing internally to get the names out):
If you like this feature idea I'll add it to the next PR.
(Fun fact; after I finished this example I used reprex to share it. Works like a charm!)