Closed cfhammill closed 4 years ago
Thanks. The `.data` pronoun is an object with an array of S3 methods, see `eval-tidy.R` in rlang. I don't see a `[` method, though, and the order of column names doesn't seem to be preserved currently.
@lionel-: Would it be possible to keep track of the order of the names in an `rlang_data_pronoun`?
Probably not easy? The dplyr `.data` pronoun is created as an environment in dplyr IIRC.
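For context, a minimal base R sketch of why an environment-backed pronoun would lose column order. The plain environment here is only a stand-in (an assumption for illustration), not the actual dplyr data mask, which may use active bindings or other machinery.

```r
# A named list keeps insertion order; an environment does not promise any order.
cols <- list(b = 2, a = 1)
env <- list2env(cols)  # stand-in for an environment-backed data mask (assumption)

names(cols)  # the list reports names in insertion order
ls(env)      # ls() sorts alphabetically by default
names(env)   # same set of names, but the order is not guaranteed

# The names agree as a set, even when the order differs
setequal(names(env), names(cols))
```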
I'm seeing an `"rlang_data_pronoun"`:
```r
library(tidyverse)
tibble(a = 1, b = 2) %>% transmute(c = list(.data)) %>% pull() %>% dput()
#> list(structure(list(src = <environment>, lookup_msg = "Column `%s` not found in `.data`",
#> read_only = TRUE), class = "rlang_data_pronoun"))
```
Created on 2018-08-02 by the reprex package (v0.2.0).
yes, the environment you see in there is created by dplyr
Via `new_data_mask()`? Can a data mask keep track of the order of column names? If not yet, would you consider it, so that e.g. `names()` returns the correct order and we can provide `as_tibble()`?
This would force dplyr to use data pronoun methods to modify the data environment. I'm not sure the extra complexity is reasonable?
Currently the data mask is built from scratch in each `mutate()` step. I think it's reasonable to use a documented API to create/modify a data structure private to rlang.
Is it? It seems like this will interact with plans to use delayed bindings or altenvs and so on.
Edit: The current active bindings implementation poses the same issues.
This issue is about the `.data` pronoun, which returns an rlang data structure, and always (?) will, even if we change to ALTENV or other techniques. (Can we use an ALTENV for the data mask?)
yeah and the pronoun wraps an environment in the dplyr case. Again, I'm not sure adding an API to the pronoun to keep track of variable ordering is worth the extra complexity in both rlang and dplyr code.
Perhaps dplyr should derive from `rlang_data_pronoun` and provide a `names()` method that would pick up the current names with the correct ordering?
I think tracking the names would solve my trivial example, but not fix chained `pmap`. A `data_mask` to `tibble` method would solve my problem; that way the internals can stay the same and a user just needs to cast if they want to use `.data` as a `tibble`.
I still think this is primarily a documentation problem: `?eval_tidy` claims `.data` refers to the `data` argument, which a user coming from `dplyr` would have no way of knowing is a `data_mask`, not a `tibble`. Also, if you're using `.data` in `dplyr` there is no way of knowing it is documented in `rlang::eval_tidy`. So there's a discoverability problem.
@lionel-: A subclass defined by dplyr would be one way to handle it, I keep wondering if ordered data masks are a thing elsewhere.
@cfhammill: For the "data mask to tibble" method we need to keep track of the ordering. Always happy to review documentation updates ;-)
Ordered data masks wrapping environments would mean we need to think about what happens when the ordering vector gets out of sync with the environment bindings etc. It doesn't feel right to me. Creating data masks with environments is pretty advanced so it seems ok to let the author implement a names method if they are already keeping track of a particular ordering.
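To make the "author implements a names method" idea concrete, here is a rough base R sketch. The class `ordered_pronoun` and its constructor are hypothetical names invented for illustration; this is not rlang's or dplyr's implementation, and it does not subclass `rlang_data_pronoun`.

```r
# Hypothetical constructor: wrap an environment and remember the column order.
new_ordered_pronoun <- function(data) {
  stopifnot(is.list(data), !is.null(names(data)))
  structure(
    list(src = list2env(as.list(data)), vars = names(data)),
    class = "ordered_pronoun"
  )
}

# names() reports the tracked ordering, not the environment's own ordering.
names.ordered_pronoun <- function(x) .subset2(x, "vars")

# `$` and `[[` look the variable up in the wrapped environment.
`$.ordered_pronoun` <- function(x, name) {
  get(name, envir = .subset2(x, "src"), inherits = FALSE)
}
`[[.ordered_pronoun` <- function(x, i, ...) {
  get(i, envir = .subset2(x, "src"), inherits = FALSE)
}

# Cast back to an ordered list, which can then become a data frame.
as.list.ordered_pronoun <- function(x, ...) {
  mget(names(x), envir = .subset2(x, "src"))
}

p <- new_ordered_pronoun(data.frame(b = 2, a = 1))
names(p)                   # tracked order: b, then a
as.data.frame(as.list(p))  # columns come back in the tracked order
```

The sketch also shows the concern raised above: nothing stops code from adding a binding to `src` directly, at which point `vars` is silently out of sync with the environment.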
@krlmlr, I think I know where in the `programming` vignette I should add it. And maybe add a `.data` object doc page.
@cfhammill It seems we can fix the ordering problem so no mention in the documentation would be necessary. Or perhaps about `pmap()`? This problem will be fixed in time as well with the vctrs infrastructure but it might still be half a year ahead.
I'd say this use of `.data` is not supported - its contract is to provide `$` and `[[` methods. This is all we ever use in examples.
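A small sketch of that contract, using `rlang::eval_tidy()` directly rather than a dplyr verb. The data frame `df` and the variable `var` are made up for illustration.

```r
library(rlang)

df <- data.frame(a = 1, b = 2)  # illustrative data, not from the thread

# `$` and `[[` are the supported ways to pull a column out of `.data`
total <- eval_tidy(quo(.data$a + .data[["b"]]), data = df)

# `[[` also accepts a column name stored in a variable
var <- "b"
picked <- eval_tidy(quo(.data[[var]]), data = df)
```

Anything beyond these two accessors, such as `names(.data)` or treating `.data` as a data frame, falls outside what the pronoun promises.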
Should methods like `names()` and `length()` throw an error?
> if you're using `.data` in `dplyr` there is no way of knowing it is documented in `rlang::eval_tidy`. So there's a discoverability problem.
I think the intended documentation point is `?dplyr::.data`, though it's not mentioned there. The documentation lives in `?rlang::.data`.
@hadley: That's totally reasonable, provided it is documented somewhere. I worry it does somewhat violate the principle of least surprise that `.data` is not a frame-like object, given data arguments are almost always frames in tidyverse packages. Also, multiple incremental `pmap`s in a single mutate would be handy and simplify row-wise workflows.
@lionel-: I didn't see `.data` in `rlang`, good to know it is there. Given that is the case, maybe if the programming vignette points to `?rlang::.data` then everything is fine. I think the `eval_tidy` page that @krlmlr suggests has more detail than the `.data` doc. But these are simple things that I could PR if the consensus is that this is a good idea.
@cfhammill I'm currently working on the tidyeval doc so I think it's easier if I do it myself. Thanks for pointing us to that issue and for proposing your help!
even better, thanks @lionel-!
Remainder of issue now at https://github.com/r-lib/rlang/issues/892
I encountered an interesting feature in `dplyr`/`tibble` which more or less boils down to me completely misunderstanding the `.data` pronoun.

```r
data_frame(a) %>% mutate(b = names(.data), c = names(.data)[2])
```

yields

when I had expected

To investigate, I added a print statement and noticed that `.data` had "column" `b` before column `a`.

This made me realize that `.data` is not in fact the current state of the `tibble`, but an `environment` containing the same data. This also explains why incremental `purrr::pmap` with `.data` doesn't work inside a `mutate`.

Is this documented somewhere? If not, could it be?

Also, lower priority, but could a pronoun that does offer the current state of the `tibble` be offered? Chaining `pmap`s inside a `mutate` would be handy.