tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

Documentation request: .data pronoun #3734

Closed cfhammill closed 4 years ago

cfhammill commented 6 years ago

I encountered an interesting feature in dplyr/tibble which more or less boils down to me completely misunderstanding the .data pronoun.

data_frame(a = 5) %>% mutate(b = names(.data), c = names(.data)[2])

yields

# A tibble: 1 x 3
      a b     c    
  <dbl> <chr> <chr>
1     5 a     a    

when I had expected

# A tibble: 1 x 3
      a b     c    
  <dbl> <chr> <chr>
1     5 a     b    

To investigate, I added a print statement and noticed that .data had "column" b before column a.

This made me realize that .data is not in fact the current state of the tibble, but an environment containing the same data. This also explains why incremental purrr::pmap with .data doesn't work inside a mutate.

Is this documented somewhere? If not, could it be?

Also, lower priority, but could a pronoun that does offer the current state of the tibble be offered? Chaining pmap calls inside a mutate would be handy.

krlmlr commented 6 years ago

Thanks. The .data pronoun is an object with an array of S3 methods, see eval-tidy.R in rlang. I don't see a [ method, though, and the order of column names doesn't seem to be preserved currently.

@lionel-: Would it be possible to keep track of the order of the names in an rlang_data_pronoun?

lionel- commented 6 years ago

Probably not easy? The dplyr .data pronoun is created as an environment in dplyr IIRC.

krlmlr commented 6 years ago

I'm seeing an "rlang_data_pronoun":

library(tidyverse)
tibble(a = 1, b = 2) %>% transmute(c = list(.data)) %>% pull() %>% dput()
#> list(structure(list(src = <environment>, lookup_msg = "Column `%s` not found in `.data`", 
#>     read_only = TRUE), class = "rlang_data_pronoun"))

Created on 2018-08-02 by the reprex package (v0.2.0).

lionel- commented 6 years ago

yes, the environment you see in there is created by dplyr
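
For readers following along, here is a small base R sketch of why name order can go astray once columns live in an environment. The assumption, taken from the comments above, is that dplyr's mask at the time stored columns as bindings in a plain environment; `mask_env` is an illustrative name, not anything from dplyr or rlang.

```r
# Columns stored as bindings in an environment, loosely mimicking a data mask.
mask_env <- new.env()
assign("a", 5, envir = mask_env)
assign("b", "a", envir = mask_env)

# names() on an environment lists its bindings, but the order is not
# guaranteed to match insertion order (or column order).
names(mask_env)

# A data frame keeps column order as part of the object itself.
names(data.frame(a = 5, b = "a"))
```

With only two bindings the order may happen to look right; the point is that nothing in the environment guarantees it.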

krlmlr commented 6 years ago

Via new_data_mask()? Can a data mask keep track of the order of column names? If not yet, would you consider it, so that e.g. names() returns the correct order and we can provide as_tibble()?

lionel- commented 6 years ago

This would force dplyr to use data pronoun methods to modify the data environment. I'm not sure the extra complexity is reasonable?

krlmlr commented 6 years ago

Currently the data mask is built from scratch in each mutate() step. I think it's reasonable to use a documented API to create/modify a data structure private to rlang?

lionel- commented 6 years ago

Is it? It seems like this will interact with plans to use delayed bindings or altenvs and so on.

Edit: The current active bindings implementation poses the same issues.

krlmlr commented 6 years ago

This issue is about the .data pronoun, which returns an rlang data structure, and always (?) will, even if we change to ALTENV or other techniques. (Can we use an ALTENV for the data mask?)

lionel- commented 6 years ago

yeah and the pronoun wraps an environment in the dplyr case. Again, I'm not sure adding an API to the pronoun to keep track of variable ordering is worth the extra complexity in both rlang and dplyr code.

lionel- commented 6 years ago

Perhaps dplyr should derive from rlang_data_pronoun and provide a names() method that would pick up the current names with the correct ordering?

cfhammill commented 6 years ago

I think tracking the names would solve my trivial example, but not fix chained pmap. A data_mask to tibble method would solve my problem, that way the internals can stay the same and a user just needs to cast if they want to use .data as a tibble.

I still think this is primarily a documentation problem. ?eval_tidy claims .data refers to the data argument, which a user coming from dplyr would have no way of knowing is a data_mask, not a tibble. Also, if you're using .data in dplyr there is no way of knowing it is documented in rlang::eval_tidy. So there's a discoverability problem.

krlmlr commented 6 years ago

@lionel-: A subclass defined by dplyr would be one way to handle it. I keep wondering if ordered data masks are a thing elsewhere.

@cfhammill: For the "data mask to tibble" method we need to keep track of the ordering. Always happy to review documentation updates ;-)

lionel- commented 6 years ago

Ordered data masks wrapping environments would mean we need to think about what happens when the ordering vector gets out of sync with the environment bindings etc. It doesn't feel right to me. Creating data masks with environments is pretty advanced so it seems ok to let the author implement a names method if they are already keeping track of a particular ordering.
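
The trade-off discussed here can be sketched in base R. This is a hypothetical illustration only: `new_ordered_mask()` and the `ordered_mask` class are made up for the example and are not part of rlang or dplyr. It pairs an environment with an ordering vector and a names() method, then shows how the two can drift apart.

```r
# Hypothetical constructor: an environment of columns plus a separate ordering vector.
new_ordered_mask <- function(data) {
  structure(
    list(env = list2env(as.list(data)), order = names(data)),
    class = "ordered_mask"
  )
}

# names() method that reports the tracked ordering, not the environment's own.
names.ordered_mask <- function(x) .subset2(x, "order")

mask <- new_ordered_mask(data.frame(z = 1, a = 2))
names(mask)

# Adding a binding directly to the environment bypasses the ordering vector,
# so names() no longer reflects everything in the mask.
assign("b", 3, envir = .subset2(mask, "env"))
names(mask)
sort(ls(.subset2(mask, "env")))
```

Keeping the two in sync would require every modification to go through an accessor, which is the extra complexity being weighed in this thread.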

cfhammill commented 6 years ago

@krlmlr, I think I know where in the programming vignette I should add it. And maybe add a .data object doc page

lionel- commented 6 years ago

@cfhammill It seems we can fix the ordering problem so no mention in the documentation would be necessary. Or perhaps about pmap()? This problem will be fixed in time as well with the vctrs infrastructure but it might still be half a year ahead.

hadley commented 6 years ago

I'd say this use of .data is not supported - its contract is to provide $ and [[ methods. This is all we ever use in examples.
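
A minimal sketch of the two supported access patterns, assuming dplyr is installed; the data and the `col` variable are illustrative.

```r
library(dplyr)

df <- tibble(a = 5)
col <- "a"  # column name held in a string

out <- df %>%
  mutate(
    b = .data$a * 2,       # `$` with a fixed column name
    c = .data[[col]] + 1   # `[[` with a name stored in a variable
  )
```

Anything beyond these two, such as names(.data) or length(.data), falls outside the contract described above.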

lionel- commented 6 years ago

Should methods like names() and length() throw an error?

lionel- commented 6 years ago

> if you're using .data in dplyr there is no way of knowing it is documented in rlang::eval_tidy. So there's a discoverability problem.

I think the intended documentation point is ?dplyr::.data, though it's not mentioned there. The documentation lives in ?rlang::.data.

cfhammill commented 6 years ago

@hadley: That's totally reasonable, provided it is documented somewhere. I worry it does somewhat violate the principle of least surprise that .data is not a frame-like object, given data arguments are almost always frames in tidyverse packages. Also, multiple incremental pmap calls in a single mutate would be handy and would simplify row-wise workflows.

@lionel-: I didn't see .data in rlang, good to know it is there. Given that is the case, maybe if the programming vignette points to ?rlang::.data then everything is fine. The eval_tidy page that @krlmlr suggests appears to have more detail than the .data doc. But these are simple things that I could PR if the consensus is that this is a good idea.

lionel- commented 6 years ago

@cfhammill I'm currently working on the tidyeval doc so I think it's easier if I do it myself. Thanks for pointing us to that issue and for proposing your help!

cfhammill commented 6 years ago

even better, thanks @lionel-!

hadley commented 4 years ago

Remainder of issue now at https://github.com/r-lib/rlang/issues/892