tidyverse / purrr

A functional programming toolkit for R
https://purrr.tidyverse.org/

map() call in dplyr::mutate() error while standalone map() call works #552

Closed leungi closed 6 years ago

leungi commented 6 years ago

Similar to issue #541, but now on map(); reprex below.

library(tidyverse)

## try 1: unexpected error
mtcars %>% 
  group_by(cyl) %>% 
  nest() -> nest_data

nest_data %>% 
  mutate(data = map(data, ~.x %>% 
                      mutate(row_id = row_number())))
#> Error in mutate_impl(.data, dots): Evaluation error: Column `row_id` must be length 7 (the number of rows) or one, not 3.

##  try 2: custom function
nest_data %>% 
  mutate(data = map(data, function(x) {x %>% mutate(row_id = row_number())}))
#> Error in mutate_impl(.data, dots): Evaluation error: Column `row_id` must be length 7 (the number of rows) or one, not 3.

##  try 3: map_dfr() standalone works, though missing "cyl" group variables, but can be joined back
map_dfr(nest_data$data, function(x) {x %>% mutate(row_id = row_number())})

#> # A tibble: 32 x 11
#>      mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb row_id
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <int>
#>  1  21    160    110  3.9   2.62  16.5     0     1     4     4      1
#>  2  21    160    110  3.9   2.88  17.0     0     1     4     4      2
#>  3  21.4  258    110  3.08  3.22  19.4     1     0     3     1      3
#>  4  18.1  225    105  2.76  3.46  20.2     1     0     3     1      4
#>  5  19.2  168.   123  3.92  3.44  18.3     1     0     4     4      5
#>  6  17.8  168.   123  3.92  3.44  18.9     1     0     4     4      6
#>  7  19.7  145    175  3.62  2.77  15.5     0     1     5     6      7
#>  8  22.8  108     93  3.85  2.32  18.6     1     1     4     1      1
#>  9  24.4  147.    62  3.69  3.19  20       1     0     4     2      2
#> 10  22.8  141.    95  3.92  3.15  22.9     1     0     4     2      3
#> # ... with 22 more rows

@cderv, I tried your previous solution of explicitly defining a function, but no luck.

cderv commented 6 years ago

This is a corner case caused by how row_number() is evaluated. You get an error because row_number() is evaluated as if it belonged to the outer mutate() applied to nest_data, which is indeed 3 rows long, rather than to the inner mutate() applied to each nested data frame (7 rows for the first one). Either this is on purpose, or something is off in tidy evaluation and its scoping.

Either way, there are currently workarounds that achieve what you want.
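The two lengths in the error message can be checked directly. The snippet below is only a small sketch that rebuilds the nested data from the original post and compares the row count of the outer data frame with the row counts of the nested ones; the names outer_rows and inner_rows are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

nest_data <- mtcars %>%
  group_by(cyl) %>%
  nest()

# The outer data frame has one row per cyl group.
outer_rows <- nrow(nest_data)

# Each nested data frame has its own, different number of rows.
inner_rows <- map_int(nest_data$data, nrow)

outer_rows        # 3
sort(inner_rows)  # 7 11 14
```

The "length 7 ... not 3" in the error matches the first nested data frame (7 rows) receiving a row_number() result computed for the 3-row outer data frame.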

library(tidyverse)

nest_data <- mtcars %>% 
  group_by(cyl) %>% 
  nest()

Using a named function defined beforehand works as expected.

add_row_number <- function(x) {x %>% mutate(row_id = row_number())}
nest_data %>% 
  mutate(data = map(data, add_row_number)) %>%
  {.[[1,2]]}
#> # A tibble: 7 x 11
#>     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb row_id
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <int>
#> 1  21    160    110  3.9   2.62  16.5     0     1     4     4      1
#> 2  21    160    110  3.9   2.88  17.0     0     1     4     4      2
#> 3  21.4  258    110  3.08  3.22  19.4     1     0     3     1      3
#> 4  18.1  225    105  2.76  3.46  20.2     1     0     3     1      4
#> 5  19.2  168.   123  3.92  3.44  18.3     1     0     4     4      5
#> 6  17.8  168.   123  3.92  3.44  18.9     1     0     4     4      6
#> 7  19.7  145    175  3.62  2.77  15.5     0     1     5     6      7

In your example you defined an anonymous function inline. That is different: tidyeval has to deal with the whole anonymous function call, including row_number(), whereas here it only deals with the name of the already defined function and then evaluates it. As a result, row_number() is evaluated in the correct context.

Not using row_number() is also an option. Here rownames_to_column() can be used because, when no row names are defined, the row numbers are used as row names. Note that row_id is then a character column.
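The named-function workaround also covers the concern from the original post that map_dfr() drops the cyl column. The following is a minimal sketch, assuming a tidyr version where unnest(data) accepts the list-column name; with_ids is a name made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

nest_data <- mtcars %>%
  group_by(cyl) %>%
  nest()

# Named helper, defined outside the mutate() call as in the workaround above.
add_row_number <- function(x) {
  x %>% mutate(row_id = row_number())
}

with_ids <- nest_data %>%
  mutate(data = map(data, add_row_number)) %>%
  unnest(data) %>%  # expand the list-column back into rows; cyl is kept
  ungroup()

# row_id restarts at 1 within each cyl group, so its maximum equals the group size.
with_ids %>%
  group_by(cyl) %>%
  summarise(n = n(), max_row_id = max(row_id))
```

Unnesting keeps cyl next to the per-group row ids, so no join is needed afterwards.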

nest_data %>% 
  mutate(data = map(data, ~.x %>% rownames_to_column("row_id"))) %>%
  {.[[1,2]]}
#> # A tibble: 7 x 11
#>   row_id   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1       21    160    110  3.9   2.62  16.5     0     1     4     4
#> 2 2       21    160    110  3.9   2.88  17.0     0     1     4     4
#> 3 3       21.4  258    110  3.08  3.22  19.4     1     0     3     1
#> 4 4       18.1  225    105  2.76  3.46  20.2     1     0     3     1
#> 5 5       19.2  168.   123  3.92  3.44  18.3     1     0     4     4
#> 6 6       17.8  168.   123  3.92  3.44  18.9     1     0     4     4
#> 7 7       19.7  145    175  3.62  2.77  15.5     0     1     5     6
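If an integer row_id is preferred over the character column produced by rownames_to_column(), another way to avoid row_number() is to count rows with base R. This is only a sketch: it runs on current dplyr, but it has not been checked against the dplyr 0.7 release that produced the error above, so treat its usefulness as a workaround there as an assumption.

```r
library(dplyr)
library(tidyr)
library(purrr)

nest_data <- mtcars %>%
  group_by(cyl) %>%
  nest()

# seq_len(nrow(.x)) counts the rows of each nested data frame directly,
# so the result does not depend on how row_number() is resolved.
res <- nest_data %>%
  mutate(data = map(data, ~ mutate(.x, row_id = seq_len(nrow(.x)))))

first_group <- res$data[[1]]
class(first_group$row_id)  # "integer"
```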

Created on 2018-09-20 by the reprex package (v0.2.0).

I think there may be something to fix here, but it requires digging into the tidyeval mechanism inside dplyr, which is not trivial for me, especially since I think the problem is somewhere in mutate()'s C++ code.

lionel- commented 6 years ago

I think this is fixed with dev dplyr thanks to @romainfrancois's new hybrid-eval implementation.

cderv commented 6 years ago

Oh I forgot about that. Indeed it is working!

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
library(tidyr)
packageVersion("dplyr")
#> [1] '0.7.99.9000'

mtcars %>% 
  group_by(cyl) %>% 
  nest() -> nest_data

nest_data %>% 
  mutate(data = map(data, ~.x %>% 
                      mutate(row_id = row_number())))
#> # A tibble: 3 x 2
#>     cyl data              
#>   <dbl> <list>            
#> 1     6 <tibble [7 × 11]> 
#> 2     4 <tibble [11 × 11]>
#> 3     8 <tibble [14 × 11]>

Created on 2018-09-20 by the reprex package (v0.2.0).
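To see whether an installed dplyr is at least as recent as the dev version used in the reprex above, the version can be compared directly. This is a rough check that assumes every version at or above 0.7.99.9000 contains the new hybrid-eval implementation; has_fix is a name made up for illustration.

```r
# Assumption: the fix is present in every dplyr version at or above the
# dev version shown in the reprex (0.7.99.9000).
has_fix <- packageVersion("dplyr") >= "0.7.99.9000"

if (has_fix) {
  message("dplyr is at or above 0.7.99.9000; the nested mutate() should work.")
} else {
  message("Older dplyr; consider one of the workarounds discussed in this thread.")
}
```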

leungi commented 6 years ago

Thanks for the prompt reply and solution @cderv @lionel- !

Just got a chance to update dplyr to the dev version, and it is working as described.

As usual, learnt another new tip from the greats :+1: