tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org

multidplyr doesn't like unnest_longer() #133

Open xabriel opened 2 years ago

xabriel commented 2 years ago

tidyr's unnest_longer() fails on a partitioned data frame with what seems to be a red-herring error message.

library(tidyverse)
library(multidplyr)

data <-
  tibble(
    list_col = list(list("a", "b"), list("c", "d")),
    int = c(1, 1)
  )

# unnest_longer works as expected w/o multidplyr
data %>%
  unnest_longer(col = list_col, values_to = "unlisted")
#> # A tibble: 4 × 2
#>   unlisted   int
#>   <chr>    <dbl>
#> 1 a            1
#> 2 b            1
#> 3 c            1
#> 4 d            1

cluster <- new_cluster(2)
data_partitioned <- data %>%
  partition(cluster)

# multidplyr fails
data_partitioned %>%
  unnest_longer(col = list_col, values_to = "unlisted")
#> Error: object 'list_col' not found

rlang::last_trace()
#> <error/rlang_error>
#> object 'list_col' not found
#> Backtrace:
#>     █
#>  1. ├─data_partitioned %>% unnest_longer(col = list_col, values_to = "unlisted")
#>  2. ├─tidyr::unnest_longer(., col = list_col, values_to = "unlisted")
#>  3. │ └─tidyselect::vars_pull(names(data), !!enquo(col))
#>  4. │   ├─tidyselect:::instrument_base_errors(...)
#>  5. │   │ └─base::withCallingHandlers(...)
#>  6. │   └─rlang::eval_tidy(enquo(var), set_names(seq_along(vars), vars))
#>  7. └─base::.handleSimpleError(...)
#>  8.   └─tidyselect:::h(simpleError(msg, call))
#> <error/simpleError>
#> object 'list_col' not found

Created on 2022-03-02 by the reprex package (v2.0.1)

xabriel commented 2 years ago

I work around this by calling collect() before unnest_longer(), then calling partition(cluster) again when I need the parallelism.
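A minimal sketch of that workaround, reusing the toy data from the reprex above. The variable names `unnested` and `unnested_partitioned` are made up for illustration, and row order after collect() is not guaranteed to match the original:

```r
library(dplyr)
library(tidyr)
library(multidplyr)

# Same toy data as in the original report
data <- tibble(
  list_col = list(list("a", "b"), list("c", "d")),
  int = c(1, 1)
)

cluster <- new_cluster(2)
data_partitioned <- data %>%
  partition(cluster)

# Bring the data back to the main process, then unnest locally
unnested <- data_partitioned %>%
  collect() %>%
  unnest_longer(col = list_col, values_to = "unlisted")

# Re-partition only if later steps need the parallelism
unnested_partitioned <- unnested %>%
  partition(cluster)
```

The round trip copies the data between the workers and the main process twice, so it is probably only worth re-partitioning when the later steps are expensive enough to pay for that.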

Here is a repro from my project using multidplyr that shows the workaround for this issue as well as #132:

https://github.com/xabriel/wikipedia-vandalism/blob/e60c1925ffe8ebac98919e809046e4a77fd85f5c/wikipedia-vandalism.R#L378-L425

hadley commented 11 months ago

Thanks for the suggestion! Will definitely consider it when I'm next working on multidplyr.