tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org

Bug in `.by` argument of `mutate()` after selecting columns #439

Closed: UchidaMizuki closed this issue 1 year ago

UchidaMizuki commented 1 year ago

An error occurs when `.by` is specified in `mutate()` after selecting columns with `select()`. In the following example, the expected output is a data frame with 3 columns (Species, Sepal.Length, Sepal.Length_mean).

The second reprex without select() does not cause an error.

library(tidyverse)
library(dtplyr)

iris %>%
  lazy_dt() %>%
  select(Species, Sepal.Length) %>%
  mutate(Sepal.Length_mean = mean(Sepal.Length),
         .by = Species) %>%
  collect() %>%
  head()
#> Error in `as_tibble()`:
#> ! Column name `Species` must not be duplicated.
#> Use `.name_repair` to specify repair.
#> Caused by error in `repaired_names()`:
#> ! Names must be unique.
#> ✖ These names are duplicated:
#>   * "Species" at locations 1 and 2.
#> Backtrace:
#>      ▆
#>   1. ├─... %>% head()
#>   2. ├─utils::head(.)
#>   3. ├─dplyr::collect(.)
#>   4. └─dtplyr:::collect.dtplyr_step(.)
#>   5.   ├─tibble::as_tibble(x)
#>   6.   └─dtplyr:::as_tibble.dtplyr_step(x)
#>   7.     ├─tibble::as_tibble(dt_eval(x), .name_repair = .name_repair)
#>   8.     └─tibble:::as_tibble.data.frame(dt_eval(x), .name_repair = .name_repair)
#>   9.       └─tibble:::lst_to_tibble(unclass(x), .rows, .name_repair)
#>  10.         └─tibble:::set_repaired_names(...)
#>  11.           └─tibble:::repaired_names(...)
#>  12.             ├─tibble:::subclass_name_repair_errors(...)
#>  13.             │ └─base::withCallingHandlers(...)
#>  14.             └─vctrs::vec_as_names(...)
#>  15.               └─vctrs (local) `<fn>`()
#>  16.                 └─vctrs:::validate_unique(names = names, arg = arg, call = call)
#>  17.                   └─vctrs:::stop_names_must_be_unique(names, arg, call = call)
#>  18.                     └─vctrs:::stop_names(...)
#>  19.                       └─vctrs:::stop_vctrs(...)
#>  20.                         └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)

Created on 2023-07-27 with reprex v2.0.2

library(tidyverse)
library(dtplyr)

iris %>%
  lazy_dt() %>%
  mutate(Sepal.Length_mean = mean(Sepal.Length),
         .by = Species) %>%
  collect() %>%
  head()
#> # A tibble: 6 × 6
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_mean
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>               <dbl>
#> 1          5.1         3.5          1.4         0.2 setosa               5.01
#> 2          4.9         3            1.4         0.2 setosa               5.01
#> 3          4.7         3.2          1.3         0.2 setosa               5.01
#> 4          4.6         3.1          1.5         0.2 setosa               5.01
#> 5          5           3.6          1.4         0.2 setosa               5.01
#> 6          5.4         3.9          1.7         0.4 setosa               5.01

Created on 2023-07-27 with reprex v2.0.2

markfairbanks commented 1 year ago

Looks like a `keyby` argument is getting added to the `select()` step of the generated data.table call for some reason, so `Species` ends up in the result twice.

library(dplyr)
library(dtplyr)

iris %>%
  lazy_dt() %>%
  select(Species, Sepal.Length) %>%
  mutate(Sepal.Length_mean = mean(Sepal.Length),
         .by = Species)
#> Source: local data table [150 x 4]
#> Call:   `_DT1`[, .(Species, Sepal.Length), keyby = .(Species)][, `:=`(Sepal.Length_mean = mean(Sepal.Length)), 
#>     by = .(Species)]
#> 
#>   Species Species Sepal.Length Sepal.Length_mean
#>   <fct>   <fct>          <dbl>             <dbl>
#> 1 setosa  setosa           5.1              5.01
#> 2 setosa  setosa           4.9              5.01
#> 3 setosa  setosa           4.7              5.01
#> 4 setosa  setosa           4.6              5.01
#> 5 setosa  setosa           5                5.01
#> 6 setosa  setosa           5.4              5.01
#> # ℹ 144 more rows
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
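
The duplicated column can be reproduced in plain data.table, independent of dtplyr. The sketch below is only an illustration: `buggy` mirrors the first step of the call printed above, while `fixed` is one plausible form of the intended translation (the variable names are made up here, and the call generated by a patched dtplyr may look different).

```r
library(data.table)

dt <- as.data.table(iris)

# First step of the query printed above: `keyby` on a plain column selection.
# data.table puts the grouping column first and then the columns listed in
# `j`, so `Species` appears twice.
buggy <- dt[, .(Species, Sepal.Length), keyby = .(Species)]
names(buggy)

# Assumed intended form: select without grouping, then a grouped `:=`.
fixed <- dt[, .(Species, Sepal.Length)][
  , Sepal.Length_mean := mean(Sepal.Length), by = .(Species)
]
names(fixed)
```

The duplicate name is harmless inside data.table, which allows repeated column names, but `as_tibble()` rejects it, which is why the error only surfaces at `collect()`.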

markfairbanks commented 1 year ago

All fixed - thanks for catching this.

# Install dev version
# pak::pak("tidyverse/dtplyr")

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

iris %>%
  lazy_dt() %>%
  select(Species, Sepal.Length) %>%
  mutate(Sepal.Length_mean = mean(Sepal.Length),
         .by = Species) %>%
  collect() %>%
  head()
#> # A tibble: 6 × 3
#>   Species Sepal.Length Sepal.Length_mean
#>   <fct>          <dbl>             <dbl>
#> 1 setosa           5.1              5.01
#> 2 setosa           4.9              5.01
#> 3 setosa           4.7              5.01
#> 4 setosa           4.6              5.01
#> 5 setosa           5                5.01
#> 6 setosa           5.4              5.01
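
To check whether an installed dtplyr version is still affected, a minimal sketch along these lines should work. It assumes a dplyr version that supports `.by`; the names `query` and `cols` are only illustrative. `as.data.frame()` is used instead of `collect()` so that the check itself does not fail on duplicated column names.

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

query <- iris %>%
  lazy_dt() %>%
  select(Species, Sepal.Length) %>%
  mutate(Sepal.Length_mean = mean(Sepal.Length), .by = Species)

# Print the generated data.table call; a `keyby` on the select step
# indicates the behaviour reported in this issue.
show_query(query)

# anyDuplicated() returns 0 when no column name is repeated.
cols <- names(as.data.frame(query))
anyDuplicated(cols)
```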